Bug #9614

PG stuck with remapped

Added by Guang Yang over 9 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
firefly
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In our pre-production cluster, we observed that the cluster starts backfilling even with the OSD noout flag set when an OSD daemon goes down.

 cluster ee14bc5e-5dad-4b1b-bb72-9a462497ad90
     health HEALTH_WARN 72 pgs backfilling; 248 pgs degraded; 393 pgs stuck unclean; recovery 1642621/5156878710 objects degraded (0.032%); 6/465 in osds are down; noout flag(s) set
     monmap e2: 3 mons at {mon01c008=10.214.140.208:6789/0,mon02c008=10.214.140.80:6789/0,mon03c008=10.214.141.16:6789/0}, election epoch 8, quorum 0,1,2 mon02c008,mon01c008,mon03c008
     osdmap e3325: 465 osds: 459 up, 465 in
            flags noout
      pgmap v2167202: 14848 pgs, 8 pools, 869 TB data, 447 Mobjects
            1235 TB used, 1297 TB / 2532 TB avail
            1642621/5156878710 objects degraded (0.032%)
               14453 active+clean
                   2 active+clean+scrubbing+deep
                 248 active+degraded
                  73 active+remapped
                  72 active+remapped+backfilling
  client io 19259 B/s rd, 5294 kB/s wr, 55 op/s

Health detail (partial):

pg 3.1f41 is active+remapped+backfilling, acting [333,217,324,280,348,2147483647,31,363,354,208,329]
pg 3.1f0e is active+remapped+backfilling, acting [80,370,224,414,103,131,2147483647,219,323,196,249]
pg 3.1ee7 is active+remapped+backfilling, acting [246,2147483647,335,2147483647,104,349,414,243,361,234,34]
pg 3.1ec1 is active+remapped+backfilling, acting [376,455,2147483647,338,365,218,387,209,242,222,424]
pg 3.1ea0 is active+remapped+backfilling, acting [303,144,204,7,216,174,377,172,2147483647,112,202]
pg 3.1e87 is active+remapped+backfilling, acting [0,2147483647,221,294,1,415,152,153,297,147,214]
pg 3.1e11 is active+remapped+backfilling, acting [310,15,297,2147483647,350,292,464,426,48,77,304]
pg 3.1baa is active+remapped+backfilling, acting [209,230,456,128,300,439,269,354,365,2147483647,379]

The observation is that for the replicated pool the PGs can be marked active+degraded, which is correct; however, for the EC pool the PGs are marked active+remapped+backfilling, which is not the intention since it triggers massive data migration.
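
For reference when reading the acting sets above: 2147483647 is not a real OSD id. A minimal sketch of where the number comes from, assuming Ceph's hole marker CRUSH_ITEM_NONE (0x7fffffff in crush/crush.h):

    // Illustration only: why a hole in a CRUSH mapping prints as 2147483647.
    #include <cstdint>
    #include <iostream>

    int main() {
      // A position CRUSH could not fill is marked with CRUSH_ITEM_NONE
      // (0x7fffffff); dumped as a plain int it reads 2147483647 (INT32_MAX).
      const int32_t kCrushItemNone = 0x7fffffff;
      std::cout << kCrushItemNone << "\n";  // prints 2147483647
      return 0;
    }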

Ceph version:
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)

crush_plain (27.7 KB) Guang Yang, 09/29/2014 01:53 AM

ec_profile (107 Bytes) Guang Yang, 09/29/2014 01:53 AM

osd_dump (111 KB) Guang Yang, 09/29/2014 01:53 AM


Related issues

Copied to Ceph - Backport #12011: PG stuck with remapped Resolved 09/28/2014

History

#1 Updated by Guang Yang over 9 years ago

Another observation, from the pg dump result for such a PG:

3.1ad6  57327   0       0       0       117006534603    3001    3001    active+clean    2014-09-21 01:16:24.807222      3318'76861      3431:178145     [335,136,367,2147483647,325,458,15,295,452,52,443]      335     [335,136,367,2147483647,325,458,15,295,452,52,443]      335     523'2   2014-09-02 12:45:41.785393      0'0     2014-09-01 11:53:41.955553

Even though there is a hole, the PG is marked as active+clean.

#2 Updated by Loïc Dachary over 9 years ago

pg 3.1ee7 is active+remapped+backfilling, acting [246,2147483647,335,2147483647,104,349,414,243,361,234,34]

The 2147483647 here shows that the mapping failed. Is this something you expect?

#3 Updated by Guang Yang over 9 years ago

Loic Dachary wrote:

[...]
The 2147483647 here shows that the mapping failed. Is this something you expect?

As there is no OSD out, I would expect active+degraded for such a PG; did I miss anything? It seems we need something like this: https://github.com/ceph/ceph/pull/2592 ?

#4 Updated by Loïc Dachary over 9 years ago

  • Assignee set to Loïc Dachary
  • Priority changed from High to Urgent

#5 Updated by Guang Yang over 9 years ago

Attaching CRUSH / EC profile / OSD dump.

#6 Updated by Guang Yang over 9 years ago

Guang Yang wrote:

Another observation, from the pg dump result for such a PG:
[...]

Even though there is a hole, the PG is marked as active+clean.

This issue could be fixed by PR https://github.com/ceph/ceph/pull/2592.

#7 Updated by Guang Yang over 9 years ago

There are still two issues:
  1. Some PGs are stuck in active+remapped forever (for both the replicated pool and the EC pool); see the sketch following this list.
    ceph pg 5.3f6 query | head -n 13
    { "state": "active+remapped",
      "epoch": 3465,
      "up": [
            237,
            254],
      "acting": [
            237,
            254,
            127],
      "actingbackfill": [
            "127",
            "237",
            "254"],
    

    -bash-4.1$ sudo ceph pg 3.22a query | head -n 40
    { "state": "active+remapped",
      "epoch": 3465,
      "up": [
            33,
            384,
            2147483647,
            170,
            224,
            372,
            125,
            189,
            352,
            2147483647,
            289],
      "acting": [
            33,
            384,
            2147483647,
            170,
            224,
            372,
            125,
            189,
            352,
            118,
            289],
      "actingbackfill": [
            "33(0)",
            "118(9)",
            "125(6)",
            "170(3)",
            "189(7)",
            "224(4)",
            "289(10)",
            "352(8)",
            "372(5)",
            "384(1)"],
    
  2. After the PG is marked as active+remapped, it starts backfilling even though there is no OSD out.
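
A minimal sketch of the relationship between the two states above (illustration only, not Ceph's PG code; shards are reduced to plain ints and the helper names are made up): a PG reads remapped while its acting set, possibly overridden by pg_temp, differs from the up set CRUSH computed, and a hole (2147483647) in the acting set is what should make it degraded.

    // Simplified model of the states seen in the pg queries above.
    #include <algorithm>
    #include <vector>

    static const int CRUSH_ITEM_NONE = 0x7fffffff;  // prints as 2147483647

    // "remapped": the acting set (e.g. via a pg_temp override) differs
    // from the up set that CRUSH computed for this PG.
    bool is_remapped(const std::vector<int>& up, const std::vector<int>& acting) {
      return up != acting;
    }

    // "degraded": at least one shard position has no OSD assigned.
    bool is_degraded(const std::vector<int>& acting) {
      return std::find(acting.begin(), acting.end(), CRUSH_ITEM_NONE) != acting.end();
    }

Under this simplified model, pg 5.3f6 is remapped because its acting set carries a third member (127) that the two-member up set lacks, and pg 3.22a because position 9 is a hole in up but 118 in acting.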

#8 Updated by Guang Yang over 9 years ago

  • Subject changed from Start backfilling with noout flag set to PG stuck with remapped

#9 Updated by Loïc Dachary over 9 years ago

Could you attach the full output of pg query 3.1ee7, please? The ceph osd tree would also help to get an idea of why the mapping fails (you may want to anonymize host names). It's not directly related, but I'm looking for inspiration ;-) It would also help to have the full osdmaps (as many as possible). If you can't post them publicly for some reason, could you mail them to ?

#10 Updated by Loïc Dachary over 9 years ago

  • Status changed from New to 12
<loicd> sjusthm: I thought http://tracker.ceph.com/issues/9614 was urgent but I don't fully understand what should happen in this case. What do you think ? 
<sjusthm> loicd: which case?
<sjusthm> active+clean with a hole is obviously wrong
<sjusthm> the logic checking on the size of acting is probably using acting.size() instead of checking for valid acting osds
<sjusthm> that one should be easy to find
<sjusthm> I don't know why it is backfilling with noout
<sjusthm> that seems odd
<loicd> active+remapped+backfilling is expected when an OSD is ITEM_NONE ? 
<loicd> assuming it's not noout I mean
<sjusthm> those two things are not necessarily related
<sjusthm> ITEM_NONE means that the acting set has a hole there
<sjusthm> that might be because crush is trying to place a valid osd there
<sjusthm> but the osd needs to be backfilled
<sjusthm> so the primary requested a pg_temp mapping with that osd removed while it backfills
<sjusthm> that much is normal
<loicd> ok
<sjusthm> the odd part is that if the osd is marked down but in, crush should not have tried to replace it, it should have ITEM_NONE
<sjusthm> which should mean no backfill
<sjusthm> that is, ITEM_NONE in the up set
<loicd> that gives me enough to keep digging sjusthm, thank you :-)
<sjusthm> we will only backfill osd X for position Y if the up set (the crush output) has osd X in position Y
<sjusthm> during the backfill, we might have an *acting* set with ITEM_NONE for position Y since osd X is not ready yet
<sjusthm> due to the primary setting a pg_temp
<sjusthm> but the up set would still have osd X in position Y
<loicd> http://tracker.ceph.com/issues/9614#note-7 shows the up set has two item_none 
<sjusthm> yeah, some of that is just due to the crush hierarchy being dumb
<sjusthm> but he said that they had an osd go down and cause backfill with no out set
<sjusthm> that's odd
<sjusthm> it means a new osd got mapped in the dead osd's spot without it having being set out
<sjusthm> so that's problem 2
<sjusthm> and the active+clean with two ITEM_NONE's is problem 1
<sjusthm> probably completely distinct
<loicd> I don't see how the crush hierarchy is dumb ? It looks to me that it has enough hosts and enough osd per host to satisfy the ruleset. What am I not seeing ? 
<sjusthm> oh, that should be fine then
<loicd> ~50 hosts
<loicd> ~500 osds, 10 per host
<sjusthm> I guess he expected active+remapped due to noout then
<loicd> yes
<loicd> I'll keep digging then. 
<sjusthm> crush in indep mode might be incorrectly mapping new osds
<loicd> that would be a surprise
<sjusthm> for there to be backfill, the up set has to change
<sjusthm> to include a new osd
<sjusthm> and that shouldn't be possible with noout
<sjusthm> *shrug*
<sjusthm> loicd: oh, unless a new osd came up
<sjusthm> or there was another crush change
<sjusthm> could totally happen then
-*- loicd thinking
<sjusthm> loicd: you'll want to get an OSDMap
<sjusthm> then try to manipulate it into the initial position
<sjusthm> then mark an osd down
<sjusthm> and see if you can get the up set to change
<sjusthm> (initial position being one with no pg_temp and all osds up+in)
<sjusthm> oh, and remapped can be ok as well
<sjusthm> if the primary happens to know about an osd which can fill in for a missing shard without backfill, it'll just include it in the pg_temp
<loicd> it's going to be interesting
<sjusthm> loicd: I think the active+clean bug is a real one
<sjusthm> the other one can be explained by another crush hierarchy change at the same time
<sjusthm> if they can reproduce, you can get OSDMaps from before and after
<sjusthm> actually, they might still have them
<sjusthm> have them dump all of the osdmaps they have
<sjusthm> that'll tell you exactly what was going on
<sjusthm> loicd: oh, and it should be active+degraded+remapped, that'll be the same bug as the active+clean one
<loicd> https://github.com/ceph/ceph/pull/2592 looks like a good reason why it is not degraded
<loicd> sjusthm: don't you think ? 
<sjusthm> loicd: yeah, but as sage pointed out, I think there are a bunch of them
<loicd> ok
<sjusthm> there are a bunch of checks that probably use acting.size() when they should probably just use actingset
<sjusthm> which is the set<pg_shard_t> version of acting
<sjusthm> oh, some of the checks use actingset, that's good
<sjusthm> you probably want to audit uses of acting\.size9)
<sjusthm> *acting\.size()
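
As a rough sketch of the audit suggested at the end of the log (illustration only, not the real PG code; the real actingset is a set<pg_shard_t>, reduced here to ints): a check based on acting.size() counts a CRUSH_ITEM_NONE hole as a live shard, which is how a PG with holes can end up reported clean, whereas counting only valid entries does not.

    // Illustrative only: the class of check the log suggests auditing.
    #include <set>
    #include <vector>

    static const int CRUSH_ITEM_NONE = 0x7fffffff;  // hole marker (2147483647)

    // Buggy pattern: holes in the acting vector are counted as shards,
    // so a PG missing shards can still look fully sized (hence "clean").
    bool have_enough_shards_buggy(const std::vector<int>& acting, unsigned pool_size) {
      return acting.size() >= pool_size;
    }

    // Safer pattern: build the equivalent of "actingset" from valid
    // entries only, so holes reduce the count and the PG reads degraded.
    bool have_enough_shards_fixed(const std::vector<int>& acting, unsigned pool_size) {
      std::set<int> actingset;
      for (int osd : acting)
        if (osd != CRUSH_ITEM_NONE)
          actingset.insert(osd);
      return actingset.size() >= pool_size;
    }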

#11 Updated by Loïc Dachary over 9 years ago

  • Assignee changed from Loïc Dachary to Guang Yang

It looks like you are on the right track :-)

#12 Updated by Guang Yang over 9 years ago

Thanks Loic for following up.

After talking to other engineers, the backfilling seems to be due to an OSD being removed from the CRUSH map (for testing purposes), which I was not aware of. So backfilling should be expected in that case.

The only problem left seems to be: why are the PGs stuck at active+remapped? If I understand correctly, they should be active+degraded in such a case.

#13 Updated by Samuel Just over 9 years ago

  • Status changed from 12 to 7

#14 Updated by Guang Yang over 9 years ago

The original fix was not clean; I just added a new pull request: https://github.com/ceph/ceph/pull/2711

#15 Updated by Samuel Just over 9 years ago

  • Status changed from 7 to Pending Backport

#16 Updated by Nathan Cutler almost 9 years ago

  • Backport set to firefly
  • Regression set to No

#18 Updated by Nathan Cutler over 8 years ago

  • Status changed from Pending Backport to Resolved
