Bug #10411

PG stuck incomplete after failed node

Added by Brian Rak over 9 years ago. Updated about 7 years ago.

Status: Can't reproduce
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Yesterday, I was in the process of expanding the number of PGs in one of our pools. While I was doing this, one of the disks in an OSD failed (probably due to the high load on the cluster at that point). I removed this OSD from the cluster and let it rebuild; however, I ended up with 2 PGs stuck down and peering.

This is the relevant 'ceph health detail' output:

pg 3.44c is stuck inactive since forever, current state down+peering, last acting [51,85]
pg 14.441 is stuck inactive since forever, current state down+peering, last acting [51,85]
pg 3.44c is stuck unclean since forever, current state down+peering, last acting [51,85]
pg 14.441 is stuck unclean since forever, current state down+peering, last acting [51,85]
pg 14.441 is down+peering, acting [51,85]
pg 3.44c is down+peering, acting [51,85]

I can't seem to figure out how to correct this. I've tried the following (see the command sketch after the list):

  • Running 'ceph osd out' on both acting OSDs (51 and 85), then putting them back in
  • ceph pg repair 3.44c
  • Restarting both OSDs (51, 85)
  • Restarting every OSD in the cluster
  • The patch from #10250 (I only installed this on the two relevant OSDs, did this need to be deployed cluster-wide?)
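
For reference, those attempts correspond roughly to the commands below. This is only a sketch: the OSD ids come from the 'last acting [51,85]' set above, and the 'service ceph' form assumes the sysvinit packaging that shipped around v0.87 (on systemd hosts the equivalent would be 'systemctl restart ceph-osd@<id>').

# ceph osd out 51
# ceph osd out 85
# ceph osd in 51
# ceph osd in 85
# ceph pg repair 3.44c
# service ceph restart osd.51
# service ceph restart osd.85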

I've attached the debug log from one of the OSDs, passed through '| grep 3.44c'.

Aside from the two nodes I upgraded, the rest of the cluster is v0.87.

I can provide additional information if necessary; however, I do not really want to post any information about the IP addresses of our nodes on a public bug tracker.

I'm on IRC as 'devicenull' if that would be of any help in debugging this.

3.44c (3.12 MB) Brian Rak, 12/22/2014 07:52 AM

query (5.45 MB) Brian Rak, 12/23/2014 08:20 AM

History

#1 Updated by Brian Rak over 9 years ago

I also managed to get the output from 'ceph pg 3.44c query'. It's quite long.

#2 Updated by Brian Rak over 9 years ago

So, this actually started causing some major issues for the cluster. It seems Ceph kept trying to heal these PGs, which would cause the OSDs they were on to stop responding to any other requests. This kept blocking reads and writes from other pools.

I ended up having to stop the OSDs that were hosting these PGs, which has mostly isolated the issue.
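
Stopping an OSD without having the cluster mark it out and start rebalancing looks roughly like this (a sketch only; the osd ids are the acting set from this ticket, and the service invocation assumes sysvinit, as above):

# ceph osd set noout
# service ceph stop osd.51
# service ceph stop osd.85

The flag should be cleared with 'ceph osd unset noout' once the OSDs are running again.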

#3 Updated by Brian Rak about 9 years ago

For lack of a better thing to try:

# ceph health detail
HEALTH_WARN 2 pgs incomplete; 2 pgs stuck inactive; 2 pgs stuck unclean; 37 requests are blocked > 32 sec; 6 osds have slow requests; 1 near full osd(s); noout flag(s) set
pg 3.44c is stuck inactive since forever, current state incomplete, last acting [85,80]
pg 14.441 is stuck inactive since forever, current state incomplete, last acting [85,80]
pg 3.44c is stuck unclean since forever, current state incomplete, last acting [85,80]
pg 14.441 is stuck unclean since forever, current state incomplete, last acting [85,80]
pg 14.441 is incomplete, acting [85,80]
pg 3.44c is incomplete, acting [85,80]

# ceph pg force_create_pg 3.44c
pg 3.44c now creating, ok
# ceph pg force_create_pg 14.441
pg 14.441 now creating, ok

# ceph health detail
HEALTH_WARN 2 pgs incomplete; 2 pgs stuck inactive; 2 pgs stuck unclean; 54 requests are blocked > 32 sec; 5 osds have slow requests; 1 near full osd(s); noout flag(s) set
pg 3.44c is stuck inactive since forever, current state incomplete, last acting [85,80]
pg 14.441 is stuck inactive since forever, current state incomplete, last acting [85,80]
pg 3.44c is stuck unclean since forever, current state incomplete, last acting [85,80]
pg 14.441 is stuck unclean since forever, current state incomplete, last acting [85,80]
pg 14.441 is incomplete, acting [85,80]
pg 3.44c is incomplete, acting [85,80]
  • noout is set so I can stop 85 and 80 and have the rest of the cluster work properly. If I don't do this, the two broken PGs get reassigned to other OSDs, which breaks those OSDs as well.
  • OSDs 85 and 80 have no connectivity issues that I can see. I'm watching the traffic in tcpdump and see no packets getting lost or retransmitted, and they can ping each other just fine (a sketch of this check follows the list).
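
A rough sketch of that connectivity check, assuming 'ceph osd find' is available in this release to look up each OSD's host and address, and that the OSDs are bound to the usual 6800-7300 port range (the interface name and addresses below are placeholders):

# ceph osd find 85
# ceph osd find 80
# ping <address of the peer OSD>
# tcpdump -ni eth0 host <address of the peer OSD> and tcp portrange 6800-7300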

#4 Updated by Brian Rak about 9 years ago

I did manage to get rid of these PGs... I had a few other OSDs that had empty 3.44c_head and 14.441_head directories. I stopped those OSDs, removed the directories, and restarted them.
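
A sketch of how those leftover directories can be spotted on each OSD host (this assumes the FileStore layout quoted later in this ticket, /var/lib/ceph/osd/ceph-*/current/<pgid>_head; the glob simply errors out on hosts that hold neither PG):

# ls -ld /var/lib/ceph/osd/ceph-*/current/3.44c_head
# ls -ld /var/lib/ceph/osd/ceph-*/current/14.441_head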

Then I restarted 85 and 80, and did 'ceph pg force_create_pg 3.44c' again. The PG got stuck creating, so I restarted osd 85 again. This seems to have worked; the PGs are no longer listed in the health warnings.

Sorry I couldn't gather more info here, but this was causing significant disruption to our cluster.

#5 Updated by Samuel Just about 9 years ago

  • Status changed from New to Can't reproduce

#6 Updated by Brian Rak about 9 years ago

OK, so for future reference...

What additional information would you need to be able to reproduce this? AFAIK there's no good guide as to what information should be included in a bug report.

This was a pretty terrible bug for us (as not only was there data loss, but our entire cluster broke because of it)... having this closed as can't reproduce is pretty sad :(

#7 Updated by Jifeng Yin almost 9 years ago

Hi Brian,

I faced the same situation, and your solution eventually fixed our problem. Huge thanks to you!

I think the problem is that Ceph doesn't offer an easy way to deal with lost PGs.

There is a ticket (#10098) tracking this. Hope it gets implemented soon.

Thanks,

#8 Updated by Brian Rak about 8 years ago

Encountered this again, on an entirely different cluster. Same resolution steps worked:

  • ceph pg 5.3d2 query
  • SSH into all the probing OSDs
  • Look for '/var/lib/ceph/osd/ceph-*/current/5.3d2_head'
  • Shut down all the OSDs where that directory exists
  • Move that directory anywhere else (could rm too, moving is just for safety)
  • Restart the OSDs you shut down
  • ceph pg 5.3d2 query
    • Ceph should report 'Error ENOENT: i don't have pgid 5.3d2'
  • ceph pg force_create_pg 5.3d2
  • Restart the set of probing OSDs again

At this point, Ceph should recreate the PG within 30s or so, and you should be all set.
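
Condensed into commands, that procedure looks roughly like the following. This is a sketch, not an exact transcript: osd.12 and the backup path stand in for whichever probing OSDs actually hold the 5.3d2_head directory, and the 'service ceph' form again assumes sysvinit.

# ceph pg 5.3d2 query
# ls -ld /var/lib/ceph/osd/ceph-*/current/5.3d2_head
# service ceph stop osd.12
# mv /var/lib/ceph/osd/ceph-12/current/5.3d2_head /root/5.3d2_head.osd12
# service ceph start osd.12
# ceph pg 5.3d2 query
# ceph pg force_create_pg 5.3d2
# service ceph restart osd.12

The first query is how you find the probing OSDs; the ls/stop/mv/start steps are repeated on each host that has the directory, and the second query should return the ENOENT error quoted above before force_create_pg is issued.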

#9 Updated by Anonymous about 7 years ago

what about the data in pg 5.3d2...
