Bug #10411
PG stuck incomplete after failed node
Status: Closed
Description
Yesterday I was in the process of expanding the number of PGs in one of our pools. While I was doing this, one of the disks in an OSD failed (probably due to the high load on the cluster at that point). I removed this OSD from the cluster and let it rebuild, but I ended up with 2 PGs stuck down and peering.
This is the relevant 'ceph health detail' output:
pg 3.44c is stuck inactive since forever, current state down+peering, last acting [51,85]
pg 14.441 is stuck inactive since forever, current state down+peering, last acting [51,85]
pg 3.44c is stuck unclean since forever, current state down+peering, last acting [51,85]
pg 14.441 is stuck unclean since forever, current state down+peering, last acting [51,85]
pg 14.441 is down+peering, acting [51,85]
pg 3.44c is down+peering, acting [51,85]
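For what it's worth, the stock way to dig into why a PG is down+peering is to query it directly; this is just the standard CLI (output omitted here since it's long, but its recovery_state section should show which down OSDs peering is waiting on):

# inspect the PG's peering state (look for down_osds_we_would_probe)
ceph pg 3.44c query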
I can't seem to figure out how to correct this. I've tried the following (roughly the commands sketched after this list):
- Marking both acting OSDs out with 'ceph osd out', then putting them back in
- 'ceph pg repair 3.44c'
- Restarting both OSDs (51 and 85)
- Restarting every OSD in the cluster
- The patch from #10250 (I only installed this on the two relevant OSDs; did this need to be deployed cluster-wide?)
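For reference, the first three items were along these lines (OSD ids as above; the restart syntax assumes a sysvinit-style setup, so adjust for your init system):

# mark the acting OSDs out, let the cluster settle, then bring them back in
ceph osd out 51
ceph osd out 85
ceph osd in 51
ceph osd in 85
# ask the primary to repair the stuck PG
ceph pg repair 3.44c
# restart the two OSD daemons
service ceph restart osd.51
service ceph restart osd.85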
I've attached the debug log from one of the OSDs, filtered through 'grep 3.44c' (produced roughly as sketched below).
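The capture looked something like this (default log path assumed; the injectargs line is only needed if OSD debug logging isn't already turned up):

# raise OSD debug verbosity at runtime (optional)
ceph tell osd.51 injectargs '--debug-osd 20'
# pull out the lines for the stuck PG
grep '3\.44c' /var/log/ceph/ceph-osd.51.log > osd.51-pg3.44c.log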
Aside from the two nodes I upgraded, the rest of the cluster is v0.87
I can provide additional information if necessary; however, I'd rather not post our nodes' IP addresses on a public bug tracker.
I'm on IRC as 'devicenull' if that would be any help in debugging this.