Bug #10411
PG stuck incomplete after failed node
Status: Closed
Description
Yesterday I was in the process of expanding the number of PGs in one of our pools. While I was doing this, one of the disks in an OSD failed (probably due to the high load on the cluster at that point). I removed this OSD from the cluster and let it rebuild, but I ended up with 2 PGs stuck down and peering.
This is the relevant 'ceph health detail' output:
pg 3.44c is stuck inactive since forever, current state down+peering, last acting [51,85]
pg 14.441 is stuck inactive since forever, current state down+peering, last acting [51,85]
pg 3.44c is stuck unclean since forever, current state down+peering, last acting [51,85]
pg 14.441 is stuck unclean since forever, current state down+peering, last acting [51,85]
pg 14.441 is down+peering, acting [51,85]
pg 3.44c is down+peering, acting [51,85]
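For what it's worth, the stock way to dig into why a PG is down+peering is to query it directly; this is just the standard CLI (output omitted here since it's long, but its recovery_state section should show which down OSDs peering is waiting on):

# inspect the PG's peering state (look for down_osds_we_would_probe)
ceph pg 3.44c query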
I can't seem to figure out how to correct this. I've tried the following (roughly the commands sketched after this list):
- Marking both acting OSDs out with 'ceph osd out', then putting them back in
- 'ceph pg repair 3.44c'
- Restarting both OSDs (51 and 85)
- Restarting every OSD in the cluster
- The patch from #10250 (I only installed this on the two relevant OSDs; did this need to be deployed cluster-wide?)
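For reference, the first three items were along these lines (OSD ids as above; the restart syntax assumes a sysvinit-style setup, so adjust for your init system):

# mark the acting OSDs out, let the cluster settle, then bring them back in
ceph osd out 51
ceph osd out 85
ceph osd in 51
ceph osd in 85
# ask the primary to repair the stuck PG
ceph pg repair 3.44c
# restart the two OSD daemons
service ceph restart osd.51
service ceph restart osd.85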
I've attached the debug log from one of the OSDs, filtered through 'grep 3.44c' (produced roughly as sketched below).
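The capture looked something like this (default log path assumed; the injectargs line is only needed if OSD debug logging isn't already turned up):

# raise OSD debug verbosity at runtime (optional)
ceph tell osd.51 injectargs '--debug-osd 20'
# pull out the lines for the stuck PG
grep '3\.44c' /var/log/ceph/ceph-osd.51.log > osd.51-pg3.44c.log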
Aside from the two nodes I upgraded, the rest of the cluster is v0.87
I can provide additional information if necessary; however, I'd rather not post our nodes' IP addresses on a public bug tracker.
I'm on IRC as 'devicenull' if that would be any help in debugging this.