Bug #3905

incomplete & stale (lost?) PGs

Added by Faidon Liambotis about 11 years ago. Updated about 11 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I added a bunch of new OSDs into my Ceph cluster (0.56.1 on Ubuntu 12.04 LTS) about 72h ago. Simultaneously, I marked most of the old OSDs as "out", as I want to completely replace the hardware of my Ceph cluster.

Until today, the recovery process was running well. Then at some point, random OSDs started being marked down and up again; this may or may not be related to #3904, which was observed at the same time. Some of them were complaining about the op_tp heartbeat, whose timeout was set to 7200 and then increased to 28800. After a few hours (4-5) the cluster stabilized again, with all the OSDs marked up.

However, I now see 1 incomplete and 22 stale PGs:

2013-01-24 02:16:39.900827 mon.0 [INF] pgmap v1780945: 16952 pgs: 13 active, 7729 active+clean, 7050 active+remapped+wait_backfill, 79 active+degraded+wait_backfill, 3 peering, 866 active+remapped, 300 active+remapped+backfilling, 288 active+degraded, 6 active+degraded+backfilling, 517 active+degraded+remapped+wait_backfill, 39 stale+active+remapped, 7 active+recovery_wait+remapped, 4 remapped+peering, 1 incomplete, 27 active+degraded+remapped+backfilling, 7 stale+remapped+peering, 16 stale+active+degraded+remapped; 25005 GB data, 54874 GB used, 184 TB / 238 TB avail; 28083195/149909025 degraded (18.733%)
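For reference, the per-state counts in a status line like the one above can be tallied with a short script. The following is a minimal Python sketch and not part of the original report; the function names `parse_pg_states` and `count_flag` are illustrative, and it assumes the line follows the `N pgs: <count> <state>, ...;` layout shown above, with combined states joined by `+`.

```python
import re

# The pgmap summary quoted above, copied verbatim.
PGMAP_LINE = (
    "2013-01-24 02:16:39.900827 mon.0 [INF] pgmap v1780945: 16952 pgs: "
    "13 active, 7729 active+clean, 7050 active+remapped+wait_backfill, "
    "79 active+degraded+wait_backfill, 3 peering, 866 active+remapped, "
    "300 active+remapped+backfilling, 288 active+degraded, "
    "6 active+degraded+backfilling, 517 active+degraded+remapped+wait_backfill, "
    "39 stale+active+remapped, 7 active+recovery_wait+remapped, "
    "4 remapped+peering, 1 incomplete, 27 active+degraded+remapped+backfilling, "
    "7 stale+remapped+peering, 16 stale+active+degraded+remapped; "
    "25005 GB data, 54874 GB used, 184 TB / 238 TB avail; "
    "28083195/149909025 degraded (18.733%)"
)


def parse_pg_states(line):
    """Return {combined_state: count} from the 'N pgs: ...;' part of the line."""
    match = re.search(r"\d+ pgs: ([^;]+);", line)
    if match is None:
        raise ValueError("no PG state summary found in line")
    counts = {}
    for entry in match.group(1).split(","):
        number, state = entry.strip().split(" ", 1)
        counts[state] = int(number)
    return counts


def count_flag(counts, flag):
    """Sum the PGs whose combined state includes the given flag."""
    return sum(n for state, n in counts.items() if flag in state.split("+"))


if __name__ == "__main__":
    states = parse_pg_states(PGMAP_LINE)
    print("total PGs:", sum(states.values()))
    print("stale:", count_flag(states, "stale"))
    print("incomplete:", count_flag(states, "incomplete"))
```

Run on the line above, the per-state counts add up to the reported 16952 PGs, with 1 incomplete and 62 carrying a stale flag. The stale figure is higher than the 22 mentioned earlier, which may simply reflect snapshots taken at different moments.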

Attached are pg dump, osd dump, osd map, crushmap and pg query for the incomplete PG. Querying the stale PGs returns "pgid currently maps to no osd", which is a bit worrying...

Note there was no read or write traffic to the cluster during recovery and there is none now -- we've left it alone to quietly recover to the new hardware, but it seems it wasn't enough :)

pgquery-3.27d9 (33.9 KB) Faidon Liambotis, 01/23/2013 06:34 PM

pgdump (3.38 MB) Faidon Liambotis, 01/23/2013 06:34 PM

osddump (252 KB) Faidon Liambotis, 01/23/2013 06:34 PM

osdmap (354 KB) Faidon Liambotis, 01/23/2013 06:34 PM

crushmap (8.55 KB) Faidon Liambotis, 01/23/2013 06:34 PM

osdtree (3.79 KB) Faidon Liambotis, 01/24/2013 11:18 AM

History

#1 Updated by Faidon Liambotis about 11 years ago

#2 Updated by Sage Weil about 11 years ago

  • Priority changed from Normal to Urgent

#3 Updated by Greg Farnum about 11 years ago

Sounds like a combination of crush map and rules that aren't behaving well together: "incomplete" means the PG doesn't have enough OSDs to go active, and a PG not mapping to any OSDs points to a similar problem. Just a quick thought on where to look.
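As a rough illustration of the kind of check suggested here, the sketch below flags PGs that map to no OSD or to fewer OSDs than the replica count. It is only a sketch under stated assumptions: the function name `undermapped_pgs`, the PG ids, the OSD ids and the replica count of 3 are all hypothetical, and the mapping would in practice have to be extracted from the attached pg dump.

```python
def undermapped_pgs(pg_to_osds, replicas):
    """Return {pgid: reason} for PGs mapped to fewer OSDs than `replicas`."""
    problems = {}
    for pgid, osds in pg_to_osds.items():
        if not osds:
            problems[pgid] = "maps to no OSD"
        elif len(osds) < replicas:
            problems[pgid] = f"only {len(osds)} of {replicas} OSDs"
    return problems


if __name__ == "__main__":
    # Hypothetical mapping of PG id -> OSD ids, for illustration only.
    example = {
        "0.1": [4, 9, 17],  # fully mapped
        "0.2": [12],        # too few OSDs
        "0.3": [],          # no OSD at all
    }
    for pgid, reason in undermapped_pgs(example, replicas=3).items():
        print(pgid, reason)
```

A PG that shows up in such a report with an empty list would correspond to the "pgid currently maps to no osd" message, which is why the crush map and rules are the first place to look.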

#4 Updated by Faidon Liambotis about 11 years ago

#5 Updated by Faidon Liambotis about 11 years ago

Due to some other issues, and after a chat with Sage, I restarted all of my OSDs and the problem has since disappeared. So I'm afraid I don't have any more data to add to this report.

#6 Updated by Sage Weil about 11 years ago

  • Status changed from New to Can't reproduce

This appears to be something that was triggered and exacerbated by now-fixed issues. Until we can trigger it, I'm inclined to mark it as "Can't reproduce". Make sure to let us know if you see anything like it again!
