Bug #22561
PG stuck during recovery, requires OSD restart
Status: open
Description
We are sometimes encountering issues with PGs getting stuck in recovery.
For example, we ran some stress tests with bluestore and EC on our test cluster: we randomly swapped disks between servers and turned servers off during heavy load. This left the cluster in quite a bad state, and a few PGs got stuck in the recovery_wait state.
The affected PGs reported unfound objects and the recovery state looked like this:
"recovery_state": [ { "name": "Started/Primary/Active", "enter_time": "2018-01-04 00:31:01.884650", "might_have_unfound": [ { "osd": "5(2)", "status": "already probed" }, { "osd": "5(5)", "status": "already probed" }, { "osd": "9(3)", "status": "already probed" }, { "osd": "23(2)", "status": "already probed" }, { "osd": "26(5)", "status": "already probed" }, { "osd": "28(5)", "status": "already probed" }, { "osd": "31(1)", "status": "already probed" }, { "osd": "32(2)", "status": "querying" }, { "osd": "41(4)", "status": "already probed" }, { "osd": "51(1)", "status": "already probed" }, { "osd": "59(2)", "status": "already probed" }, { "osd": "71(5)", "status": "already probed" } ], "recovery_progress": { "backfill_targets": [], "waiting_on_backfill": [], "last_backfill_started": "MIN", "backfill_info": { "begin": "MIN", "end": "MIN", "objects": [] }, "peer_backfill_info": [], "backfills_in_flight": [], "recovering": [], "pg_backend": { "recovery_ops": [], "read_ops": [] } }, "scrub": { "scrubber.epoch_start": "0", "scrubber.active": false, "scrubber.state": "INACTIVE", "scrubber.start": "MIN", "scrubber.end": "MIN", "scrubber.subset_last_update": "0'0", "scrubber.deep": false, "scrubber.seed": 0, "scrubber.waiting_on": 0, "scrubber.waiting_on_whom": [] } }, { "name": "Started", "enter_time": "2018-01-04 00:30:16.663165" }
It never started recovery, and it apparently never tried to contact OSD 32 (which had the missing objects). We couldn't find anything useful in the logs, but unfortunately we didn't save them :/
The only way to get recovery started was to restart OSD 32 which had the unfound objects. It recovered fine after that.
This affected ~20 PGs out of ~5000.
We waited more than 30 minutes on one of the PGs and the state didn't change, even with the cluster completely idle.
We had to restart the OSDs it was trying to contact, not the primary of the PG.
We have also seen this happen in a production cluster, though during backfill and with no unfound objects. That occurred after a larger network outage.
ceph 12.2.2
Updated by Patrick Donnelly over 6 years ago
- Project changed from Ceph to RADOS
- Category deleted (OSD)
Updated by Josh Durgin over 6 years ago
Was OSD 32 running at the time? It sounds like correct behavior if OSD 32 was not reachable. It might have been marked down for some reason automatically, and stopped itself.
If you want to reproduce, logs with debug ms = 1 and debug osd = 20 would let us figure out what happened.
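(The debug settings mentioned above can be set persistently in ceph.conf, sketched below for an `[osd]` section; they can also be injected on a running OSD, e.g. `ceph tell osd.32 injectargs '--debug_ms 1 --debug_osd 20'`.)

```ini
# ceph.conf -- sketch of the debug levels suggested above;
# applies to all OSDs on the host (use [osd.32] to target one).
[osd]
debug ms = 1
debug osd = 20
```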
Updated by Paul Emmerich over 6 years ago
OSD 32 was running and actively serving client IO.
Updated by Paul Emmerich over 5 years ago
I've just encountered this again with about 20 OSDs being non-responsive like this. Restarting the OSDs in that state helped. Again, the other OSD was up and running, nothing odd going on.
The 20 OSDs were spread across 3 different servers in the cluster, which runs a mix of 12.2.1 and 12.2.0. All affected OSDs were 12.2.1 and had an uptime of almost exactly 1 year (364 days).
No 12.2.0 OSDs with the same uptime were affected. No 12.2.1 OSDs with a shorter uptime were affected.