Bug #22561
PG stuck during recovery, requires OSD restart
Status: Open · % Done: 0%
Description
We are sometimes encountering issues with PGs getting stuck in recovery.
For example, we ran stress tests with BlueStore and EC pools on our test cluster: we randomly swapped disks between servers and powered servers off under heavy load. This left the cluster in a fairly bad state, and a few PGs got stuck in the recovery_wait state.
The affected PGs reported unfound objects and the recovery state looked like this:
"recovery_state": [
    {
        "name": "Started/Primary/Active",
        "enter_time": "2018-01-04 00:31:01.884650",
        "might_have_unfound": [
            { "osd": "5(2)",  "status": "already probed" },
            { "osd": "5(5)",  "status": "already probed" },
            { "osd": "9(3)",  "status": "already probed" },
            { "osd": "23(2)", "status": "already probed" },
            { "osd": "26(5)", "status": "already probed" },
            { "osd": "28(5)", "status": "already probed" },
            { "osd": "31(1)", "status": "already probed" },
            { "osd": "32(2)", "status": "querying" },
            { "osd": "41(4)", "status": "already probed" },
            { "osd": "51(1)", "status": "already probed" },
            { "osd": "59(2)", "status": "already probed" },
            { "osd": "71(5)", "status": "already probed" }
        ],
        "recovery_progress": {
            "backfill_targets": [],
            "waiting_on_backfill": [],
            "last_backfill_started": "MIN",
            "backfill_info": {
                "begin": "MIN",
                "end": "MIN",
                "objects": []
            },
            "peer_backfill_info": [],
            "backfills_in_flight": [],
            "recovering": [],
            "pg_backend": {
                "recovery_ops": [],
                "read_ops": []
            }
        },
        "scrub": {
            "scrubber.epoch_start": "0",
            "scrubber.active": false,
            "scrubber.state": "INACTIVE",
            "scrubber.start": "MIN",
            "scrubber.end": "MIN",
            "scrubber.subset_last_update": "0'0",
            "scrubber.deep": false,
            "scrubber.seed": 0,
            "scrubber.waiting_on": 0,
            "scrubber.waiting_on_whom": []
        }
    },
    {
        "name": "Started",
        "enter_time": "2018-01-04 00:30:16.663165"
    }
]
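The stuck peer is visible directly in this output: every entry in might_have_unfound is "already probed" except OSD 32, which stays in "querying" indefinitely. A minimal sketch of how one might pull the stuck peers out of a `ceph pg <pgid> query` dump (the field names come from the output above; the sample data and the helper function here are abbreviated/illustrative, not a Ceph API):

```python
import json

# Abbreviated sample of `ceph pg <pgid> query` output, matching the dump above.
pg_query = json.loads("""
{
  "recovery_state": [
    {
      "name": "Started/Primary/Active",
      "might_have_unfound": [
        {"osd": "5(2)",  "status": "already probed"},
        {"osd": "32(2)", "status": "querying"},
        {"osd": "41(4)", "status": "already probed"}
      ]
    },
    {"name": "Started"}
  ]
}
""")

def stuck_peers(query):
    """Yield the OSD shards the PG's primary is still querying for unfound objects."""
    for state in query.get("recovery_state", []):
        for peer in state.get("might_have_unfound", []):
            if peer["status"] == "querying":
                yield peer["osd"]

# Shard notation "32(2)" means OSD 32, EC shard 2.
print(list(stuck_peers(pg_query)))  # → ['32(2)']
```

Restarting the OSD listed here (ceph-osd on OSD 32) is what unstuck recovery in our case.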
Recovery never started, and the primary apparently never actually contacted OSD 32 (which had the missing objects). We couldn't find anything useful in the logs, but unfortunately we didn't save them.
The only way to get recovery started was to restart OSD 32, which held the unfound objects. The PG recovered fine after that.
This affected ~20 PGs out of ~5000.
We waited more than 30 minutes on one of the PGs with the cluster completely idle, and the state never changed.
Note that we had to restart the OSDs the primary was trying to contact, not the primary of the PG itself.
We have also seen this happen in a production cluster after a larger network outage, though there it occurred during backfill and with no unfound objects.
ceph 12.2.2