Bug #51194
PG recovery_unfound after scrub repair failed on primary
Description
This comes from a mail I sent to the ceph-users ML: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/3LR2PC2KXVK66KG7U6ADGIC2IRGVQIAL/
The cluster here is running v14.2.20 and is used for RBD images. It is critical for our operations, so we would appreciate any advice on how to proceed.
PG 4.1904 went into state active+recovery_unfound+degraded+repair during normal background scrubbing (but note that we have `osd scrub auto repair = true`).
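For reference, on Nautilus and later the auto-repair behaviour can be checked and toggled at runtime; a minimal sketch, assuming the option name osd_scrub_auto_repair:
# ceph config get osd osd_scrub_auto_repair
# ceph config set osd osd_scrub_auto_repair false
(The second command temporarily disables automatic repair during scrubs while investigating; re-enable it the same way with true.)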
The output of pg query is at https://pastebin.com/46XHVKps
The scrub detected a read error on the primary of this PG:
2021-06-13 03:15:11.559680 osd.951 (osd.951) 138 : cluster [DBG] 4.1904 repair starts
2021-06-13 04:00:49.369256 osd.951 (osd.951) 139 : cluster [ERR] 4.1904 shard 951 soid 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head : candidate had a read error
Here's the corresponding kernel log of the IO error on the primary:
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DR
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 Sense Key : Medium Error [current] [descript
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 Add. Sense: Unrecovered read error
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 CDB: Read(16) 88 00 00 00 00 02 ba 8c 0b 00
Jun 13 04:00:46 xxx kernel: blk_update_request: critical medium error, dev sdp, sector 1171967531
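The underlying medium error can usually be confirmed from the drive's SMART data; a minimal sketch, assuming the affected device is still /dev/sdp:
# smartctl -a /dev/sdp | grep -i -E 'pending|uncorrect|reallocated'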
But it seems that the other two OSDs could not repair the failed read on the primary, because their local copy is at a newer version than the one the primary expects:
2021-06-13 04:28:29.412765 osd.951 (osd.951) 140 : cluster [ERR] 4.1904 repair 0 missing, 1 inconsistent objects
2021-06-13 04:28:29.413320 osd.951 (osd.951) 141 : cluster [ERR] 4.1904 repair 1 errors, 1 fixed
2021-06-13 04:28:29.445659 osd.14 (osd.14) 414 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
2021-06-13 04:28:29.472554 osd.344 (osd.344) 124 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
2021-06-13 04:28:30.863807 mgr.yyy (mgr.692832499) 648287 : cluster [DBG] pgmap v557097: 19456 pgs: 1 active+recovery_unfound+degraded+repair, 2 active+clean+scrubbing, 19423 active+clean, 30 active+clean+scrubbing+deep+repair; 1.3 PiB data, 4.0 PiB used, 2.1 PiB / 6.1 PiB avail; 350 MiB/s rd, 766 MiB/s wr, 16.93k op/s; 3/1063641423 objects degraded (0.000%); 1/354547141 objects unfound (0.000%)
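For completeness, the objects flagged by the last scrub can also be listed from the command line (a sketch; the exact output format varies by release):
# rados list-inconsistent-obj 4.1904 --format=json-pretty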
There were no other IO errors or failures during this time window.
We compared the object version numbers with the osdmap epoch at the time the object went unfound (in the `epoch'version` notation, 3592634'367863320 means the object was last modified at osdmap epoch 3592634). Indeed, the osdmap was at e3593555 when this PG was marked recovery_unfound, which implies that osds 14 and 344 hold a newer version of the object, written while the scrub/repair was ongoing:
2021-06-13 03:50:13.808204 mon.cephbeesly-mon-2a00f134e5 (mon.0) 1232629 : cluster [DBG] osdmap e3593554: 1224 total, 1222 up, 1222 in
...
2021-06-13 04:00:49.369256 osd.951 (osd.951) 139 : cluster [ERR] 4.1904 shard 951 soid 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head : candidate had a read error
...
2021-06-13 04:04:50.359786 mon.cephbeesly-mon-2a00f134e5 (mon.0) 1232802 : cluster [DBG] osdmap e3593555: 1224 total, 1222 up, 1222 in
...
2021-06-13 04:28:29.445659 osd.14 (osd.14) 414 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
2021-06-13 04:28:29.472554 osd.344 (osd.344) 124 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
...
2021-06-13 04:31:17.106383 mon.cephbeesly-mon-2a00f134e5 (mon.0) 1233177 : cluster [DBG] osdmap e3593556: 1224 total, 1222 up, 1222 in
Could this be a race in the PG repair? Something like:
1. Repair starts; osd.951 gets a read error on the object while it is at v 3592634'367863320.
2. Later on during the repair, that object is updated by the client, bumping it to v 3593555'368312656.
3. The repair completes a few minutes later and osd.951 asks its peers to recover the object at the older v 3592634'367863320 --> they no longer have it, so the object is marked unfound.
How do we clean up from this? We are considering the following options (commands are sketched after this list):
- Run `ceph pg repair ..` again now and hope it repairs using the correct, newer version of the object.
- Stop osd.951 and let the data backfill to other OSDs, then repair again.
- Set osd.951's primary-affinity to zero so that one of the other OSDs becomes primary, then repair again.
Or maybe we should just run `mark_unfound_lost revert`? Will it take v 3593555'368312656 from the peer OSDs?
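For concreteness, a hedged sketch of the commands behind those options (PG and OSD ids taken from above; the unit name assumes a systemd deployment):
Option 1, repair again:
# ceph pg repair 4.1904
Option 2, stop the primary (run on its host):
# systemctl stop ceph-osd@951
Option 3, demote the primary:
# ceph osd primary-affinity osd.951 0
Last resort, revert the unfound object:
# ceph pg 4.1904 mark_unfound_lost revert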
History
#1 Updated by Dan van der Ster over 2 years ago
To prevent the user IO from being blocked, we took the following actions (the command sequence is sketched after this list):
1. First, we queried the unfound objects. osd.951 still needs the old version 3592634'367863320:
# ceph pg 4.1904 list_unfound
{
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                "oid": "rbd_data.3a4ff12d847b61.000000000001c39e",
                "key": "",
                "snapid": -2,
                "hash": 3686742276,
                "max": 0,
                "pool": 4,
                "namespace": ""
            },
            "need": "3592634'367863320",
            "have": "0'0",
            "flags": "none",
            "locations": []
        }
    ],
    "more": false
}
2. We stopped osd.951 -> the PG went active on [14,344]. The PG is no longer recovery_unfound.
3. We marked osd.951 out... it's now backfilling to [774,14,344]. pg query at https://termbin.com/m8c2
4. After backfilling completed, we deep-scrubbed the PG and it found no inconsistencies. All good!
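For reference, the steps above map roughly onto the following commands (a sketch; the unit name assumes a systemd deployment):
Step 1, list the unfound objects:
# ceph pg 4.1904 list_unfound
Step 2, stop the failing primary (run on its host):
# systemctl stop ceph-osd@951
Step 3, mark it out so the PG backfills to a new third OSD:
# ceph osd out 951
Step 4, once backfill completes, verify with a deep scrub:
# ceph pg deep-scrub 4.1904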
#2 Updated by Michel Jouvin over 1 year ago
Hi,
We suffered exactly the same problem at IJCLab: a flaky OSD (with unmonitored smartd predictive failure warnings) caused an I/O error during a repair, with the consequences described by Dan (an attempt to restore a no-longer-existing version of an object that had since been updated successfully). We applied the same course of action to fix the problem, with no data loss.
I have no opinion on whether Ceph could handle this situation better, but I think that the Ceph documentation on how to handle "unfound objects", https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/?#failures-osd-unfound, deserves some improvement. Currently it basically suggests that if you have unfound objects, you will have to declare them lost. We were about to do that (which can be very harmful on an RBD volume) when we found Dan's ticket. I think that the situation described by Dan, which we also experienced, is one of the most likely ways to produce unfound objects in normal operations, and in this case there is no reason to lose data if the faulty OSD is shut down. I think the documentation should be improved, or at least reference this ticket as one of the possible causes with a specific solution.
Michel
#3 Updated by Dan van der Ster about 1 year ago
- Affected Versions v15.2.15 added
#4 Updated by Dan van der Ster about 1 year ago
Just hit this in a v15.2.15 cluster too. Michel, which version is your cluster running?
#5 Updated by Radoslaw Zarzynski about 1 year ago
- Assignee set to Ronen Friedman
It would be great to have a reproducer for this. Let's see whether we can make a standalone test exercising the sequence described at the bottom of the description (the suspected race seems right).
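A rough manual outline of such a test could look like the sketch below. This is only an assumption of how one might drive it from the CLI: the read-error injection step (injectdataerr plus the corresponding debug read-error config on the OSD) is borrowed from the existing standalone scrub tests, which remain the authoritative reference, and the pool/object names are made up:
# ceph osd pool create repro 1 1
# rados -p repro put obj1 /etc/hosts
# ceph tell osd.<primary> injectdataerr repro obj1
# ceph pg repair <pgid>
# rados -p repro put obj1 /etc/services
# ceph pg <pgid> query | grep -i unfound
The overwrite in the fifth step has to land while the repair is still in flight, so that the replicas move past the version the primary wants to recover; the last command checks whether the PG ends up recovery_unfound instead of cleanly repaired.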
#6 Updated by Enrico Bocchi 11 months ago
We had another occurrence of this on Pacific v16.2.9.
#7 Updated by Radoslaw Zarzynski 11 months ago
- Backport set to pacific,quincy
#8 Updated by Konstantin Shalygin 11 months ago
- Affected Versions v16.2.9 added