Bug #51194

PG recovery_unfound after scrub repair failed on primary

Added by Dan van der Ster over 1 year ago. Updated 3 months ago.

Status:
New
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This comes from a mail I sent to the ceph-users ML: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/3LR2PC2KXVK66KG7U6ADGIC2IRGVQIAL/

The cluster here is running v14.2.20 and is used for RBD images. It is critical for our operations, so we would appreciate any advice on how to proceed.

PG 4.1904 went into state active+recovery_unfound+degraded+repair during normal background scrubbing (note that we have `osd scrub auto repair = true`).
The output of `ceph pg query` is at https://pastebin.com/46XHVKps

The scrub detected a read error on the primary of this PG:

2021-06-13 03:15:11.559680 osd.951 (osd.951) 138 : cluster [DBG] 4.1904 repair starts
2021-06-13 04:00:49.369256 osd.951 (osd.951) 139 : cluster [ERR] 4.1904 shard 951 soid 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head : candidate had a read error

Here's the corresponding kernel log of the IO error on the primary:
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DR
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 Sense Key : Medium Error [current] [descript
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 Add. Sense: Unrecovered read error
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 CDB: Read(16) 88 00 00 00 00 02 ba 8c 0b 00
Jun 13 04:00:46 xxx kernel: blk_update_request: critical medium error, dev sdp, sector 1171967531

But it seems that the other two OSDs could not repair this failed read on the primary because they no longer have the requested version of the object:

2021-06-13 04:28:29.412765 osd.951 (osd.951) 140 : cluster [ERR] 4.1904 repair 0 missing, 1 inconsistent objects
2021-06-13 04:28:29.413320 osd.951 (osd.951) 141 : cluster [ERR] 4.1904 repair 1 errors, 1 fixed
2021-06-13 04:28:29.445659 osd.14 (osd.14) 414 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
2021-06-13 04:28:29.472554 osd.344 (osd.344) 124 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
2021-06-13 04:28:30.863807 mgr.yyy (mgr.692832499) 648287 : cluster [DBG] pgmap v557097: 19456 pgs: 1 active+recovery_unfound+degraded+repair, 2 active+clean+scrubbing,
19423 active+clean, 30 active+clean+scrubbing+deep+repair; 1.3 PiB data, 4.0 PiB used, 2.1 PiB / 6.1 PiB avail; 350 MiB/s rd, 766 MiB/s wr, 16.93k op/s; 3/1063641423 objects degraded (0.000%); 1/354547141 objects unfound (0.000%)

There were no other IO errors or failures during this time window.

We compared the object version numbers with the osdmap epoch at the time the object went unfound: indeed, the osdmap was at e3593555 when this PG was marked recovery_unfound. This implies that OSDs 14 and 344 have a newer version of the object, written while the scrub/repair was ongoing:

2021-06-13 03:50:13.808204 mon.cephbeesly-mon-2a00f134e5 (mon.0) 1232629 : cluster [DBG] osdmap e3593554: 1224 total, 1222 up, 1222 in
...
2021-06-13 04:00:49.369256 osd.951 (osd.951) 139 : cluster [ERR] 4.1904 shard 951 soid 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head : candidate had a read error
...
2021-06-13 04:04:50.359786 mon.cephbeesly-mon-2a00f134e5 (mon.0) 1232802 : cluster [DBG] osdmap e3593555: 1224 total, 1222 up, 1222 in
...
2021-06-13 04:28:29.445659 osd.14 (osd.14) 414 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
2021-06-13 04:28:29.472554 osd.344 (osd.344) 124 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
...
2021-06-13 04:31:17.106383 mon.cephbeesly-mon-2a00f134e5 (mon.0) 1233177 : cluster [DBG] osdmap e3593556: 1224 total, 1222 up, 1222 in
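The epoch parts of these version numbers line up with the osdmap timeline above. A quick arithmetic sketch (plain Python, not a Ceph API; the epoch/version split is just the `epoch'version` notation from the logs):

```python
# Sketch of the epoch comparison above, using the values from the logs.
# An object version "epoch'version" records the osdmap epoch in force
# when the write happened.

needed_epoch, needed_ver = 3592634, 367863320  # version osd.951 wants back
local_epoch,  local_ver  = 3593555, 368312656  # version on osd.14 / osd.344

repair_start_epoch = 3593554  # osdmap current when the read error was logged
repair_end_epoch   = 3593555  # osdmap current when the pushes failed

# The replicas' copy was written at the epoch in force during the repair,
# i.e. strictly after the version the primary recorded at scrub time.
assert needed_epoch < repair_start_epoch
assert local_epoch == repair_end_epoch
assert (local_epoch, local_ver) > (needed_epoch, needed_ver)
```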

Could this be a race in the PG repair? Something like:

1. Repair starts, osd.951 has object read error when it is v 3592634'367863320.
2. Later on during repair, that object is updated by the client, bumping it to v 3593555'368312656
3. Repair completes a few minutes later; osd.951 asks its peers to recover the object at the older v 3592634'367863320 --> they don't have it anymore, and the object is marked unfound.
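The suspected race can be modeled in a few lines. This is a toy simulation of the sequence above (an assumption/sketch, not Ceph code): a primary that snapshots the "needed" version at repair start ends up asking for a version the replicas have already overwritten.

```python
# Toy model of the suspected repair race, using the versions from the logs.

class Replica:
    def __init__(self, version):
        self.version = version          # (epoch, version) of the local copy

    def push(self, wanted):
        # Mirrors the log: "push ... failed because local copy is ..."
        return self.version == wanted

# 1. Repair starts; osd.951 hits a read error at v 3592634'367863320.
needed = (3592634, 367863320)
replicas = [Replica(needed), Replica(needed)]   # osd.14, osd.344

# 2. During the repair, a client write bumps the object to
#    v 3593555'368312656 on the healthy replicas.
for r in replicas:
    r.version = (3593555, 368312656)

# 3. Repair completes; the primary asks its peers for the *old* version.
locations = [r for r in replicas if r.push(needed)]
assert locations == []   # nobody can serve it -> object marked unfound
```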

How do we clean up from this? We are considering:

  • Run `ceph pg repair ..` again now and hope it repairs with the correct new version of the object.
  • Stop osd.951 and let the data backfill to other osds, then repair again.
  • Set osd.951's primary-affinity to zero so one of the other OSDs becomes primary, then repair again.

Or maybe we should just run `mark_unfound_lost revert`? Will it take v 3593555'368312656 from the peer OSDs?

History

#1 Updated by Dan van der Ster over 1 year ago

To prevent the user IO from being blocked, we took this action:

1. First, we queried the unfound objects. osd.951 still needs the old version 3592634'367863320:

# ceph pg 4.1904 list_unfound
{
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                "oid": "rbd_data.3a4ff12d847b61.000000000001c39e",
                "key": "",
                "snapid": -2,
                "hash": 3686742276,
                "max": 0,
                "pool": 4,
                "namespace": "" 
            },
            "need": "3592634'367863320",
            "have": "0'0",
            "flags": "none",
            "locations": []
        }
    ],
    "more": false
}

2. We stopped osd.951 -> the PG went active on [14,344]. The PG is no longer recovery_unfound.
3. We marked osd.951 out... it's now backfilling to [774,14,344]. pg query at https://termbin.com/m8c2
4. After backfilling completed, we deep-scrubbed the PG and it found no inconsistencies. All good!
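The `list_unfound` output in step 1 can be read mechanically: `need` is the version the primary wants back, `have` is what it holds locally (`0'0` meaning nothing readable), and `locations` lists the peers that could supply it (empty here). A small sketch decoding the output pasted above with the standard `json` module:

```python
# Decode the `ceph pg 4.1904 list_unfound` output shown above.
import json

out = json.loads("""
{
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {"oid": "rbd_data.3a4ff12d847b61.000000000001c39e",
                    "key": "", "snapid": -2, "hash": 3686742276,
                    "max": 0, "pool": 4, "namespace": ""},
            "need": "3592634'367863320",
            "have": "0'0",
            "flags": "none",
            "locations": []
        }
    ],
    "more": false
}
""")

obj = out["objects"][0]
assert obj["need"] == "3592634'367863320"   # stale version from scrub time
assert obj["have"] == "0'0"                 # primary has no readable copy
assert obj["locations"] == []               # and no peer advertises one
```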

#2 Updated by Michel Jouvin 3 months ago

Hi,

We hit exactly the same problem at IJCLab: a flaky OSD (with unmonitored smartd predictive errors) caused an I/O error during a repair, with the consequences described by Dan (an attempt to restore a no-longer-existing version of a successfully updated object). We applied the same course of action to fix the problem, with no data loss.

I have no opinion on whether Ceph could handle this situation better, but I think the Ceph documentation on handling "unfound objects", https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/?#failures-osd-unfound, deserves some improvement. Currently, it basically suggests that if you have unfound objects, you will have to declare them lost. We were about to do that (which can be very harmful on an RBD volume) when we found Dan's ticket. I think the situation described by Dan, and that we experienced, is one of the most likely to produce unfound objects in normal operations, and in this case there is no reason to lose data if the faulty OSD is shut down. The documentation should be improved, or at least reference this ticket as one of the possible causes with a specific solution.

Michel

#3 Updated by Dan van der Ster 3 months ago

  • Affected Versions v15.2.15 added

#4 Updated by Dan van der Ster 3 months ago

Just hit this in a v15.2.15 cluster too. Michel, which version does your cluster run?

#5 Updated by Radoslaw Zarzynski 3 months ago

  • Assignee set to Ronen Friedman

It would be great to have a reproducer for this. Let's see whether we can write a standalone test exercising the sequence described at the bottom of the description (the suspicion mentioned there seems right).
