Bug #53663
Random scrub errors (omap_digest_mismatch) on PGs of RADOSGW metadata pools
Status: Closed
Description
On a 4-node Octopus cluster I am randomly seeing batches of scrub errors, as in:
# ceph health detail
HEALTH_ERR 7 scrub errors; Possible data damage: 6 pgs inconsistent
[ERR] OSD_SCRUB_ERRORS: 7 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 6 pgs inconsistent
    pg 5.3 is active+clean+inconsistent, acting [9,12,6]
    pg 5.4 is active+clean+inconsistent, acting [15,17,18]
    pg 7.2 is active+clean+inconsistent, acting [13,15,10]
    pg 7.9 is active+clean+inconsistent, acting [5,19,4]
    pg 7.e is active+clean+inconsistent, acting [1,15,20]
    pg 7.18 is active+clean+inconsistent, acting [5,10,0]
The cluster was set up directly on Octopus, with no upgrades from a previous release.
It is serving traffic only via RADOSGW, in a multisite setup with this cluster as the zone master.
The scrub errors seem to occur in two distinct pools only:
# rados list-inconsistent-pg $pool
Pool: zone.rgw.log
["5.3","5.4"]
Pool: zone.rgw.buckets.index
["7.2","7.9","7.e","7.18"]
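For reference, drilling into an individual PG with rados list-inconsistent-obj shows which objects and shards disagree. A minimal sketch, using PG 7.2 from the listing above and assuming jq is available to trim the JSON:

# rados list-inconsistent-obj 7.2 --format=json-pretty

# rados list-inconsistent-obj 7.2 --format=json | jq '.inconsistents[] | {object: .object.name, errors: .errors}'

The .inconsistents[].errors field is where the omap_digest_mismatch entries mentioned below show up.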
The errors are spread across different OSDs and hosts, however:
# ceph osd tree
ID  CLASS  WEIGHT     TYPE NAME         STATUS  REWEIGHT  PRI-AFF
-1         363.23254  root default
-3          90.80814      host host-01
 1    hdd   15.13469          osd.1         up  1.00000   1.00000
 5    hdd   15.13469          osd.5         up  1.00000   1.00000
 9    hdd   15.13469          osd.9         up  1.00000   1.00000
13    hdd   15.13469          osd.13        up  1.00000   1.00000
17    hdd   15.13469          osd.17        up  1.00000   1.00000
21    hdd   15.13469          osd.21        up  1.00000   1.00000
-5          90.80814      host host-02
 0    hdd   15.13469          osd.0         up  1.00000   1.00000
 4    hdd   15.13469          osd.4         up  1.00000   1.00000
 8    hdd   15.13469          osd.8         up  1.00000   1.00000
12    hdd   15.13469          osd.12        up  1.00000   1.00000
16    hdd   15.13469          osd.16        up  1.00000   1.00000
20    hdd   15.13469          osd.20        up  1.00000   1.00000
-9          90.80814      host host-03
 2    hdd   15.13469          osd.2         up  1.00000   1.00000
 6    hdd   15.13469          osd.6         up  1.00000   1.00000
10    hdd   15.13469          osd.10        up  1.00000   1.00000
14    hdd   15.13469          osd.14        up  1.00000   1.00000
18    hdd   15.13469          osd.18        up  1.00000   1.00000
23    hdd   15.13469          osd.23        up  1.00000   1.00000
-7          90.80814      host host-04
 3    hdd   15.13469          osd.3         up  1.00000   1.00000
 7    hdd   15.13469          osd.7         up  1.00000   1.00000
11    hdd   15.13469          osd.11        up  1.00000   1.00000
15    hdd   15.13469          osd.15        up  1.00000   1.00000
19    hdd   15.13469          osd.19        up  1.00000   1.00000
22    hdd   15.13469          osd.22        up  1.00000   1.00000
(Just as a side note: each host has a single NVMe journal disk shared via LVM among all 6 spinning-rust OSDs.)
Even though at times one host seemed to have more errors on its OSDs, pointing towards something like bad hardware, the next batch would hit a different host with multiple of its OSDs having inconsistent PGs. (BTW, those were fixed via a call of pg repair.)
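For completeness, the repairs were issued roughly like this (a sketch, assuming jq is installed; the PG list should always be reviewed before repairing):

for pool in zone.rgw.log zone.rgw.buckets.index; do
    # list-inconsistent-pg prints a JSON array such as ["5.3","5.4"]
    for pg in $(rados list-inconsistent-pg "$pool" | jq -r '.[]'); do
        ceph pg repair "$pg"
    done
done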
I attached the list-inconsistencies output of all the PGs; they all report an omap_digest_mismatch on either a bucket index or a datalog object. I cannot recall that past errors were ever about the bucket data itself.
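To pin down which replica actually diverges, the omap keys of an affected object can be dumped on each acting OSD and diffed. A sketch, assuming the OSDs can be stopped briefly, the default data path, and a placeholder object name taken from the list-inconsistent-obj output:

# On each host holding a replica of pg 7.2 (acting set [13,15,10]),
# stop the OSD first, then dump the omap keys:
systemctl stop ceph-osd@15
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-15 \
    --pgid 7.2 '<object-name>' list-omap > /tmp/omap-keys.osd-15
systemctl start ceph-osd@15

# After repeating this on the other acting OSDs, compare the dumps:
diff /tmp/omap-keys.osd-13 /tmp/omap-keys.osd-15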
A few days ago I triggered a deep scrub of all OSDs on one host, which came back clean, and a day later that host (host-02) reported multiple errors again.
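That per-host deep scrub was triggered along these lines (a sketch; note that ceph osd deep-scrub only scrubs the PGs for which that OSD is primary):

# ceph osd ls-tree prints the OSD ids below a CRUSH bucket such as a host
for osd in $(ceph osd ls-tree host-02); do
    ceph osd deep-scrub "$osd"
done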
The cluster also went through an upgrade from Ubuntu Bionic to Ubuntu Focal (keeping Ceph at 15.2.15 and all of the OSDs in place), but the randomly occurring scrub errors remained, so an issue with a particular OS / kernel version can tentatively be ruled out as well.
Please excuse my initial selection of severity critical for this bug report, but this seems rather unlikely to be a simple hardware issue such as a broken disk, and more like silent corruption of RADOSGW metadata. There should also be no simple configuration glitch on my side that could cause this.