Bug #53663

closed

Random scrub errors (omap_digest_mismatch) on pgs of RADOSGW metadata pools

Added by Christian Rohmann over 2 years ago. Updated about 2 years ago.

Status:
Duplicate
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On a 4-node Octopus cluster I am randomly seeing batches of scrub errors, as in:

# ceph health detail

HEALTH_ERR 7 scrub errors; Possible data damage: 6 pgs inconsistent
[ERR] OSD_SCRUB_ERRORS: 7 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 6 pgs inconsistent
    pg 5.3 is active+clean+inconsistent, acting [9,12,6]
    pg 5.4 is active+clean+inconsistent, acting [15,17,18]
    pg 7.2 is active+clean+inconsistent, acting [13,15,10]
    pg 7.9 is active+clean+inconsistent, acting [5,19,4]
    pg 7.e is active+clean+inconsistent, acting [1,15,20]
    pg 7.18 is active+clean+inconsistent, acting [5,10,0]

The cluster was set up directly with Octopus, with no upgrades from a previous release.
It only serves traffic via RADOSGW, and it is a multisite setup with this cluster being the zone master.

The scrub errors seem to occur in two distinct pools only:

# rados list-inconsistent-pg $pool

Pool: zone.rgw.log
["5.3","5.4"]

Pool: zone.rgw.buckets.index
["7.2","7.9","7.e","7.18"]

but they are spread across different OSDs and hosts:

# ceph osd tree
ID  CLASS  WEIGHT     TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
-1         363.23254  root default
-3          90.80814      host host-01
 1    hdd   15.13469          osd.1                 up   1.00000  1.00000
 5    hdd   15.13469          osd.5                 up   1.00000  1.00000
 9    hdd   15.13469          osd.9                 up   1.00000  1.00000
13    hdd   15.13469          osd.13                up   1.00000  1.00000
17    hdd   15.13469          osd.17                up   1.00000  1.00000
21    hdd   15.13469          osd.21                up   1.00000  1.00000
-5          90.80814      host host-02
 0    hdd   15.13469          osd.0                 up   1.00000  1.00000
 4    hdd   15.13469          osd.4                 up   1.00000  1.00000
 8    hdd   15.13469          osd.8                 up   1.00000  1.00000
12    hdd   15.13469          osd.12                up   1.00000  1.00000
16    hdd   15.13469          osd.16                up   1.00000  1.00000
20    hdd   15.13469          osd.20                up   1.00000  1.00000
-9          90.80814      host host-03
 2    hdd   15.13469          osd.2                 up   1.00000  1.00000
 6    hdd   15.13469          osd.6                 up   1.00000  1.00000
10    hdd   15.13469          osd.10                up   1.00000  1.00000
14    hdd   15.13469          osd.14                up   1.00000  1.00000
18    hdd   15.13469          osd.18                up   1.00000  1.00000
23    hdd   15.13469          osd.23                up   1.00000  1.00000
-7          90.80814      host host-04
 3    hdd   15.13469          osd.3                 up   1.00000  1.00000
 7    hdd   15.13469          osd.7                 up   1.00000  1.00000
11    hdd   15.13469          osd.11                up   1.00000  1.00000
15    hdd   15.13469          osd.15                up   1.00000  1.00000
19    hdd   15.13469          osd.19                up   1.00000  1.00000
22    hdd   15.13469          osd.22                up   1.00000  1.00000

(Just as a side note: each host has a single NVMe journal disk shared via LVM across all 6 spinning-rust OSDs.)
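For reference, a layout like this is typically created with a ceph-volume call along the lines of the one below. The device names are placeholders and the exact invocation is an assumption, not a record of the actual deployment:

# ceph-volume lvm batch --bluestore /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf --db-devices /dev/nvme0n1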

Even though at times one host seems to have more errors on its OSDs, which would point to something like bad hardware, the next time it was simply a different host that had multiple of its OSDs with inconsistent PGs. (BTW, those were fixed via a call to pg repair.)
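The repairs were plain pg repair calls on the reported PG IDs, e.g.:

# ceph pg repair 5.3
# ceph pg repair 7.2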

I attached the list-inconsistencies output of all the PGs; they all report omap_digest_mismatch on either a bucket index or a datalog object. I cannot recall past errors ever being about bucket data itself.
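The per-PG details were gathered along the lines of the following, here shown for one of the index PGs from the list above:

# rados list-inconsistent-obj 7.2 --format=json-pretty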

A few days ago I triggered a deep scrub of all OSDs on one host, which came back clean, and a day later that host (-02) reported multiple errors again.
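The deep scrub was triggered per OSD, e.g. for host-02's OSDs (0, 4, 8, 12, 16, 20 in the tree above):

# for id in 0 4 8 12 16 20; do ceph osd deep-scrub osd.$id; done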

The cluster also went through an upgrade from Ubuntu Bionic to Ubuntu Focal (keeping Ceph at 15.2.15 and all of the OSDs), but the randomly occurring scrub errors remained. So an issue with a particular OS / kernel version can be softly ruled out as well.

Please excuse my initial selection of severity critical for this bug report. But this seems rather unlikely to be a simple hardware issue such as a broken disk, and more like a silent corruption issue of RADOSGW metadata. Also, there should be no simple config glitch I could have made that would cause this.


Files

list-inconsistences.tar (30 KB) - Details on the found inconsistencies - Christian Rohmann, 12/19/2021 12:42 AM

Related issues 1 (0 open, 1 closed)

Is duplicate of RADOS - Bug #54592: partial recovery: CEPH_OSD_OP_OMAPRMKEYRANGE should mark omap dirty - Resolved - Neha Ojha
