Bug #50222
osd: 5.2s0 deep-scrub : stat mismatch
Description
2021-04-07T05:00:34.818 INFO:tasks.ceph.osd.6.smithi066.stderr:2021-04-07T05:00:34.811+0000 7fe23d3a4700 -1 log_channel(cluster) log [ERR] : 5.2s0 deep-scrub : stat mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/4194304 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2021-04-07T05:00:34.819 INFO:tasks.ceph.osd.6.smithi066.stderr:2021-04-07T05:00:34.811+0000 7fe23d3a4700 -1 log_channel(cluster) log [ERR] : 5.2 deep-scrub 1 errors
2021-04-07T05:00:39.489 INFO:tasks.ceph.mon.a.smithi049.stderr:2021-04-07T05:00:39.481+0000 7fa8e20b9700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2021-04-07T05:00:39.490 INFO:tasks.ceph.mon.a.smithi049.stderr:2021-04-07T05:00:39.481+0000 7fa8e20b9700 -1 log_channel(cluster) log [ERR] : Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)
From: /ceph/teuthology-archive/pdonnell-2021-04-07_02:12:41-fs-wip-pdonnell-testing-20210406.213012-distro-basic-smithi/6025668/teuthology.log
With osd thrashing, osd-mds msgr delays/failures, and EC pools.
Updated by Patrick Donnelly almost 3 years ago
- Has duplicate Bug #51706: pacific: qa: osd deep-scrub stat mismatch added
Updated by Laura Flores about 2 years ago
Looks similar, but different test.
/a/yuriw-2022-02-09_22:52:18-rados-wip-yuri5-testing-2022-02-09-1322-pacific-distro-default-smithi/6672171
Description: rados/upgrade/nautilus-x-singleton/{0-cluster/{openstack start} 1-install/nautilus 2-partial-upgrade/firsthalf 3-thrash/default 4-workload/{rbd-cls rbd-import-export readwrite snaps-few-objects} 5-workload/{radosbench rbd_api} 6-finish-upgrade 7-pacific 8-workload/{rbd-python snaps-many-objects} bluestore-bitmap mon_election/connectivity thrashosds-health ubuntu_18.04}
Updated by Laura Flores almost 2 years ago
/a/yuriw-2022-05-13_14:13:55-rados-wip-yuri3-testing-2022-05-12-1609-octopus-distro-default-smithi/6832544
Description: rados/thrash-erasure-code-big/{ceph cluster/{12-osds openstack} msgr-failures/osd-dispatch-delay objectstore/bluestore-comp-zstd rados recovery-overrides/{default} supported-random-distro$/{ubuntu_latest} thrashers/careful thrashosds-health workloads/ec-rados-plugin=jerasure-k=4-m=2}
teuthology.log - Health check reports an inconsistent pg:
2022-05-14T08:13:30.891 INFO:tasks.rados.rados.0.smithi081.stdout:2767: got expected ENOENT (src dne)
2022-05-14T08:13:30.892 INFO:tasks.ceph.mon.a.smithi081.stderr:2022-05-14T08:13:30.884+0000 7fa6e9dab700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2022-05-14T08:13:30.892 INFO:tasks.ceph.mon.a.smithi081.stderr:2022-05-14T08:13:30.884+0000 7fa6e9dab700 -1 log_channel(cluster) log [ERR] : Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)
The affected pg in this instance is 3.1e. Here is the point in ceph-osd.3.log.gz where it first becomes inconsistent. We can see that this occurs during deep scrubbing:
2022-05-14T08:13:24.228+0000 7fa5d70fd700 10 osd.3 pg_epoch: 621 pg[3.1es0( v 620'1485 (0'0,620'1485] local-lis/les=582/583 n=7 ec=246/26 lis/c=582/582 les/c/f=583/587/0 sis=582) [3,9,8,2,7,6]p3(0) r=0 lpr=582 crt=620'1485 lcod 619'1483 mlcod 619'1483 active+clean+scrubbing+deep trimq=[1ff~1] ps=[188~1,1d0~1,1d7~2,1db~1,1e5~2,1eb~1,1ed~4,1f4~2,1fa~4]] deep-scrub got 7/7 objects, 3/3 clones, 7/7 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 11272192/11845632 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2022-05-14T08:13:24.228+0000 7fa5d70fd700 10 osd.3 pg_epoch: 621 pg[3.1es0( v 620'1485 (0'0,620'1485] local-lis/les=582/583 n=7 ec=246/26 lis/c=582/582 les/c/f=583/587/0 sis=582) [3,9,8,2,7,6]p3(0) r=0 lpr=582 crt=620'1485 lcod 619'1483 mlcod 619'1483 active+clean+scrubbing+deep trimq=[1ff~1] ps=[188~1,1d0~1,1d7~2,1db~1,1e5~2,1eb~1,1ed~4,1f4~2,1fa~4]] scrub_process_inconsistent: checking authoritative
2022-05-14T08:13:24.228+0000 7fa5d70fd700 20 osd.3 pg_epoch: 621 pg[3.1es0( v 620'1485 (0'0,620'1485] local-lis/les=582/583 n=7 ec=246/26 lis/c=582/582 les/c/f=583/587/0 sis=582) [3,9,8,2,7,6]p3(0) r=0 lpr=582 crt=620'1485 lcod 619'1483 mlcod 619'1483 active+clean+scrubbing+deep+inconsistent trimq=[1ff~1] ps=[188~1,1d0~1,1d7~2,1db~1,1e5~2,1eb~1,1ed~4,1f4~2,1fa~4]] prepare_stats_for_publish reporting purged_snaps [188~1,1d0~1,1d7~2,1db~1,1e5~2,1eb~1,1ed~4,1f4~2,1fa~4]
2022-05-14T08:13:24.228+0000 7fa5d70fd700 15 osd.3 pg_epoch: 621 pg[3.1es0( v 620'1485 (0'0,620'1485] local-lis/les=582/583 n=7 ec=246/26 lis/c=582/582 les/c/f=583/587/0 sis=582) [3,9,8,2,7,6]p3(0) r=0 lpr=582 crt=620'1485 lcod 619'1483 mlcod 619'1483 active+clean+scrubbing+deep+inconsistent trimq=[1ff~1] ps=[188~1,1d0~1,1d7~2,1db~1,1e5~2,1eb~1,1ed~4,1f4~2,1fa~4]] publish_stats_to_osd 620:3621
ceph.log.gz also shows that the "stat mismatch" first occurs during the deep scrub on pg 3.1e, which causes the pg to enter an inconsistent state:
2022-05-14T08:13:23.220943+0000 osd.3 (osd.3) 50 : cluster [DBG] 3.1e deep-scrub starts
2022-05-14T08:13:23.850823+0000 mon.a (mon.0) 2182 : cluster [WRN] Health check failed: noscrub flag(s) set (OSDMAP_FLAGS)
2022-05-14T08:13:23.865188+0000 mon.a (mon.0) 2185 : cluster [DBG] osdmap e621: 12 total, 12 up, 12 in
2022-05-14T08:13:24.232155+0000 osd.3 (osd.3) 51 : cluster [ERR] 3.1es0 deep-scrub : stat mismatch, got 7/7 objects, 3/3 clones, 7/7 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 11272192/11845632 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2022-05-14T08:13:24.232175+0000 osd.3 (osd.3) 52 : cluster [ERR] 3.1e deep-scrub 1 errors
2022-05-14T08:13:24.461896+0000 mgr.x (mgr.4108) 388 : cluster [DBG] pgmap v936: 65 pgs: 1 recovery_wait+undersized+degraded+peered, 1 active+clean+scrubbing+deep, 1 active+clean+scrubbing, 1 active+remapped+backfill_toofull, 61 active+clean; 187 MiB data, 360 MiB used, 1.0 TiB / 1.1 TiB avail; 0 B/s wr, 5 op/s; 225/456 objects degraded (49.342%); 3/456 objects misplaced (0.658%); 0 B/s, 0 keys/s, 0 objects/s recovering
2022-05-14T08:13:24.860234+0000 mon.a (mon.0) 2187 : cluster [WRN] Health check failed: Degraded data redundancy: 225/456 objects degraded (49.342%), 1 pg degraded (PG_DEGRADED)
2022-05-14T08:13:24.872905+0000 mon.a (mon.0) 2189 : cluster [DBG] osdmap e622: 12 total, 12 up, 12 in
2022-05-14T08:13:26.462690+0000 mgr.x (mgr.4108) 392 : cluster [DBG] pgmap v938: 65 pgs: 1 recovery_wait+undersized+degraded+peered, 1 active+clean+scrubbing+deep, 1 active+clean+scrubbing, 1 active+remapped+backfill_toofull, 61 active+clean; 158 MiB data, 331 MiB used, 1.0 TiB / 1.1 TiB avail; 0 B/s wr, 4 op/s; 225/408 objects degraded (55.147%); 4/408 objects misplaced (0.980%); 0 B/s, 0 keys/s, 0 objects/s recovering
2022-05-14T08:13:26.720695+0000 mon.a (mon.0) 2194 : cluster [DBG] osdmap e623: 12 total, 12 up, 12 in
2022-05-14T08:13:27.876665+0000 mon.a (mon.0) 2195 : cluster [DBG] osdmap e624: 12 total, 12 up, 12 in
2022-05-14T08:13:28.463582+0000 mgr.x (mgr.4108) 393 : cluster [DBG] pgmap v941: 65 pgs: 1 recovering+undersized+degraded+peered, 1 active+clean+snaptrim, 1 active+clean+scrubbing+deep, 1 active+remapped+backfill_toofull, 61 active+clean; 158 MiB data, 335 MiB used, 1.0 TiB / 1.1 TiB avail; 2.3 MiB/s rd, 10 KiB/s wr, 3 op/s; 4/408 objects degraded (0.980%); 29/408 objects misplaced (7.108%); 0 B/s, 0 keys/s, 5 objects/s recovering
2022-05-14T08:13:28.882108+0000 mon.a (mon.0) 2197 : cluster [DBG] osdmap e625: 12 total, 12 up, 12 in
2022-05-14T08:13:29.878578+0000 mon.a (mon.0) 2199 : cluster [WRN] Health check update: nodeep-scrub flag(s) set (OSDMAP_FLAGS)
2022-05-14T08:13:29.885488+0000 mon.a (mon.0) 2201 : cluster [DBG] osdmap e626: 12 total, 12 up, 12 in
2022-05-14T08:13:30.464355+0000 mgr.x (mgr.4108) 396 : cluster [DBG] pgmap v944: 65 pgs: 1 active+clean+inconsistent+snaptrim, 1 active+clean+snaptrim_wait, 1 recovering+undersized+degraded+peered, 2 active+clean+snaptrim, 1 active+remapped+backfill_toofull, 59 active+clean; 161 MiB data, 333 MiB used, 1.0 TiB / 1.1 TiB avail; 4.3 MiB/s rd, 2.5 MiB/s wr, 12 op/s; 4/420 objects degraded (0.952%); 29/420 objects misplaced (6.905%); 0 B/s, 5 objects/s recovering
2022-05-14T08:13:30.709741+0000 mon.a (mon.0) 2203 : cluster [WRN] Health check update: Degraded data redundancy: 4/408 objects degraded (0.980%), 1 pg degraded (PG_DEGRADED)
2022-05-14T08:13:30.889499+0000 mon.a (mon.0) 2204 : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2022-05-14T08:13:30.889554+0000 mon.a (mon.0) 2205 : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)
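For reference, the usual triage for a PG_DAMAGED / inconsistent-pg warning like the one above looks roughly like this (illustrative only; requires a running cluster, and the pg id 3.1e is taken from the logs above):

```shell
# Show which pg is inconsistent and which OSDs serve it
ceph health detail
ceph pg ls inconsistent

# Dump the per-object scrub errors recorded for the pg
rados list-inconsistent-obj 3.1e --format=json-pretty

# Ask the primary to repair the pg (re-scrubs and fixes
# the divergent copies where possible)
ceph pg repair 3.1e
```

In this bug, though, the mismatch is in the aggregate pg stats rather than in any single object, so list-inconsistent-obj may come back empty.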
Later in ceph.log.gz, all pgs eventually become active+clean, but osd.0 is then reported as failed:
2022-05-14T08:19:28.611362+0000 mgr.x (mgr.4108) 653 : cluster [DBG] pgmap v1418: 59 pgs: 59 active+clean; 0 B data, 129 MiB used, 1.0 TiB / 1.1 TiB avail
2022-05-14T08:19:29.397740+0000 mon.a (mon.0) 3183 : cluster [DBG] osd.0 reported immediately failed by osd.3
2022-05-14T08:19:29.397832+0000 mon.a (mon.0) 3184 : cluster [INF] osd.0 failed (root=default,host=smithi081) (connection refused reported by osd.3)
The last recorded pgmap in teuthology.log shows pg 3.13 as "active+clean+inconsistent":
{
"pgs_by_state": [
{
"state_name": "active+clean",
"count": 81
},
{
"state_name": "active+remapped+backfill_toofull",
"count": 2
},
{
"state_name": "peering",
"count": 1
},
{
"state_name": "active+clean+inconsistent",
"count": 1
}
],
"num_pgs": 85,
"num_pools": 3,
"num_objects": 201,
"data_bytes": 407617536,
"bytes_used": 10346008576,
"bytes_avail": 859384868864,
"bytes_total": 869730877440,
"inactive_pgs_ratio": 0.0117647061124444,
"misplaced_objects": 37,
"misplaced_total": 1190,
"misplaced_ratio": 0.031092436974789917,
"recovering_objects_per_sec": 1,
"recovering_bytes_per_sec": 4470879,
"recovering_keys_per_sec": 0,
"num_objects_recovered": 7,
"num_bytes_recovered": 22675456,
"num_keys_recovered": 0,
"read_bytes_sec": 3431497,
"write_bytes_sec": 0,
"read_op_per_sec": 2,
"write_op_per_sec": 4
}
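The failure check here boils down to scanning pgs_by_state for any state containing "inconsistent". A minimal sketch of that scan (the helper name is hypothetical, not the actual teuthology code), using the pgmap summary above:

```python
import json

# Abbreviated copy of the pgmap summary logged above.
pgmap_json = """
{
  "pgs_by_state": [
    {"state_name": "active+clean", "count": 81},
    {"state_name": "active+remapped+backfill_toofull", "count": 2},
    {"state_name": "peering", "count": 1},
    {"state_name": "active+clean+inconsistent", "count": 1}
  ],
  "num_pgs": 85
}
"""

def inconsistent_pg_count(pgmap):
    """Sum the counts of all pg states that include 'inconsistent'."""
    return sum(entry["count"]
               for entry in pgmap.get("pgs_by_state", [])
               if "inconsistent" in entry["state_name"].split("+"))

pgmap = json.loads(pgmap_json)
print(inconsistent_pg_count(pgmap))  # -> 1
```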
Updated by Laura Flores almost 2 years ago
- Backport set to quincy,pacific,octopus
Updated by Laura Flores almost 2 years ago
/a/yuriw-2022-06-07_19:48:58-rados-wip-yuri6-testing-2022-06-07-0955-pacific-distro-default-smithi/6866688
Updated by Laura Flores almost 2 years ago
- Related to Bug #52737: osd/tests: stat mismatch added
Updated by Laura Flores almost 2 years ago
- Related to deleted (Bug #52737: osd/tests: stat mismatch )
Updated by Laura Flores almost 2 years ago
- Has duplicate Bug #52737: osd/tests: stat mismatch added
Updated by Laura Flores almost 2 years ago
/a/yuriw-2022-06-14_20:42:00-rados-wip-yuri2-testing-2022-06-14-0949-octopus-distro-default-smithi/6878271
Updated by Radoslaw Zarzynski almost 2 years ago
- Assignee set to Laura Flores
Not a terribly high priority.
Updated by Rishabh Dave almost 2 years ago
This error showed up in QA runs -
http://pulpito.front.sepia.ceph.com/rishabh-2022-07-08_23:53:34-fs-wip-rishabh-testing-2022Jul08-1820-testing-default-smithi/6921153
http://pulpito.front.sepia.ceph.com/rishabh-2022-07-08_23:53:34-fs-wip-rishabh-testing-2022Jul08-1820-testing-default-smithi/6921085
Showed up in re-run as well -
http://pulpito.front.sepia.ceph.com/rishabh-2022-07-15_06:42:04-fs-wip-rishabh-testing-2022Jul08-1820-testing-default-smithi/6931942
http://pulpito.front.sepia.ceph.com/rishabh-2022-07-15_06:42:04-fs-wip-rishabh-testing-2022Jul08-1820-testing-default-smithi/6931950
Updated by Laura Flores almost 2 years ago
- Status changed from New to In Progress
Thanks Rishabh, I am having a look into this.
Updated by Laura Flores almost 2 years ago
All the tests that this has failed on involve thrashing. Specifically, they all use thrashosds-health.yaml (https://github.com/ceph/ceph/blob/main/qa/tasks/thrashosds-health.yaml).
The tests are failing due to a shallow scrub error:
2022-07-15T10:38:05.497+0000 7f2cd10f3700 10 osd.6 pg_epoch: 4260 pg[5.11s0( v 4101'12884 (3893'7859,4101'12884] local-lis/les=4141/4142 n=77 ec=25/25 lis/c=4141/4141 les/c/f=4142/4142/0 sis=4141) [6,4,7,1]p6(0) r=0 lpr=4141 crt=4101'12884 lcod 4101'12883 mlcod 0'0 active+clean+scrubbing+deep [ 5.11s0: REQ_SCRUB ] ] scrubber <WaitDigestUpdate/>: m_pg->recovery_state.update_stats() errors:1/0 deep? 1
What I'm noticing is that we do not have auto-repair enabled during these tests:
4 lcod 4101'12883 mlcod 0'0 active+clean MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] verify_scrub_mode pg: 5.11s0 allow: 1/1 deep errs: 0 auto-repair: 0 (0)
A solution we can try is enabling auto-repair by setting `osd_scrub_auto_repair=true` in thrashosds-health.yaml. This config option automatically repairs up to 5 scrub errors found by scrubs or deep-scrubs (see https://github.com/ceph/ceph/blob/f5857acd68e2349dc10207cd4110b84225a3ea42/src/common/options/osd.yaml.in#L366-L384).
I am currently running some tests to reproduce the issue, and then I will try enabling auto-repair to see if this helps.
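For the record, the override under discussion would look something like the following in thrashosds-health.yaml (a sketch of the proposed change, not a committed patch):

```yaml
# Sketch of a proposed addition to qa/tasks/thrashosds-health.yaml:
# let scrub automatically repair up to osd_scrub_auto_repair_num_errors
# (default 5) errors found during scrub or deep-scrub.
overrides:
  ceph:
    conf:
      osd:
        osd scrub auto repair: true
```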
Updated by Kamoltat (Junior) Sirivadhna over 1 year ago
yuriw-2022-09-27_23:37:28-rados-wip-yuri2-testing-2022-09-27-1455-distro-default-smithi/7046253
Updated by Radoslaw Zarzynski over 1 year ago
Hi Laura. Any luck with verifying the hypothesis from comment #17?
Updated by Laura Flores 12 months ago
- Tags set to test-failure
Radoslaw Zarzynski wrote:
Hi Laura. Any luck with verifying the hypothesis from comment #17?
I ran this solution by Ronen, but he said auto-repair wouldn't help with deep scrub errors. I will revisit this.
/a/yuriw-2023-04-25_14:15:40-rados-pacific-release-distro-default-smithi/7251528
/a/yuriw-2023-04-26_01:16:19-rados-wip-yuri11-testing-2023-04-25-1605-pacific-distro-default-smithi/7253902
/a/yuriw-2023-05-10_14:47:51-rados-wip-yuri5-testing-2023-05-09-1324-pacific-distro-default-smithi/7269830
/a/yuriw-2023-06-06_18:42:04-rados-wip-yuri8-testing-2023-06-06-0830-reef-distro-default-smithi/7297252
Updated by Radoslaw Zarzynski 11 months ago
- Assignee changed from Laura Flores to Ronen Friedman
Updated by Laura Flores 9 months ago
/a/yuriw-2023-07-19_14:33:14-rados-wip-yuri11-testing-2023-07-18-0927-pacific-distro-default-smithi/7343461
Updated by Patrick Donnelly 8 months ago
/teuthology/pdonnell-2023-08-31_15:31:51-fs-wip-batrick-testing-20230831.124848-pacific-distro-default-smithi/7385689/teuthology.log
Updated by Nitzan Mordechai 16 days ago
/a/yuriw-2024-04-02_15:39:50-rados-wip-yuri2-testing-2024-04-01-1235-quincy-distro-default-smithi/7636628