Bug #51641
qa/standalone/scrub/osd-scrub-test.sh: TEST_scrub_test() failed.
Description
/a/sseshasa-2021-07-13_04:55:53-rados:standalone-wip-sseshasa-testing-2021-07-12-1850-distro-basic-smithi/6267736
It looks like the scrub repair did not clear the "inconsistent" state on PG 1.0.
2021-07-13T07:38:03.503 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:92: TEST_scrub_test: repair 1.0 ...
2021-07-13T07:38:11.878 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1636: wait_for_clean: test 1 = 1
2021-07-13T07:38:11.878 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1636: wait_for_clean: break
2021-07-13T07:38:11.878 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1651: wait_for_clean: return 0
2021-07-13T07:38:11.879 INFO:tasks.workunit.client.0.smithi193.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:96: TEST_scrub_test: ceph pg 1.0 query
2021-07-13T07:38:11.879 INFO:tasks.workunit.client.0.smithi193.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:96: TEST_scrub_test: jq '.peer_info[0].stats.stat_sum.num_scrub_errors'
2021-07-13T07:38:12.003 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:96: TEST_scrub_test: test 2 = 2
2021-07-13T07:38:12.004 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:97: TEST_scrub_test: ceph pg dump pgs
2021-07-13T07:38:12.005 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:97: TEST_scrub_test: grep -vq -- +inconsistent
2021-07-13T07:38:12.005 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:97: TEST_scrub_test: grep '^1.0'
2021-07-13T07:38:12.256 INFO:tasks.workunit.client.0.smithi193.stderr:dumped pgs
2021-07-13T07:38:12.269 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:97: TEST_scrub_test: return 1
2021-07-13T07:38:12.269 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:31: run: return 1
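For reference, the check that fails at osd-scrub-test.sh:97 can be reconstructed from the xtrace output above (this is a sketch, not the verbatim test source): the PG 1.0 row of "ceph pg dump pgs" must no longer carry the +inconsistent flag.

```shell
# Sketch of the check at osd-scrub-test.sh:97, reconstructed from the xtrace
# above. The real test pipes live 'ceph pg dump pgs' output; a canned row
# (hypothetical column values) reproduces the failing case here.
pg_dump='1.0    15    0    0    0    15480    active+clean+inconsistent'

# grep '^1.0' selects the PG 1.0 row; grep -vq -- +inconsistent then succeeds
# only if that row does NOT contain the flag.
if echo "$pg_dump" | grep '^1.0' | grep -vq -- +inconsistent; then
    echo "PASS"
else
    echo "FAIL"    # the branch hit in the run above (return 1)
fi
```

Because the row still contains +inconsistent, the pipeline exits non-zero and the test returns 1, exactly as in the log.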
Updated by Sridhar Seshasayee almost 3 years ago
I looked into the OSD logs for more information. The repair was initiated on osd.1.
osd.1 received the request for a shallow repair:
2021-07-13T07:38:04.002+0000 7f49c922e700 10 osd.1 31 handle_fast_scrub scrub2([1.0] repair) v1
2021-07-13T07:38:04.002+0000 7f49c922e700 15 osd.1 31 enqueue_peering_evt 1.0 epoch_sent: 31 epoch_requested: 31 RequestScrub(shallow repair)
2021-07-13T07:38:04.002+0000 7f49c922e700 20 osd.1 op_wq(0) _enqueue OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 31 epoch_requested: 31 RequestScrub(shallow repair)) prio 255 cost 10 e31)
2021-07-13T07:38:04.002+0000 7f49af773700 20 osd.1 op_wq(0) _process 1.0 to_process <> waiting <> waiting_peering {}
2021-07-13T07:38:04.002+0000 7f49af773700 20 osd.1 op_wq(0) _process OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 31 epoch_requested: 31 RequestScrub(shallow repair)) prio 255 cost 10 e31) queued
2021-07-13T07:38:04.002+0000 7f49af773700 20 osd.1 op_wq(0) _process 1.0 to_process <OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 31 epoch_requested: 31 RequestScrub(shallow repair)) prio 255 cost 10 e31)> waiting <> waiting_peering {}
2021-07-13T07:38:04.002+0000 7f49af773700 20 osd.1 op_wq(0) _process OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 31 epoch_requested: 31 RequestScrub(shallow repair)) prio 255 cost 10 e31) pg 0x55bc87f46000
2021-07-13T07:38:04.002+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean] do_peering_event: epoch_sent: 31 epoch_requested: 31 RequestScrub(shallow repair)
2021-07-13T07:38:04.002+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean] scrubber pg(1.0) scrub_requested shallow repair-scrub prev stamp: 2021-07-14T11:32:42.609778+0000 1
The repair completed, and PG 1.0 moved to the active+clean state, as shown below:
2021-07-13T07:38:08.798+0000 7f49af773700 20 osd.1 op_wq(0) _process OpSchedulerItem(1.0 PGScrubMapsCompared(pgid=1.0epoch_queued=31) prio 120 cost 52428800 e31) pg 0x55bc87f46000
2021-07-13T07:38:08.798+0000 7f49af773700 20 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0: CHECK_REPAIR REQ_SCRUB ] planned CHECK_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] forward_scrub_event: MapsCompared queued at: 31
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0: CHECK_REPAIR REQ_SCRUB ] planned CHECK_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber pg(1.0) scrubber event -->> send_maps_compared epoch: 31
2021-07-13T07:38:08.798+0000 7f49af773700 20 scrubberFSM state: Scrub::WaitReplicas
2021-07-13T07:38:08.798+0000 7f49af773700 20 scrubberFSM event: --vvvv---- MapsCompared
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0: CHECK_REPAIR REQ_SCRUB ] planned CHECK_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubberFSM pg(1.0) -- state -->> Act/WaitDigestUpdate
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0: CHECK_REPAIR REQ_SCRUB ] planned CHECK_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubberFSM pg(1.0) WaitDigestUpdate::react(const DigestUpdate&)
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0: CHECK_REPAIR REQ_SCRUB ] planned CHECK_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber pg(1.0) on_digest_updates #pending: 0 pending? 0 <last chunk>
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0: CHECK_REPAIR REQ_SCRUB ] planned CHECK_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber pg(1.0) scrub_finish before flags: CHECK_REPAIR REQ_SCRUB. repair state: no-repair. deep_scrub_on_error: 0
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0: CHECK_REPAIR REQ_SCRUB ] ] PrimaryLog scrubber pg(1.0) _scrub_finish info stats: valid
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0: CHECK_REPAIR REQ_SCRUB ] ] PrimaryLog scrubber pg(1.0) deep-scrub got 15/15 objects, 0/0 clones, 15/15 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 15480/15480 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0: CHECK_REPAIR REQ_SCRUB ] ] scrubber pg(1.0) scrub_process_inconsistent: checking authoritative (mode=deep-scrub, auth remaining #: 0)
2021-07-13T07:38:08.798+0000 7f49af773700 0 log_channel(cluster) log [DBG] : 1.0 deep-scrub ok
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0: CHECK_REPAIR REQ_SCRUB ] ] scrubber pg(1.0) m_pg->recovery_state.update_stats()
2021-07-13T07:38:08.798+0000 7f49af773700 20 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0: REQ_SCRUB ] ] prepare_stats_for_publish reporting purged_snaps []
2021-07-13T07:38:08.798+0000 7f49af773700 15 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0: REQ_SCRUB ] ] publish_stats_to_osd 31:75
2021-07-13T07:38:08.798+0000 7f49af773700 20 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0: REQ_SCRUB ] ] on_info_history_change
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0: REQ_SCRUB ] ] scrubber pg(1.0) unreg_next_scrub existing-2021-07-14T09:06:06.589761+0000. was registered? 1
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0: REQ_SCRUB ] ] scrubber pg(1.0) reg_next_scrub planned: must? 0 need-auto? 0 stamp: 2021-07-13T07:38:08.803965+0000
2021-07-13T07:38:08.798+0000 7f49af773700 15 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0: REQ_SCRUB ] ] scrubber pg(1.0) reg_next_scrub pg(1.0) must: 0 required:1 flags: stamp: 2021-07-13T07:38:08.803965+0000
2021-07-13T07:38:08.798+0000 7f49af773700 15 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0: REQ_SCRUB ] ] scrubber pg(1.0) reg_next_scrub pg(1.0) register next scrub, scrub time 2021-07-14T12:00:06.725959+0000, must = 0
2021-07-13T07:38:08.798+0000 7f49af773700 10 log is not dirty
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0: REQ_SCRUB ] ] scrubber pg(1.0) update_op_mode_text: repair: visible: false, internal: false. Displayed: deep-scrub
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0: REQ_SCRUB ] ] scrubber pg(1.0) cleanup_on_finish
2021-07-13T07:38:08.798+0000 7f49af773700 20 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean] prepare_stats_for_publish reporting purged_snaps []
A few seconds later, the command that dumps the PGs and checks their state failed. From the logs, the state of the PG on osd.0 remained active+inconsistent. Why this is the case needs more investigation.
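One way to see which replica still reports scrub errors is the test's own approach at line 96: query the PG and inspect the peer stats. A sketch, using a canned JSON fragment in place of the live query output (the counter value 2 matches the "test 2 = 2" check in the harness log above; the grep extraction stands in for the test's jq filter):

```shell
# On a live cluster the equivalent would be roughly:
#   ceph pg 1.0 query | jq '.peer_info[].stats.stat_sum.num_scrub_errors'
# Here a canned fragment (hypothetical shape, illustrative peers) keeps the
# sketch self-contained.
query='{"peer_info":[{"stats":{"stat_sum":{"num_scrub_errors":2}}},{"stats":{"stat_sum":{"num_scrub_errors":0}}}]}'

# Extract every per-peer scrub-error counter from the JSON text.
echo "$query" | grep -o '"num_scrub_errors":[0-9]*'
```

A nonzero counter for a peer would point at the replica whose stats were not updated after the repair.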
Updated by Ronen Friedman almost 3 years ago
At first glance:
- on osd.1, PG 1.0's status was updated to 'clean' at 07:38:08.8
- AFAIK the update might take up to 5 seconds to register with the monitors, which might explain the failed query some 3 seconds later.
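If delayed stats propagation is indeed the cause, the usual remedy in the standalone tests is to poll instead of checking a single snapshot. A minimal sketch (the helper name and the pg_dump_pgs stand-in for 'ceph pg dump pgs' are hypothetical, not part of ceph-helpers.sh):

```shell
# Hypothetical retry helper: poll until the +inconsistent flag clears from the
# PG's row, instead of asserting on one snapshot. 'pg_dump_pgs' stands in for
# 'ceph pg dump pgs' so the sketch is self-contained.
wait_for_not_inconsistent() {
    pgid=$1
    tries=${2:-10}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Same pipeline as osd-scrub-test.sh:97: select the PG's row, succeed
        # only if it no longer carries +inconsistent.
        if pg_dump_pgs | grep "^${pgid}" | grep -vq -- +inconsistent; then
            return 0    # flag cleared (or was never set)
        fi
        sleep 1         # give the stats update time to reach the monitors
        i=$((i + 1))
    done
    return 1            # still inconsistent after the timeout
}
```

With such a helper, the check would tolerate the few-second window between the scrubber clearing the flag on osd.1 and the monitors reflecting it.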