Bug #51641

open

qa/standalone/scrub/osd-scrub-test.sh: TEST_scrub_test() failed.

Added by Sridhar Seshasayee almost 3 years ago. Updated almost 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/sseshasa-2021-07-13_04:55:53-rados:standalone-wip-sseshasa-testing-2021-07-12-1850-distro-basic-smithi/6267736

It looks like a scrub repair didn't result in the removal of the "inconsistent" state on PG 1.0.

2021-07-13T07:38:03.503 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:92: TEST_scrub_test:  repair 1.0

...

2021-07-13T07:38:11.878 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1636: wait_for_clean:  test 1 = 1
2021-07-13T07:38:11.878 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1636: wait_for_clean:  break
2021-07-13T07:38:11.878 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1651: wait_for_clean:  return 0
2021-07-13T07:38:11.879 INFO:tasks.workunit.client.0.smithi193.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:96: TEST_scrub_test:  ceph pg 1.0 query
2021-07-13T07:38:11.879 INFO:tasks.workunit.client.0.smithi193.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:96: TEST_scrub_test:  jq '.peer_info[0].stats.stat_sum.num_scrub_errors'
2021-07-13T07:38:12.003 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:96: TEST_scrub_test:  test 2 = 2
2021-07-13T07:38:12.004 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:97: TEST_scrub_test:  ceph pg dump pgs
2021-07-13T07:38:12.005 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:97: TEST_scrub_test:  grep -vq -- +inconsistent
2021-07-13T07:38:12.005 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:97: TEST_scrub_test:  grep '^1.0'
2021-07-13T07:38:12.256 INFO:tasks.workunit.client.0.smithi193.stderr:dumped pgs
2021-07-13T07:38:12.269 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:97: TEST_scrub_test:  return 1
2021-07-13T07:38:12.269 INFO:tasks.workunit.client.0.smithi193.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-test.sh:31: run:  return 1
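
For reference, a minimal sketch of what the failing check does, reconstructed from the trace above (this is not the literal osd-scrub-test.sh code; the repair() helper is assumed to wrap "ceph pg repair", and the snippet is meant to run inside the TEST_scrub_test() function):

# Reconstructed from the trace above, not the literal test code.
pgid=1.0

# Script line 92: request a repair of the PG (assumed to be "ceph pg repair"
# under the repair() helper), then wait for the PG to become clean.
ceph pg repair $pgid
# ... wait_for_clean ...

# Script line 96: the scrub-error count reported by the first peer is still
# expected to be 2 at this point.
test "$(ceph pg $pgid query | jq '.peer_info[0].stats.stat_sum.num_scrub_errors')" = "2" || return 1

# Script line 97: the failing check - after the repair, the PG's row in
# "ceph pg dump pgs" must no longer carry the +inconsistent flag.
# grep -vq exits non-zero when the selected line still matches +inconsistent,
# which is exactly what happened here.
ceph pg dump pgs | grep "^${pgid}" | grep -vq -- +inconsistent || return 1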
Actions #1

Updated by Sridhar Seshasayee almost 3 years ago

I looked into the osd logs for more info. The repair was initiated on osd.1.

osd.1 got the request for shallow repair:

2021-07-13T07:38:04.002+0000 7f49c922e700 10 osd.1 31 handle_fast_scrub scrub2([1.0] repair) v1
2021-07-13T07:38:04.002+0000 7f49c922e700 15 osd.1 31 enqueue_peering_evt 1.0 epoch_sent: 31 epoch_requested: 31 RequestScrub(shallow repair)
2021-07-13T07:38:04.002+0000 7f49c922e700 20 osd.1 op_wq(0) _enqueue OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 31 epoch_requested: 31 RequestScrub(shallow repair)) prio 255 cost 10 e31)
2021-07-13T07:38:04.002+0000 7f49af773700 20 osd.1 op_wq(0) _process 1.0 to_process <> waiting <> waiting_peering {}
2021-07-13T07:38:04.002+0000 7f49af773700 20 osd.1 op_wq(0) _process OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 31 epoch_requested: 31 RequestScrub(shallow repair)) prio 255 cost 10 e31) queued
2021-07-13T07:38:04.002+0000 7f49af773700 20 osd.1 op_wq(0) _process 1.0 to_process <OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 31 epoch_requested: 31 RequestScrub(shallow repair)) prio 255 cost 10 e31)> waiting <> waiting_peering {}
2021-07-13T07:38:04.002+0000 7f49af773700 20 osd.1 op_wq(0) _process OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 31 epoch_requested: 31 RequestScrub(shallow repair)) prio 255 cost 10 e31) pg 0x55bc87f46000
2021-07-13T07:38:04.002+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean] do_peering_event: epoch_sent: 31 epoch_requested: 31 RequestScrub(shallow repair)
2021-07-13T07:38:04.002+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean]  scrubber pg(1.0) scrub_requested shallow  repair-scrub  prev stamp: 2021-07-14T11:32:42.609778+0000 1

The repair completed and PG 1.0 moved to the active+clean state, as shown below:

2021-07-13T07:38:08.798+0000 7f49af773700 20 osd.1 op_wq(0) _process OpSchedulerItem(1.0 PGScrubMapsCompared(pgid=1.0epoch_queued=31) prio 120 cost 52428800 e31) pg 0x55bc87f46000
2021-07-13T07:38:08.798+0000 7f49af773700 20 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0:  CHECK_REPAIR REQ_SCRUB ]  planned CHECK_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] forward_scrub_event: MapsCompared queued at: 31
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0:  CHECK_REPAIR REQ_SCRUB ]  planned CHECK_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB]  scrubber pg(1.0) scrubber event -->> send_maps_compared epoch: 31
2021-07-13T07:38:08.798+0000 7f49af773700 20  scrubberFSM  state: Scrub::WaitReplicas
2021-07-13T07:38:08.798+0000 7f49af773700 20  scrubberFSM  event: --vvvv---- MapsCompared
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0:  CHECK_REPAIR REQ_SCRUB ]  planned CHECK_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB]  scrubberFSM pg(1.0) -- state -->> Act/WaitDigestUpdate
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0:  CHECK_REPAIR REQ_SCRUB ]  planned CHECK_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB]  scrubberFSM pg(1.0) WaitDigestUpdate::react(const DigestUpdate&)
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0:  CHECK_REPAIR REQ_SCRUB ]  planned CHECK_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB]  scrubber pg(1.0) on_digest_updates #pending: 0 pending? 0 <last chunk>
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0:  CHECK_REPAIR REQ_SCRUB ]  planned CHECK_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB]  scrubber pg(1.0) scrub_finish before flags:  CHECK_REPAIR REQ_SCRUB. repair state: no-repair. deep_scrub_on_error: 0
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0:  CHECK_REPAIR REQ_SCRUB ] ]  PrimaryLog scrubber pg(1.0) _scrub_finish info stats: valid
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0:  CHECK_REPAIR REQ_SCRUB ] ]  PrimaryLog scrubber pg(1.0) deep-scrub got 15/15 objects, 0/0 clones, 15/15 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 15480/15480 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0:  CHECK_REPAIR REQ_SCRUB ] ]  scrubber pg(1.0) scrub_process_inconsistent: checking authoritative (mode=deep-scrub, auth remaining #: 0)
2021-07-13T07:38:08.798+0000 7f49af773700  0 log_channel(cluster) log [DBG] : 1.0 deep-scrub ok
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep+inconsistent [ 1.0:  CHECK_REPAIR REQ_SCRUB ] ]  scrubber pg(1.0) m_pg->recovery_state.update_stats()
2021-07-13T07:38:08.798+0000 7f49af773700 20 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0:  REQ_SCRUB ] ] prepare_stats_for_publish reporting purged_snaps []
2021-07-13T07:38:08.798+0000 7f49af773700 15 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0:  REQ_SCRUB ] ] publish_stats_to_osd 31:75
2021-07-13T07:38:08.798+0000 7f49af773700 20 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0:  REQ_SCRUB ] ] on_info_history_change
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0:  REQ_SCRUB ] ]  scrubber pg(1.0) unreg_next_scrub existing-2021-07-14T09:06:06.589761+0000. was registered? 1
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0:  REQ_SCRUB ] ]  scrubber pg(1.0) reg_next_scrub planned: must? 0 need-auto? 0 stamp: 2021-07-13T07:38:08.803965+0000
2021-07-13T07:38:08.798+0000 7f49af773700 15 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0:  REQ_SCRUB ] ]  scrubber pg(1.0) reg_next_scrub pg(1.0) must: 0 required:1 flags:  stamp: 2021-07-13T07:38:08.803965+0000
2021-07-13T07:38:08.798+0000 7f49af773700 15 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0:  REQ_SCRUB ] ]  scrubber pg(1.0) reg_next_scrub pg(1.0) register next scrub, scrub time 2021-07-14T12:00:06.725959+0000, must = 0
2021-07-13T07:38:08.798+0000 7f49af773700 10 log is not dirty
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0:  REQ_SCRUB ] ]  scrubber pg(1.0) update_op_mode_text: repair: visible: false, internal: false. Displayed: deep-scrub
2021-07-13T07:38:08.798+0000 7f49af773700 10 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean+scrubbing+deep [ 1.0:  REQ_SCRUB ] ]  scrubber pg(1.0) cleanup_on_finish
2021-07-13T07:38:08.798+0000 7f49af773700 20 osd.1 pg_epoch: 31 pg[1.0( v 18'15 (0'0,18'15] local-lis/les=30/31 n=15 ec=16/16 lis/c=30/30 les/c/f=31/31/0 sis=30) [1,0,2] r=0 lpr=30 crt=18'15 lcod 18'14 mlcod 0'0 active+clean] prepare_stats_for_publish reporting purged_snaps []

A few seconds later, the command to dump PGs and check the state failed. From the logs, the PG state reported by osd.0 remained active+inconsistent. More investigation is needed into why this is the case.
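
One way to dig further would be to compare the primary's view of the PG state with what each peer reports in its stats, for example (a sketch only; the field names follow the same pg query JSON the test already parses):

# Compare the primary's reported state with the per-peer state in its stats.
ceph pg 1.0 query | \
    jq '{primary_state: .state,
         peer_states: [.peer_info[] | {peer: .peer, state: .stats.state}]}'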

Actions #2

Updated by Ronen Friedman almost 3 years ago

At first glance:
- on OSD.1, PG 1.0 status was updated to 'clean' at 07:38:08.8
- AFAIK the update might take up to 5 seconds to register with the monitors, which might explain the failed query some 3 seconds later.
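
If that propagation delay is indeed the cause, a test-side fix would be to poll for the flag to clear instead of checking once right after wait_for_clean, along these lines (a sketch only; wait_for_not_inconsistent is a hypothetical helper and the retry count is arbitrary):

# Poll "ceph pg dump pgs" until the PG's row no longer shows +inconsistent,
# instead of asserting a single time right after wait_for_clean.
wait_for_not_inconsistent() {
    local pgid=$1
    local tries=10
    while ((tries-- > 0)); do
        if ceph pg dump pgs | grep "^${pgid}" | grep -vq -- +inconsistent; then
            return 0    # the +inconsistent flag has cleared
        fi
        sleep 1         # allow time for the OSD's updated stats to reach the mon
    done
    return 1
}

wait_for_not_inconsistent 1.0 || return 1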
