Bug #36525 (closed): osd-scrub-snaps.sh failure

Added by Sage Weil over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: High
Assignee: David Zafman
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2018-10-18T01:38:14.300 INFO:tasks.workunit.client.0.smithi068.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1719: wait_for_scrub:  get_last_scrub_stamp 1.0 last_scrub_stamp
2018-10-18T01:38:14.300 INFO:tasks.workunit.client.0.smithi068.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1369: get_last_scrub_stamp:  local pgid=1.0
2018-10-18T01:38:14.300 INFO:tasks.workunit.client.0.smithi068.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1370: get_last_scrub_stamp:  local sname=last_scrub_stamp
2018-10-18T01:38:14.300 INFO:tasks.workunit.client.0.smithi068.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1371: get_last_scrub_stamp:  ceph --format json pg dump pgs
2018-10-18T01:38:14.300 INFO:tasks.workunit.client.0.smithi068.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1372: get_last_scrub_stamp:  jq -r '.pg_stats | .[] | select(.pgid=="1.0") | .last_scrub_stamp'
2018-10-18T01:38:14.703 INFO:tasks.workunit.client.0.smithi068.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1719: wait_for_scrub:  test '2018-10-18 01:38:11.121940' '>' '2018-10-18 01:37:39.284702'
2018-10-18T01:38:14.704 INFO:tasks.workunit.client.0.smithi068.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1720: wait_for_scrub:  return 0
2018-10-18T01:38:14.704 INFO:tasks.workunit.client.0.smithi068.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-snaps.sh:189: TEST_scrub_snaps:grep '_scan_snaps start' td/osd-scrub-snaps/osd.0.log
2018-10-18T01:38:14.704 INFO:tasks.workunit.client.0.smithi068.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-snaps.sh:189: TEST_scrub_snaps:wc -l
2018-10-18T01:38:14.707 INFO:tasks.workunit.client.0.smithi068.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-snaps.sh:189: TEST_scrub_snaps: test 4 = 2
2018-10-18T01:38:14.707 INFO:tasks.workunit.client.0.smithi068.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-snaps.sh:189: TEST_scrub_snaps: return 1
2018-10-18T01:38:14.707 INFO:tasks.workunit.client.0.smithi068.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-snaps.sh:38: run:  return 1
2018-10-18T01:38:14.708 INFO:tasks.workunit.client.0.smithi068.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1984: main:  code=1

/a/sage-2018-10-17_20:20:33-rados-nautilus-distro-basic-smithi/3154179
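
For reference, the check that fails at osd-scrub-snaps.sh:189 counts "_scan_snaps start" lines in the OSD log and compares the count against the expected value (2 for a single scrub). A minimal sketch of that style of check, using the log path from the trace above (variable names here are illustrative, not necessarily the script's actual ones):

# Count how many times the OSD logged the start of a snap scan.
# A single scrub of pg 1.0 is expected to produce 2 such lines; the failed
# run above counted 4 because the scrub effectively ran twice.
count=$(grep '_scan_snaps start' td/osd-scrub-snaps/osd.0.log | wc -l)
test "$count" = "2" || return 1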

#1 - Updated by Sage Weil over 5 years ago

/a/sage-2018-10-28_14:12:19-rados-master-distro-basic-smithi/3196520

Another instance on current master.

#2 - Updated by David Zafman over 5 years ago

  • Assignee set to David Zafman

Looking at the log, another scrub has increased the number of "_scan_snaps start" lines in the log from 2 to 4, which results in 2 identical runs. Unless the second scrub is itself wrong, this is purely a test issue.

#3 - Updated by David Zafman over 5 years ago

  • Subject changed from osd-scrub-snaps.sh faiulre to osd-scrub-snaps.sh failure
  • Priority changed from Urgent to High

#4 - Updated by Kefu Chai over 5 years ago

/a/kchai-2018-11-27_11:44:27-rados-wip-kefu2-testing-2018-11-27-1724-distro-basic-smithi/3285226/teuthology.log

#5 - Updated by David Zafman over 5 years ago

The test script issues a single "ceph pg scrub 1.0". The OSD log below shows the duplicated scrub command:

2018-12-21 00:14:53.446 7f051af6f700  5 -- 127.0.0.1:6800/188026 >> 127.0.0.1:52512/2837688723 conn(0x563821cb6800 legacy :6800 s=READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=1). rx client.4310 seq 1 0x563821c36a00 command(tid 0: {"prefix": "pg scrub", "pgid": "1.0", "target": ["mgr", ""]}) v1
2018-12-21 00:14:53.446 7f05036a8700  1 -- 127.0.0.1:6800/188026 <== client.4310 127.0.0.1:52512/2837688723 1 ==== command(tid 0: {"prefix": "pg scrub", "pgid": "1.0", "target": ["mgr", ""]}) v1 ==== 84+0+0 (425507551 0 0) 0x563821c36a00 con 0x563821cb6800
...
2018-12-21 00:14:53.568 7f051af6f700  5 -- 127.0.0.1:6800/188026 >> 127.0.0.1:52512/2837688723 conn(0x563821cb6800 legacy :6800 s=READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=1). rx client.4310 seq 2 0x56381f37a000 command(tid 1: {"prefix": "pg scrub", "pgid": "1.0", "target": ["mgr", ""]}) v1
2018-12-21 00:14:53.568 7f05036a8700  1 -- 127.0.0.1:6800/188026 <== client.4310 127.0.0.1:52512/2837688723 2 ==== command(tid 1: {"prefix": "pg scrub", "pgid": "1.0", "target": ["mgr", ""]}) v1 ==== 84+0+0 (425507551 0 0) 0x56381f37a000 con 0x563821cb6800

I got the log from a successful run, and there the scrub command isn't duplicated:

2018-12-20 17:41:44.333 7f23d7a8d700  5 -- 127.0.0.1:6804/40560 >> 127.0.0.1:48144/562174776 conn(0x5558619cc400 legacy :6804 s=READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=1). rx client.4308 seq 1 0x555861851600 command(tid 0: {"prefix": "pg scrub", "pgid": "1.0", "target": ["mgr", ""]}) v1
2018-12-20 17:41:44.333 7f23c6d59700  1 -- 127.0.0.1:6804/40560 <== client.4308 127.0.0.1:48144/562174776 1 ==== command(tid 0: {"prefix": "pg scrub", "pgid": "1.0", "target": ["mgr", ""]}) v1 ==== 84+0+0 (425507551 0 0) 0x555861851600 con 0x5558619cc400
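
A quick way to confirm the duplication from a captured log (a sketch, assuming these messenger lines end up in td/osd-scrub-snaps/osd.0.log as in the description) is to list the distinct tids of the incoming scrub command; the failed run shows tid 0 and tid 1, the good run only tid 0:

# List the distinct client command tids carrying "pg scrub" for pg 1.0.
grep '"prefix": "pg scrub", "pgid": "1.0"' td/osd-scrub-snaps/osd.0.log | grep -o 'tid [0-9]*' | sort -u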

#6 - Updated by David Zafman over 5 years ago

Reproduced in 1 of 20 jobs after running:

teuthology-suite --machine-type smithi --suite rados --ceph wip-zafman-testing --filter scrub.yaml --num 20

/a/dzafman-2018-12-20_15:18:07-rados-wip-zafman-testing-distro-basic-smithi/3385260

#7 - Updated by Kefu Chai over 5 years ago

I am not able to reproduce this issue locally after running "../qa/run-standalone.sh osd-scrub-snaps.sh TEST_scrub_snaps" 21 times.
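
For what it's worth, the repetition can be scripted; a minimal sketch of running the standalone test in a loop and stopping at the first failure (assuming it is invoked from the build directory, matching the "../qa/..." relative path used above):

# Repeat the standalone test until it fails (or 21 runs complete).
for i in $(seq 1 21); do
    echo "=== run $i ==="
    ../qa/run-standalone.sh osd-scrub-snaps.sh TEST_scrub_snaps || { echo "failed on run $i"; break; }
done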

#8 - Updated by David Zafman over 5 years ago

I see that qa/tasks/workunit.py does set run.Raw('CEPH_CLI_TEST_DUP_COMMAND=1'). The interesting thing is that there is a race even when this is set. I added export CEPH_CLI_TEST_DUP_COMMAND=1 to qa/standalone/scrub/osd-scrub-snaps.sh in the test case that can reproduce the problem, and it doesn't immediately fail. It seems like the second scrub request can be ignored if it comes in while the first is still pending or in progress.

When the test passes, the 2 scrub requests come in so quickly that the second is in effect ignored:

2018-12-21 08:42:24.865 7f5bfb7bd700 10 osd.0 pg_epoch: 19 pg[1.0( v 15'56 (0'0,15'56] local-lis/les=18/19 n=36 ec=6/6 lis/c 18/18 les/c/f 19/19/0 18/18/18) [0] r=0 lpr=18 crt=15'56 lcod 0'0 mlcod 0'0 active+clean MUST_SCRUB] reg_next_scrub pg 1.0 register next scrub, scrub time 0.000000, must = 1
2018-12-21 08:42:24.957 7f5bfb7bd700 10 osd.0 pg_epoch: 19 pg[1.0( v 15'56 (0'0,15'56] local-lis/les=18/19 n=36 ec=6/6 lis/c 18/18 les/c/f 19/19/0 18/18/18) [0] r=0 lpr=18 crt=15'56 lcod 0'0 mlcod 0'0 active+clean MUST_SCRUB] reg_next_scrub pg 1.0 register next scrub, scrub time 0.000000, must = 1

In the failure, it looks like either there is enough delay between the two requests, or the first scrub completes so quickly, that a second scrub is initiated:

2018-12-21 00:14:53.446 7f0ce6bd5700 10 osd.0 pg_epoch: 19 pg[1.0( v 15'56 (0'0,15'56] local-lis/les=18/19 n=36 ec=6/6 lis/c 18/18 les/c/f 19/19/0 18/18/18) [0] r=0 lpr=18 crt=15'56 lcod 0'0 mlcod 0'0 active+clean MUST_SCRUB] reg_next_scrub pg 1.0 register next scrub, scrub time 0.000000, must = 1
2018-12-21 00:14:53.544 7f0ce6bd5700 10 osd.0 pg_epoch: 19 pg[1.0( v 15'56 (0'0,15'56] local-lis/les=18/19 n=36 ec=6/6 lis/c 18/18 les/c/f 19/19/0 18/18/18) [0] r=0 lpr=18 crt=15'56 lcod 0'0 mlcod 0'0 active+clean+scrubbing] reg_next_scrub pg 1.0 register next scrub, scrub time 2018-12-22 02:40:49.196625, must = 0
2018-12-21 00:14:53.568 7f0ce6bd5700 10 osd.0 pg_epoch: 19 pg[1.0( v 15'56 (0'0,15'56] local-lis/les=18/19 n=36 ec=6/6 lis/c 18/18 les/c/f 19/19/0 18/18/18) [0] r=0 lpr=18 crt=15'56 lcod 0'0 mlcod 0'0 active+clean+inconsistent MUST_SCRUB] reg_next_scrub pg 1.0 register next scrub, scrub time 0.000000, must = 1
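
Given that race, one possible test-side mitigation (purely an illustrative sketch, not necessarily the fix that was actually merged) would be to tolerate the doubled count when the duplicated command triggers a second scrub:

# Hypothetical relaxation of the check at osd-scrub-snaps.sh:189: accept the
# doubled "_scan_snaps start" count that a duplicated "pg scrub" command
# (and hence a second scrub) can produce.
count=$(grep '_scan_snaps start' td/osd-scrub-snaps/osd.0.log | wc -l)
case "$count" in
    2|4) ;;            # one scrub, or the benign duplicated scrub
    *)   return 1 ;;   # anything else is still a failure
esac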

#9 - Updated by David Zafman over 5 years ago

  • Status changed from 12 to In Progress

#10 - Updated by David Zafman over 5 years ago

  • Status changed from In Progress to Fix Under Review
  • Source set to Development

#11 - Updated by Kefu Chai over 5 years ago

  • Status changed from Fix Under Review to Resolved