Bug #47930
scrub/osd-recovery-scrub.sh: TEST_recovery_scrub: wait_background: return 1
Description
2020-10-21T10:17:11.200 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2046: wait_background: return 1 2020-10-21T10:17:11.200 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-recovery-scrub.sh:73: TEST_recovery_scrub: return_code=1 2020-10-21T10:17:11.200 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-recovery-scrub.sh:74: TEST_recovery_scrub: '[' 1 -ne 0 ']' 2020-10-21T10:17:11.200 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-recovery-scrub.sh:74: TEST_recovery_scrub: return 1 2020-10-21T10:17:11.200 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-recovery-scrub.sh:31: run: return 1 2020-10-21T10:17:11.201 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2153: main: code=1 2020-10-21T10:17:11.201 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2155: main: teardown td/osd-recovery-scrub 1 2020-10-21T10:17:11.201 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:164: teardown: local dir=td/osd-recovery-scrub 2020-10-21T10:17:11.201 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:165: teardown: local dumplogs=1 2020-10-21T10:17:11.201 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:166: teardown: kill_daemons td/osd-recovery-scrub KILL 2020-10-21T10:17:11.201 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:345: kill_daemons: shopt -q -o xtrace 2020-10-21T10:17:11.201 
INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:345: kill_daemons: echo true 2020-10-21T10:17:11.202 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:345: kill_daemons: local trace=true 2020-10-21T10:17:11.202 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:346: kill_daemons: true 2020-10-21T10:17:11.202 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:346: kill_daemons: shopt -u -o xtrace 2020-10-21T10:17:11.305 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:362: kill_daemons: return 0 2020-10-21T10:17:11.306 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:167: teardown: uname 2020-10-21T10:17:11.306 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:167: teardown: '[' Linux '!=' FreeBSD ']' 2020-10-21T10:17:11.307 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:168: teardown: stat -f -c %T . 
2020-10-21T10:17:11.308 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:168: teardown: '[' ext2/ext3 == btrfs ']' 2020-10-21T10:17:11.308 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:171: teardown: local cores=no 2020-10-21T10:17:11.308 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:172: teardown: sysctl -n kernel.core_pattern 2020-10-21T10:17:11.309 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:172: teardown: local pattern=/home/ubuntu/cephtest/archive/coredump/%t.%p.core 2020-10-21T10:17:11.309 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:174: teardown: '[' / = '|' ']' 2020-10-21T10:17:11.310 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:180: teardown: grep -q '^core\|core$' 2020-10-21T10:17:11.310 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:180: teardown: dirname /home/ubuntu/cephtest/archive/coredump/%t.%p.core 2020-10-21T10:17:11.310 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:180: teardown: ls /home/ubuntu/cephtest/archive/coredump 2020-10-21T10:17:11.311 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:189: teardown: '[' no = yes -o 1 = 1 ']' 2020-10-21T10:17:11.311 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:190: teardown: '[' -n '' ']' 2020-10-21T10:17:11.312 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:194: teardown: mkdir -p 
/home/ubuntu/cephtest/archive/log 2020-10-21T10:17:11.312 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:195: teardown: mv td/osd-recovery-scrub/mgr.x.log td/osd-recovery-scrub/mon.a.log td/osd-recovery-scrub/osd.0.log td/osd-recovery-scrub/osd.1.log td/osd-recovery-scrub/osd.2.log td/osd-recovery-scrub/osd.3.log td/osd-recovery-scrub/osd.4.log td/osd-recovery-scrub/osd.5.log td/osd-recovery-scrub/osd.6.log td/osd-recovery-scrub/osd.7.log /home/ubuntu/cephtest/archive/log 2020-10-21T10:17:11.313 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:198: teardown: rm -fr td/osd-recovery-scrub 2020-10-21T10:17:11.320 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:199: teardown: get_asok_dir 2020-10-21T10:17:11.320 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:108: get_asok_dir: '[' -n '' ']' 2020-10-21T10:17:11.321 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:111: get_asok_dir: echo /tmp/ceph-asok.27344 2020-10-21T10:17:11.321 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:199: teardown: rm -rf /tmp/ceph-asok.27344 2020-10-21T10:17:11.322 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:200: teardown: '[' no = yes ']' 2020-10-21T10:17:11.322 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:207: teardown: return 0 2020-10-21T10:17:11.322 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2156: main: return 1 2020-10-21T10:17:11.323 DEBUG:teuthology.orchestra.run:got remote process result: 1 
2020-10-21T10:17:11.323 INFO:tasks.workunit:Stopping ['scrub'] on client.0...
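The tail of the trace above is just the standard ceph-helpers.sh teardown: because the test failed (dumplogs=1), the per-daemon logs are archived before the test directory is removed. A condensed, self-contained sketch of that flow (the function name teardown_sketch and the argument layout are illustrative, not the real helper):

```shell
#!/bin/bash
# Condensed sketch of the teardown flow seen in the trace: when the test
# failed (dumplogs=1), daemon logs are moved to an archive dir before the
# test dir is removed. 'teardown_sketch' is an illustrative name, not the
# real ceph-helpers.sh function.
teardown_sketch() {
    local dir=$1 dumplogs=$2 archive=$3
    if [ "$dumplogs" = 1 ]; then
        mkdir -p "$archive/log"
        mv "$dir"/*.log "$archive/log" 2>/dev/null
    fi
    rm -fr "$dir"
}
```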
/a/teuthology-2020-10-21_07:01:02-rados-master-distro-basic-smithi/5544900
History
#1 Updated by David Zafman over 3 years ago
Before the scrubs were started in the background, not all PGs were in recovery yet. But in this case the scrubs, probably blocked by recovery, somehow didn't finish in time.
In the failure case, recovery isn't starting properly:
2020-10-21T10:10:31.807 INFO:tasks.workunit.client.0.smithi186.stdout:PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN 2020-10-21T10:10:31.807 INFO:tasks.workunit.client.0.smithi186.stdout:1.1f 1 0 3 0 0 52428800 0 0 13 13 activating 2020-10-21T10:10:30.277259+0000 49'13 53:31 [5,0,2,1] 5 [5,0,2,1] 5 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.808 INFO:tasks.workunit.client.0.smithi186.stdout:1.1e 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.999906+0000 0'0 48:11 [3] 3 [3] 3 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.808 INFO:tasks.workunit.client.0.smithi186.stdout:1.1d 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.579584+0000 0'0 48:11 [1] 1 [1] 1 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.808 INFO:tasks.workunit.client.0.smithi186.stdout:1.1c 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.165020+0000 0'0 46:11 [0] 0 [0] 0 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.808 INFO:tasks.workunit.client.0.smithi186.stdout:1.1b 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.165451+0000 0'0 46:11 [0] 0 [0] 0 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.808 INFO:tasks.workunit.client.0.smithi186.stdout:1.1a 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.578834+0000 0'0 48:11 [1] 1 [1] 1 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.808 INFO:tasks.workunit.client.0.smithi186.stdout:1.19 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.864049+0000 0'0 48:11 [3] 3 [3] 3 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 
2020-10-21T10:10:31.808 INFO:tasks.workunit.client.0.smithi186.stdout:1.18 0 0 0 0 0 0 0 0 0 0 activating 2020-10-21T10:10:30.294275+0000 0'0 53:18 [5,2,3,6] 5 [5,2,3,6] 5 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.809 INFO:tasks.workunit.client.0.smithi186.stdout:1.17 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.008951+0000 0'0 46:11 [0] 0 [0] 0 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.809 INFO:tasks.workunit.client.0.smithi186.stdout:1.16 0 0 0 0 0 0 0 0 0 0 peering 2020-10-21T10:10:28.934322+0000 0'0 50:11 [6,0,7,4] 6 [6,0,7,4] 6 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.809 INFO:tasks.workunit.client.0.smithi186.stdout:1.15 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.428176+0000 0'0 48:11 [1] 1 [1] 1 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.809 INFO:tasks.workunit.client.0.smithi186.stdout:1.14 0 0 0 0 0 0 0 0 0 0 peering 2020-10-21T10:10:28.932213+0000 0'0 50:14 [2,0,4,5] 2 [2,0,4,5] 2 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.809 INFO:tasks.workunit.client.0.smithi186.stdout:1.13 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.926662+0000 0'0 47:11 [4] 4 [4] 4 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.809 INFO:tasks.workunit.client.0.smithi186.stdout:1.12 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.165449+0000 0'0 46:11 [0] 0 [0] 0 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.810 INFO:tasks.workunit.client.0.smithi186.stdout:1.11 1 0 0 0 0 52428800 0 0 13 13 active+clean 2020-10-21T10:10:14.008850+0000 49'13 49:24 [0] 0 [0] 0 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.810 INFO:tasks.workunit.client.0.smithi186.stdout:1.10 1 
0 3 0 0 52428800 0 0 13 13 activating 2020-10-21T10:10:30.278627+0000 49'13 53:31 [5,7,6,0] 5 [5,7,6,0] 5 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.810 INFO:tasks.workunit.client.0.smithi186.stdout:1.5 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.838496+0000 0'0 47:11 [4] 4 [4] 4 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.810 INFO:tasks.workunit.client.0.smithi186.stdout:1.4 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.999973+0000 0'0 48:11 [3] 3 [3] 3 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.810 INFO:tasks.workunit.client.0.smithi186.stdout:1.3 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.926650+0000 0'0 47:11 [4] 4 [4] 4 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.810 INFO:tasks.workunit.client.0.smithi186.stdout:1.2 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.165020+0000 0'0 46:11 [0] 0 [0] 0 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.810 INFO:tasks.workunit.client.0.smithi186.stdout:1.0 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:15.543612+0000 0'0 48:11 [7] 7 [7] 7 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.811 INFO:tasks.workunit.client.0.smithi186.stdout:1.1 0 0 0 0 0 0 0 0 0 0 activating 2020-10-21T10:10:30.277801+0000 0'0 53:18 [5,0,4,1] 5 [5,0,4,1] 5 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.811 INFO:tasks.workunit.client.0.smithi186.stdout:1.6 0 0 0 0 0 0 0 0 0 0 peering 2020-10-21T10:10:28.933297+0000 0'0 50:11 [6,7,3,4] 6 [6,7,3,4] 6 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.811 INFO:tasks.workunit.client.0.smithi186.stdout:1.7 0 0 0 0 0 0 0 0 0 0 peering 2020-10-21T10:10:28.934903+0000 0'0 50:11 [6,7,3,2] 6 
[6,7,3,2] 6 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.812 INFO:tasks.workunit.client.0.smithi186.stdout:1.8 1 0 3 0 0 52428800 0 0 13 13 activating 2020-10-21T10:10:30.280176+0000 49'13 53:31 [5,4,3,6] 5 [5,4,3,6] 5 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.812 INFO:tasks.workunit.client.0.smithi186.stdout:1.9 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.579478+0000 0'0 48:11 [1] 1 [1] 1 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.812 INFO:tasks.workunit.client.0.smithi186.stdout:1.a 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.578889+0000 0'0 48:11 [1] 1 [1] 1 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.812 INFO:tasks.workunit.client.0.smithi186.stdout:1.b 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.579381+0000 0'0 48:11 [1] 1 [1] 1 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.812 INFO:tasks.workunit.client.0.smithi186.stdout:1.c 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.999937+0000 0'0 48:11 [3] 3 [3] 3 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.812 INFO:tasks.workunit.client.0.smithi186.stdout:1.d 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:15.543603+0000 0'0 48:11 [7] 7 [7] 7 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.813 INFO:tasks.workunit.client.0.smithi186.stdout:1.e 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T10:10:14.578787+0000 0'0 48:11 [1] 1 [1] 1 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0 2020-10-21T10:10:31.813 INFO:tasks.workunit.client.0.smithi186.stdout:1.f 0 0 0 0 0 0 0 0 0 0 peering 2020-10-21T10:10:28.932569+0000 0'0 50:11 [6,0,3,4] 6 [6,0,3,4] 6 0'0 2020-10-21T10:10:13.367340+0000 0'0 2020-10-21T10:10:13.367340+0000 0
From my local run, where recovery is started by increasing the pool size to 4. Here every PG is either active+clean, recovery_wait, or recovering.
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN 1.1f 1 0 3 0 0 52428800 0 0 13 13 active+recovery_wait+degraded 2020-10-21T22:22:38.766809+0000 47'13 50:34 [5,0,2,1] 5 [5,0,2,1] 5 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.1e 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.420320+0000 0'0 49:23 [3,0,7,4] 3 [3,0,7,4] 3 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.1d 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.448878+0000 0'0 49:23 [1,3,0,2] 1 [1,3,0,2] 1 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.1c 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.326899+0000 0'0 49:23 [0,3,4,1] 0 [0,3,4,1] 0 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.1b 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.251622+0000 0'0 49:23 [0,1,2,7] 0 [0,1,2,7] 0 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.1a 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.658852+0000 0'0 50:23 [1,7,4,2] 1 [1,7,4,2] 1 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.19 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.419342+0000 0'0 49:23 [3,0,4,2] 3 [3,0,4,2] 3 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.18 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:38.475363+0000 0'0 50:23 [5,2,3,6] 5 [5,2,3,6] 5 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.17 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.738434+0000 0'0 50:23 [0,4,5,6] 0 [0,4,5,6] 0 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.9 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.796909+0000 0'0 50:23 [1,4,5,2] 1 [1,4,5,2] 1 0'0 
2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.8 1 0 3 0 0 52428800 0 0 13 13 active+recovery_wait+degraded 2020-10-21T22:22:38.497747+0000 47'13 50:34 [5,4,3,6] 5 [5,4,3,6] 5 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.7 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.467696+0000 0'0 49:23 [6,7,3,2] 6 [6,7,3,2] 6 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.6 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.369515+0000 0'0 49:23 [6,7,3,4] 6 [6,7,3,4] 6 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.5 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.459810+0000 0'0 49:24 [4,2,6,5] 4 [4,2,6,5] 4 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.a 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.451781+0000 0'0 49:23 [1,6,2,4] 1 [1,6,2,4] 1 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.1 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:38.767540+0000 0'0 50:23 [5,0,4,1] 5 [5,0,4,1] 5 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.16 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.466511+0000 0'0 49:23 [6,0,7,4] 6 [6,0,7,4] 6 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.4 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.645665+0000 0'0 49:23 [3,0,6,1] 3 [3,0,6,1] 3 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.0 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.412391+0000 0'0 49:22 [7,0,2,3] 7 [7,0,2,3] 7 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.2 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.460981+0000 0'0 49:23 [0,7,1,3] 0 [0,7,1,3] 0 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.3 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.553270+0000 0'0 49:24 [4,1,2,6] 4 [4,1,2,6] 4 0'0 
2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.b 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.459097+0000 0'0 50:23 [1,0,2,6] 1 [1,0,2,6] 1 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.c 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.644178+0000 0'0 49:23 [3,0,4,6] 3 [3,0,4,6] 3 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.d 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.452305+0000 0'0 49:22 [7,4,2,1] 7 [7,4,2,1] 7 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.e 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.471397+0000 0'0 50:23 [1,4,2,0] 1 [1,4,2,0] 1 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.f 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.445116+0000 0'0 49:23 [6,0,3,4] 6 [6,0,3,4] 6 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.10 1 0 3 0 0 52428800 0 0 13 13 active+recovery_wait+degraded 2020-10-21T22:22:38.768770+0000 47'13 50:34 [5,7,6,0] 5 [5,7,6,0] 5 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.11 1 0 3 0 0 52428800 0 0 13 13 active+recovering+degraded 2020-10-21T22:22:37.880344+0000 47'13 50:35 [0,1,3,4] 0 [0,1,3,4] 0 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.12 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.258836+0000 0'0 49:23 [0,1,6,7] 0 [0,1,6,7] 0 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.13 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.571128+0000 0'0 49:24 [4,6,3,1] 4 [4,6,3,1] 4 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.14 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.395515+0000 0'0 49:23 [2,0,4,5] 2 [2,0,4,5] 2 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0 1.15 0 0 0 0 0 0 0 0 0 0 active+clean 2020-10-21T22:22:37.765700+0000 0'0 50:23 [1,5,3,0] 1 [1,5,3,0] 
1 0'0 2020-10-21T22:22:21.954917+0000 0'0 2020-10-21T22:22:21.954917+0000 0
#2 Updated by Neha Ojha over 3 years ago
/a/kchai-2020-10-21_07:01:44-rados-wip-kefu-testing-2020-10-21-1144-distro-basic-smithi/5545065
#3 Updated by Neha Ojha over 3 years ago
- Priority changed from Normal to Urgent
/a/teuthology-2020-10-23_07:01:02-rados-master-distro-basic-smithi/5550707
#4 Updated by Neha Ojha over 3 years ago
- Status changed from New to Triaged
/a/teuthology-2020-10-21_07:01:02-rados-master-distro-basic-smithi/5544900 - here the failure occurred because the last_scrub_stamp for 1.1b never got updated, meaning the PG wasn't scrubbed as expected.
2020-10-21T10:17:11.197 INFO:tasks.workunit.client.0.smithi186.stderr:27344: //home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1464: get_last_scrub_stamp: local pgid=1.1b 2020-10-21T10:17:11.197 INFO:tasks.workunit.client.0.smithi186.stderr:27344: //home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1465: get_last_scrub_stamp: local sname=last_scrub_stamp 2020-10-21T10:17:11.197 INFO:tasks.workunit.client.0.smithi186.stderr:27344: //home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1466: get_last_scrub_stamp: ceph --format json pg dump pgs 2020-10-21T10:17:11.197 INFO:tasks.workunit.client.0.smithi186.stderr:27344: //home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1467: get_last_scrub_stamp: jq -r '.pg_stats | .[] | select(.pgid=="1.1b") | .last_scrub_stamp' 2020-10-21T10:17:11.198 INFO:tasks.workunit.client.0.smithi186.stderr:27344: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1884: wait_for_scrub: test 2020-10-21T10:10:13.367340+0000 '>' 2020-10-21T10:10:13.367340+0000 2020-10-21T10:17:11.198 INFO:tasks.workunit.client.0.smithi186.stderr:27344: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1887: wait_for_scrub: sleep 1 2020-10-21T10:17:11.198 INFO:tasks.workunit.client.0.smithi186.stderr:27344: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1883: wait_for_scrub: (( i++ )) 2020-10-21T10:17:11.198 INFO:tasks.workunit.client.0.smithi186.stderr:27344: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1883: wait_for_scrub: (( i < 300 )) 2020-10-21T10:17:11.198 INFO:tasks.workunit.client.0.smithi186.stderr:27344: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1889: wait_for_scrub: return 1
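The wait loop that times out here is the standard ceph-helpers.sh polling pattern: fetch the PG's last_scrub_stamp, compare it lexicographically to the starting stamp (the ISO-8601 stamps sort correctly as strings), and retry for up to 300 one-second iterations. A simplified, self-contained sketch, with the `ceph pg dump | jq` query abstracted into a stubbable get_stamp (STAMP_DIR is a test hook for this sketch only, not part of the real helper):

```shell
#!/bin/bash
# Simplified sketch of the get_last_scrub_stamp / wait_for_scrub pattern
# from qa/standalone/ceph-helpers.sh. In the real helper, get_stamp runs:
#   ceph --format json pg dump pgs |
#       jq -r ".pg_stats | .[] | select(.pgid==\"$pgid\") | .last_scrub_stamp"
# Here it reads a file so the logic can be exercised without a cluster.
STAMP_DIR=${STAMP_DIR:-/tmp}

get_stamp() {
    cat "$STAMP_DIR/stamp.$1" 2>/dev/null
}

wait_for_scrub() {
    local pgid=$1 last_stamp=$2 i
    for ((i = 0; i < 300; i++)); do
        # ISO-8601 timestamps compare correctly as strings, hence '>'.
        if test "$(get_stamp "$pgid")" '>' "$last_stamp"; then
            return 0
        fi
        sleep 1
    done
    return 1    # the "wait_for_scrub: return 1" seen in the trace
}
```

The failure in this run is exactly the return-1 path: the stamp never moved past 2020-10-21T10:10:13.367340+0000, so all 300 iterations compared equal stamps.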
This is where the PG was marked for scrub:
2020-10-21T10:10:33.482+0000 7f6bac7c5700 1 -- [v2:127.0.0.1:6802/28162,v1:127.0.0.1:6803/28162] <== mgr.4100 v2:127.0.0.1:6800/27489 4 ==== scrub2([1.1b]) v1 ==== 50+0+0 (crc 0 0 0) 0x55a979f95980 con 0x55a97a006000 2020-10-21T10:10:33.482+0000 7f6bac7c5700 10 osd.0 53 handle_fast_scrub scrub2([1.1b]) v1 2020-10-21T10:10:33.482+0000 7f6bac7c5700 15 osd.0 53 enqueue_peering_evt 1.1b epoch_sent: 53 epoch_requested: 53 RequestScrub(shallow 2020-10-21T10:10:33.482+0000 7f6bac7c5700 20 osd.0 op_wq(2) _enqueue OpSchedulerItem(1.1b PGPeeringEvent(epoch_sent: 53 epoch_requested: 53 RequestScrub(shallow) prio 255 cost 10 e53) 2020-10-21T10:10:33.482+0000 7f6b90ed8700 20 osd.0 op_wq(2) _process 1.1b to_process <> waiting <> waiting_peering {} 2020-10-21T10:10:33.482+0000 7f6b90ed8700 20 osd.0 op_wq(2) _process OpSchedulerItem(1.1b PGPeeringEvent(epoch_sent: 53 epoch_requested: 53 RequestScrub(shallow) prio 255 cost 10 e53) queued 2020-10-21T10:10:33.482+0000 7f6b90ed8700 20 osd.0 op_wq(2) _process 1.1b to_process <OpSchedulerItem(1.1b PGPeeringEvent(epoch_sent: 53 epoch_requested: 53 RequestScrub(shallow) prio 255 cost 10 e53)> waiting <> waiting_peering {} 2020-10-21T10:10:33.482+0000 7f6b90ed8700 20 osd.0 op_wq(2) _process OpSchedulerItem(1.1b PGPeeringEvent(epoch_sent: 53 epoch_requested: 53 RequestScrub(shallow) prio 255 cost 10 e53) pg 0x55a97a37d400 2020-10-21T10:10:33.482+0000 7f6b90ed8700 10 osd.0 pg_epoch: 53 pg[1.1b( empty local-lis/les=50/52 n=0 ec=42/42 lis/c=50/50 les/c/f=52/52/0 sis=50) [0,1,2,7] r=0 lpr=50 crt=0'0 mlcod 0'0 active+clean] do_peering_event: epoch_sent: 53 epoch_requested: 53 RequestScrub(shallow 2020-10-21T10:10:33.482+0000 7f6b90ed8700 10 osd.0 pg_epoch: 53 pg[1.1b( empty local-lis/les=50/52 n=0 ec=42/42 lis/c=50/50 les/c/f=52/52/0 sis=50) [0,1,2,7] r=0 lpr=50 crt=0'0 mlcod 0'0 active+clean MUST_SCRUB REQ_SCRUB] reg_next_scrub pg 1.1b register next scrub, scrub time 0.000000, must = 1 2020-10-21T10:10:33.482+0000 7f6b90ed8700 10 osd.0 
pg_epoch: 53 pg[1.1b( empty local-lis/les=50/52 n=0 ec=42/42 lis/c=50/50 les/c/f=52/52/0 sis=50) [0,1,2,7] r=0 lpr=50 crt=0'0 mlcod 0'0 active+clean MUST_SCRUB REQ_SCRUB] state<Started/Primary>: marking for scrub
There is an acting set change initiated here:
2020-10-21T10:11:11.326+0000 7f6b90ed8700 20 osd.0 pg_epoch: 54 pg[1.1b( empty local-lis/les=50/52 n=0 ec=42/42 lis/c=50/50 les/c/f=52/52/0 sis=50) [0,1,2,7] r=0 lpr=50 crt=0'0 mlcod 0'0 active+clean MUST_SCRUB REQ_SCRUB] new interval newup [5,1,2,7] newacting [5,1,2,7]
after which the MUST_SCRUB and REQ_SCRUB flags get cleared as part of "PrimaryLogPG::on_change":
2020-10-21T10:11:11.326+0000 7f6b90ed8700 10 osd.0 pg_epoch: 54 pg[1.1b( empty local-lis/les=50/52 n=0 ec=42/42 lis/c=50/50 les/c/f=52/52/0 sis=54 pruub=14.961838269s) [5,1,2,7] r=-1 lpr=54 pi=[50,54)/1 crt=0'0 mlcod 0'0 unknown pruub 134.277458496s@ MUST_SCRUB REQ_SCRUB mbc={}] cancel_manifest_ops 2020-10-21T10:11:11.326+0000 7f6b90ed8700 10 osd.0 pg_epoch: 54 pg[1.1b( empty local-lis/les=50/52 n=0 ec=42/42 lis/c=50/50 les/c/f=52/52/0 sis=54 pruub=14.961838269s) [5,1,2,7] r=-1 lpr=54 pi=[50,54)/1 crt=0'0 mlcod 0'0 unknown pruub 134.277458496s@ mbc={}] on_change_cleanup
Thus, the PG never gets scrubbed.
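Given that an interval change silently drops the one-shot scrub request, one way a test could harden its wait (a sketch only, not the fix that went into the PR) is to re-issue the request on every poll, so a request cleared by on_change simply gets queued again. request_scrub and get_stamp here are caller-supplied stand-ins for `ceph pg scrub <pgid>` and the pg-dump/jq stamp query:

```shell
#!/bin/bash
# Hypothetical hardened wait: re-request the scrub each iteration so a
# MUST_SCRUB/REQ_SCRUB request lost to an interval change is simply
# issued again. request_scrub and get_stamp are stand-ins for
# "ceph pg scrub <pgid>" and the pg-dump/jq stamp query; the caller
# must provide them.
wait_for_scrub_retry() {
    local pgid=$1 last_stamp=$2 i
    for ((i = 0; i < 300; i++)); do
        request_scrub "$pgid"      # re-arming an already-pending scrub is assumed harmless
        if test "$(get_stamp "$pgid")" '>' "$last_stamp"; then
            return 0
        fi
        sleep 1
    done
    return 1
}
```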
#5 Updated by David Zafman over 3 years ago
We only need one pool with one PG if we orchestrate things carefully. The existing test is more like a shotgun: it sends lots of scrubs and assumes one will hit the target.
We would also benefit from having more, and maybe bigger, objects in each pool to make sure each scrub runs while recovery is still in progress.
#6 Updated by David Zafman over 3 years ago
- Assignee set to David Zafman
#7 Updated by David Zafman over 3 years ago
- Status changed from Triaged to In Progress
#8 Updated by David Zafman over 3 years ago
- Pull request ID set to 37908
#9 Updated by David Zafman over 3 years ago
Enhancement:
Possible test cases to replace the existing test:
1. Simple test for "not scheduling scrubs due to active recovery"
2. With osd_scrub_during_recovery=true, make sure the scrub happens
Additional test cases (see https://tracker.ceph.com/issues/48173):
3. Remote reservation: non-overlapping PGs; start recovery on a PG that has a replica
4. Failed local reservation (needs a sleep in OSD::sched_scrub)
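For test case 1 above, the assertion boils down to finding the deferral message in the OSD/cluster log. A minimal sketch of that check (the message text is the one quoted in the test-case list; the function name and log handling are illustrative):

```shell
#!/bin/bash
# Sketch for test case 1: succeed iff the log shows the OSD declining to
# schedule scrubs while recovery is active. The message text is the one
# quoted in the test-case list; the function name is illustrative.
scrub_deferred_during_recovery() {
    local logfile=$1
    grep -q "not scheduling scrubs due to active recovery" "$logfile"
}
```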
#10 Updated by Neha Ojha over 3 years ago
/a/teuthology-2020-11-04_07:01:02-rados-master-distro-basic-smithi/5590019
#11 Updated by David Zafman over 3 years ago
- Related to Bug #48173: Additional test cases for osd-recovery-scrub.sh added
#12 Updated by Neha Ojha over 3 years ago
- Status changed from In Progress to Fix Under Review
#13 Updated by Kefu Chai over 3 years ago
- Status changed from Fix Under Review to Resolved