Project

General

Profile

Bug #47930

scrub/osd-recovery-scrub.sh: TEST_recovery_scrub: wait_background: return 1

Added by Neha Ojha over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2020-10-21T10:17:11.200 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2046: wait_background:  return 1
2020-10-21T10:17:11.200 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-recovery-scrub.sh:73: TEST_recovery_scrub:  return_code=1
2020-10-21T10:17:11.200 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-recovery-scrub.sh:74: TEST_recovery_scrub:  '[' 1 -ne 0 ']'
2020-10-21T10:17:11.200 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-recovery-scrub.sh:74: TEST_recovery_scrub:  return 1
2020-10-21T10:17:11.200 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-recovery-scrub.sh:31: run:  return 1
2020-10-21T10:17:11.201 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2153: main:  code=1
2020-10-21T10:17:11.201 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2155: main:  teardown td/osd-recovery-scrub 1
2020-10-21T10:17:11.201 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:164: teardown:  local dir=td/osd-recovery-scrub
2020-10-21T10:17:11.201 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:165: teardown:  local dumplogs=1
2020-10-21T10:17:11.201 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:166: teardown:  kill_daemons td/osd-recovery-scrub KILL
2020-10-21T10:17:11.201 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:345: kill_daemons:  shopt -q -o xtrace
2020-10-21T10:17:11.201 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:345: kill_daemons:  echo true
2020-10-21T10:17:11.202 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:345: kill_daemons:  local trace=true
2020-10-21T10:17:11.202 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:346: kill_daemons:  true
2020-10-21T10:17:11.202 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:346: kill_daemons:  shopt -u -o xtrace
2020-10-21T10:17:11.305 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:362: kill_daemons:  return 0
2020-10-21T10:17:11.306 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:167: teardown:  uname
2020-10-21T10:17:11.306 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:167: teardown:  '[' Linux '!=' FreeBSD ']'
2020-10-21T10:17:11.307 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:168: teardown:  stat -f -c %T .
2020-10-21T10:17:11.308 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:168: teardown:  '[' ext2/ext3 == btrfs ']'
2020-10-21T10:17:11.308 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:171: teardown:  local cores=no
2020-10-21T10:17:11.308 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:172: teardown:  sysctl -n kernel.core_pattern
2020-10-21T10:17:11.309 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:172: teardown:  local pattern=/home/ubuntu/cephtest/archive/coredump/%t.%p.core
2020-10-21T10:17:11.309 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:174: teardown:  '[' / = '|' ']'
2020-10-21T10:17:11.310 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:180: teardown:  grep -q '^core\|core$'
2020-10-21T10:17:11.310 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:180: teardown:  dirname /home/ubuntu/cephtest/archive/coredump/%t.%p.core
2020-10-21T10:17:11.310 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:180: teardown:  ls /home/ubuntu/cephtest/archive/coredump
2020-10-21T10:17:11.311 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:189: teardown:  '[' no = yes -o 1 = 1 ']'
2020-10-21T10:17:11.311 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:190: teardown:  '[' -n '' ']'
2020-10-21T10:17:11.312 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:194: teardown:  mkdir -p /home/ubuntu/cephtest/archive/log
2020-10-21T10:17:11.312 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:195: teardown:  mv td/osd-recovery-scrub/mgr.x.log td/osd-recovery-scrub/mon.a.log td/osd-recovery-scrub/osd.0.log td/osd-recovery-scrub/osd.1.log td/osd-recovery-scrub/osd.2.log td/osd-recovery-scrub/osd.3.log td/osd-recovery-scrub/osd.4.log td/osd-recovery-scrub/osd.5.log td/osd-recovery-scrub/osd.6.log td/osd-recovery-scrub/osd.7.log /home/ubuntu/cephtest/archive/log
2020-10-21T10:17:11.313 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:198: teardown:  rm -fr td/osd-recovery-scrub
2020-10-21T10:17:11.320 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:199: teardown:  get_asok_dir
2020-10-21T10:17:11.320 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:108: get_asok_dir:  '[' -n '' ']'
2020-10-21T10:17:11.321 INFO:tasks.workunit.client.0.smithi186.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:111: get_asok_dir:  echo /tmp/ceph-asok.27344
2020-10-21T10:17:11.321 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:199: teardown:  rm -rf /tmp/ceph-asok.27344
2020-10-21T10:17:11.322 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:200: teardown:  '[' no = yes ']'
2020-10-21T10:17:11.322 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:207: teardown:  return 0
2020-10-21T10:17:11.322 INFO:tasks.workunit.client.0.smithi186.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2156: main:  return 1
2020-10-21T10:17:11.323 DEBUG:teuthology.orchestra.run:got remote process result: 1
2020-10-21T10:17:11.323 INFO:tasks.workunit:Stopping ['scrub'] on client.0...

/a/teuthology-2020-10-21_07:01:02-rados-master-distro-basic-smithi/5544900


Related issues

Related to RADOS - Bug #48173: Additional test cases for osd-recovery-scrub.sh New

History

#1 Updated by David Zafman over 3 years ago

Before scrubs were started in the background, not all PGs were in recovery. But somehow in this case the scrubs, probably blocked by recovery, didn't finish in time.

Failure case recovery isn't starting properly:

2020-10-21T10:10:31.807 INFO:tasks.workunit.client.0.smithi186.stdout:PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES     OMAP_BYTES*  OMAP_KEYS*  LOG  DISK_LOG  STATE         STATE_STAMP                      VERSION  REPORTED  UP         UP_PRIMARY  ACTING     ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP                      LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP                 SNAPTRIMQ_LEN
2020-10-21T10:10:31.807 INFO:tasks.workunit.client.0.smithi186.stdout:1.1f           1                   0         3          0        0  52428800            0           0   13        13    activating  2020-10-21T10:10:30.277259+0000    49'13     53:31  [5,0,2,1]           5  [5,0,2,1]               5         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.808 INFO:tasks.workunit.client.0.smithi186.stdout:1.1e           0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.999906+0000      0'0     48:11        [3]           3        [3]               3         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.808 INFO:tasks.workunit.client.0.smithi186.stdout:1.1d           0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.579584+0000      0'0     48:11        [1]           1        [1]               1         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.808 INFO:tasks.workunit.client.0.smithi186.stdout:1.1c           0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.165020+0000      0'0     46:11        [0]           0        [0]               0         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.808 INFO:tasks.workunit.client.0.smithi186.stdout:1.1b           0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.165451+0000      0'0     46:11        [0]           0        [0]               0         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.808 INFO:tasks.workunit.client.0.smithi186.stdout:1.1a           0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.578834+0000      0'0     48:11        [1]           1        [1]               1         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.808 INFO:tasks.workunit.client.0.smithi186.stdout:1.19           0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.864049+0000      0'0     48:11        [3]           3        [3]               3         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.808 INFO:tasks.workunit.client.0.smithi186.stdout:1.18           0                   0         0          0        0         0            0           0    0         0    activating  2020-10-21T10:10:30.294275+0000      0'0     53:18  [5,2,3,6]           5  [5,2,3,6]               5         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.809 INFO:tasks.workunit.client.0.smithi186.stdout:1.17           0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.008951+0000      0'0     46:11        [0]           0        [0]               0         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.809 INFO:tasks.workunit.client.0.smithi186.stdout:1.16           0                   0         0          0        0         0            0           0    0         0       peering  2020-10-21T10:10:28.934322+0000      0'0     50:11  [6,0,7,4]           6  [6,0,7,4]               6         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.809 INFO:tasks.workunit.client.0.smithi186.stdout:1.15           0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.428176+0000      0'0     48:11        [1]           1        [1]               1         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.809 INFO:tasks.workunit.client.0.smithi186.stdout:1.14           0                   0         0          0        0         0            0           0    0         0       peering  2020-10-21T10:10:28.932213+0000      0'0     50:14  [2,0,4,5]           2  [2,0,4,5]               2         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.809 INFO:tasks.workunit.client.0.smithi186.stdout:1.13           0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.926662+0000      0'0     47:11        [4]           4        [4]               4         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.809 INFO:tasks.workunit.client.0.smithi186.stdout:1.12           0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.165449+0000      0'0     46:11        [0]           0        [0]               0         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.810 INFO:tasks.workunit.client.0.smithi186.stdout:1.11           1                   0         0          0        0  52428800            0           0   13        13  active+clean  2020-10-21T10:10:14.008850+0000    49'13     49:24        [0]           0        [0]               0         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.810 INFO:tasks.workunit.client.0.smithi186.stdout:1.10           1                   0         3          0        0  52428800            0           0   13        13    activating  2020-10-21T10:10:30.278627+0000    49'13     53:31  [5,7,6,0]           5  [5,7,6,0]               5         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.810 INFO:tasks.workunit.client.0.smithi186.stdout:1.5            0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.838496+0000      0'0     47:11        [4]           4        [4]               4         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.810 INFO:tasks.workunit.client.0.smithi186.stdout:1.4            0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.999973+0000      0'0     48:11        [3]           3        [3]               3         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.810 INFO:tasks.workunit.client.0.smithi186.stdout:1.3            0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.926650+0000      0'0     47:11        [4]           4        [4]               4         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.810 INFO:tasks.workunit.client.0.smithi186.stdout:1.2            0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.165020+0000      0'0     46:11        [0]           0        [0]               0         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.810 INFO:tasks.workunit.client.0.smithi186.stdout:1.0            0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:15.543612+0000      0'0     48:11        [7]           7        [7]               7         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.811 INFO:tasks.workunit.client.0.smithi186.stdout:1.1            0                   0         0          0        0         0            0           0    0         0    activating  2020-10-21T10:10:30.277801+0000      0'0     53:18  [5,0,4,1]           5  [5,0,4,1]               5         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.811 INFO:tasks.workunit.client.0.smithi186.stdout:1.6            0                   0         0          0        0         0            0           0    0         0       peering  2020-10-21T10:10:28.933297+0000      0'0     50:11  [6,7,3,4]           6  [6,7,3,4]               6         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.811 INFO:tasks.workunit.client.0.smithi186.stdout:1.7            0                   0         0          0        0         0            0           0    0         0       peering  2020-10-21T10:10:28.934903+0000      0'0     50:11  [6,7,3,2]           6  [6,7,3,2]               6         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.812 INFO:tasks.workunit.client.0.smithi186.stdout:1.8            1                   0         3          0        0  52428800            0           0   13        13    activating  2020-10-21T10:10:30.280176+0000    49'13     53:31  [5,4,3,6]           5  [5,4,3,6]               5         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.812 INFO:tasks.workunit.client.0.smithi186.stdout:1.9            0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.579478+0000      0'0     48:11        [1]           1        [1]               1         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.812 INFO:tasks.workunit.client.0.smithi186.stdout:1.a            0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.578889+0000      0'0     48:11        [1]           1        [1]               1         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.812 INFO:tasks.workunit.client.0.smithi186.stdout:1.b            0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.579381+0000      0'0     48:11        [1]           1        [1]               1         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.812 INFO:tasks.workunit.client.0.smithi186.stdout:1.c            0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.999937+0000      0'0     48:11        [3]           3        [3]               3         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.812 INFO:tasks.workunit.client.0.smithi186.stdout:1.d            0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:15.543603+0000      0'0     48:11        [7]           7        [7]               7         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.813 INFO:tasks.workunit.client.0.smithi186.stdout:1.e            0                   0         0          0        0         0            0           0    0         0  active+clean  2020-10-21T10:10:14.578787+0000      0'0     48:11        [1]           1        [1]               1         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0
2020-10-21T10:10:31.813 INFO:tasks.workunit.client.0.smithi186.stdout:1.f            0                   0         0          0        0         0            0           0    0         0       peering  2020-10-21T10:10:28.932569+0000      0'0     50:11  [6,0,3,4]           6  [6,0,3,4]               6         0'0  2020-10-21T10:10:13.367340+0000              0'0  2020-10-21T10:10:13.367340+0000              0

From my local run since recovery is started by increase the pool size up to 4. Here every PG is either active+clean, recovery_wait or recovering.

PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES     OMAP_BYTES*  OMAP_KEYS*  LOG  DISK_LOG  STATE                          STATE_STAMP                      VERSION  REPORTED  UP         UP_PRIMARY  ACTING     ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP                      LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP                 SNAPTRIMQ_LEN
1.1f           1                   0         3          0        0  52428800            0           0   13        13  active+recovery_wait+degraded  2020-10-21T22:22:38.766809+0000    47'13     50:34  [5,0,2,1]           5  [5,0,2,1]               5         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.1e           0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.420320+0000      0'0     49:23  [3,0,7,4]           3  [3,0,7,4]               3         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.1d           0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.448878+0000      0'0     49:23  [1,3,0,2]           1  [1,3,0,2]               1         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.1c           0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.326899+0000      0'0     49:23  [0,3,4,1]           0  [0,3,4,1]               0         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.1b           0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.251622+0000      0'0     49:23  [0,1,2,7]           0  [0,1,2,7]               0         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.1a           0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.658852+0000      0'0     50:23  [1,7,4,2]           1  [1,7,4,2]               1         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.19           0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.419342+0000      0'0     49:23  [3,0,4,2]           3  [3,0,4,2]               3         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.18           0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:38.475363+0000      0'0     50:23  [5,2,3,6]           5  [5,2,3,6]               5         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.17           0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.738434+0000      0'0     50:23  [0,4,5,6]           0  [0,4,5,6]               0         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.9            0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.796909+0000      0'0     50:23  [1,4,5,2]           1  [1,4,5,2]               1         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.8            1                   0         3          0        0  52428800            0           0   13        13  active+recovery_wait+degraded  2020-10-21T22:22:38.497747+0000    47'13     50:34  [5,4,3,6]           5  [5,4,3,6]               5         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.7            0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.467696+0000      0'0     49:23  [6,7,3,2]           6  [6,7,3,2]               6         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.6            0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.369515+0000      0'0     49:23  [6,7,3,4]           6  [6,7,3,4]               6         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.5            0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.459810+0000      0'0     49:24  [4,2,6,5]           4  [4,2,6,5]               4         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.a            0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.451781+0000      0'0     49:23  [1,6,2,4]           1  [1,6,2,4]               1         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.1            0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:38.767540+0000      0'0     50:23  [5,0,4,1]           5  [5,0,4,1]               5         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.16           0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.466511+0000      0'0     49:23  [6,0,7,4]           6  [6,0,7,4]               6         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.4            0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.645665+0000      0'0     49:23  [3,0,6,1]           3  [3,0,6,1]               3         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.0            0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.412391+0000      0'0     49:22  [7,0,2,3]           7  [7,0,2,3]               7         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.2            0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.460981+0000      0'0     49:23  [0,7,1,3]           0  [0,7,1,3]               0         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.3            0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.553270+0000      0'0     49:24  [4,1,2,6]           4  [4,1,2,6]               4         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.b            0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.459097+0000      0'0     50:23  [1,0,2,6]           1  [1,0,2,6]               1         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.c            0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.644178+0000      0'0     49:23  [3,0,4,6]           3  [3,0,4,6]               3         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.d            0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.452305+0000      0'0     49:22  [7,4,2,1]           7  [7,4,2,1]               7         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.e            0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.471397+0000      0'0     50:23  [1,4,2,0]           1  [1,4,2,0]               1         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.f            0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.445116+0000      0'0     49:23  [6,0,3,4]           6  [6,0,3,4]               6         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.10           1                   0         3          0        0  52428800            0           0   13        13  active+recovery_wait+degraded  2020-10-21T22:22:38.768770+0000    47'13     50:34  [5,7,6,0]           5  [5,7,6,0]               5         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.11           1                   0         3          0        0  52428800            0           0   13        13     active+recovering+degraded  2020-10-21T22:22:37.880344+0000    47'13     50:35  [0,1,3,4]           0  [0,1,3,4]               0         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.12           0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.258836+0000      0'0     49:23  [0,1,6,7]           0  [0,1,6,7]               0         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.13           0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.571128+0000      0'0     49:24  [4,6,3,1]           4  [4,6,3,1]               4         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.14           0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.395515+0000      0'0     49:23  [2,0,4,5]           2  [2,0,4,5]               2         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0
1.15           0                   0         0          0        0         0            0           0    0         0                   active+clean  2020-10-21T22:22:37.765700+0000      0'0     50:23  [1,5,3,0]           1  [1,5,3,0]               1         0'0  2020-10-21T22:22:21.954917+0000              0'0  2020-10-21T22:22:21.954917+0000              0

#2 Updated by Neha Ojha over 3 years ago

/a/kchai-2020-10-21_07:01:44-rados-wip-kefu-testing-2020-10-21-1144-distro-basic-smithi/5545065

#3 Updated by Neha Ojha over 3 years ago

  • Priority changed from Normal to Urgent

/a/teuthology-2020-10-23_07:01:02-rados-master-distro-basic-smithi/5550707

#4 Updated by Neha Ojha over 3 years ago

  • Status changed from New to Triaged

/a/teuthology-2020-10-21_07:01:02-rados-master-distro-basic-smithi/5544900 - here the failure occurred because the last_scrub_stamp stamp for 1.1b never got updated, meaning the pg wasn't scrubbed as expected.

2020-10-21T10:17:11.197 INFO:tasks.workunit.client.0.smithi186.stderr:27344: //home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1464: get_last_scrub_stamp:  local pgid=1.1b
2020-10-21T10:17:11.197 INFO:tasks.workunit.client.0.smithi186.stderr:27344: //home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1465: get_last_scrub_stamp:  local sname=last_scrub_stamp
2020-10-21T10:17:11.197 INFO:tasks.workunit.client.0.smithi186.stderr:27344: //home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1466: get_last_scrub_stamp:  ceph --format json pg dump pgs
2020-10-21T10:17:11.197 INFO:tasks.workunit.client.0.smithi186.stderr:27344: //home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1467: get_last_scrub_stamp:  jq -r '.pg_stats | .[] | select(.pgid=="1.1b") | .last_scrub_stamp'
2020-10-21T10:17:11.198 INFO:tasks.workunit.client.0.smithi186.stderr:27344: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1884: wait_for_scrub:  test 2020-10-21T10:10:13.367340+0000 '>' 2020-10-21T10:10:13.367340+0000
2020-10-21T10:17:11.198 INFO:tasks.workunit.client.0.smithi186.stderr:27344: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1887: wait_for_scrub:  sleep 1
2020-10-21T10:17:11.198 INFO:tasks.workunit.client.0.smithi186.stderr:27344: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1883: wait_for_scrub:  (( i++ ))
2020-10-21T10:17:11.198 INFO:tasks.workunit.client.0.smithi186.stderr:27344: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1883: wait_for_scrub:  (( i < 300 ))
2020-10-21T10:17:11.198 INFO:tasks.workunit.client.0.smithi186.stderr:27344: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1889: wait_for_scrub:  return 1

This is where the PG was marked for scrub.

2020-10-21T10:10:33.482+0000 7f6bac7c5700  1 -- [v2:127.0.0.1:6802/28162,v1:127.0.0.1:6803/28162] <== mgr.4100 v2:127.0.0.1:6800/27489 4 ==== scrub2([1.1b]) v1 ==== 50+0+0 (crc 0 0 0) 0x55a979f95980 con 0x55a97a006000
2020-10-21T10:10:33.482+0000 7f6bac7c5700 10 osd.0 53 handle_fast_scrub scrub2([1.1b]) v1
2020-10-21T10:10:33.482+0000 7f6bac7c5700 15 osd.0 53 enqueue_peering_evt 1.1b epoch_sent: 53 epoch_requested: 53 RequestScrub(shallow
2020-10-21T10:10:33.482+0000 7f6bac7c5700 20 osd.0 op_wq(2) _enqueue OpSchedulerItem(1.1b PGPeeringEvent(epoch_sent: 53 epoch_requested: 53 RequestScrub(shallow) prio 255 cost 10 e53)
2020-10-21T10:10:33.482+0000 7f6b90ed8700 20 osd.0 op_wq(2) _process 1.1b to_process <> waiting <> waiting_peering {}
2020-10-21T10:10:33.482+0000 7f6b90ed8700 20 osd.0 op_wq(2) _process OpSchedulerItem(1.1b PGPeeringEvent(epoch_sent: 53 epoch_requested: 53 RequestScrub(shallow) prio 255 cost 10 e53) queued
2020-10-21T10:10:33.482+0000 7f6b90ed8700 20 osd.0 op_wq(2) _process 1.1b to_process <OpSchedulerItem(1.1b PGPeeringEvent(epoch_sent: 53 epoch_requested: 53 RequestScrub(shallow) prio 255 cost 10 e53)> waiting <> waiting_peering {}
2020-10-21T10:10:33.482+0000 7f6b90ed8700 20 osd.0 op_wq(2) _process OpSchedulerItem(1.1b PGPeeringEvent(epoch_sent: 53 epoch_requested: 53 RequestScrub(shallow) prio 255 cost 10 e53) pg 0x55a97a37d400
2020-10-21T10:10:33.482+0000 7f6b90ed8700 10 osd.0 pg_epoch: 53 pg[1.1b( empty local-lis/les=50/52 n=0 ec=42/42 lis/c=50/50 les/c/f=52/52/0 sis=50) [0,1,2,7] r=0 lpr=50 crt=0'0 mlcod 0'0 active+clean] do_peering_event: epoch_sent: 53 epoch_requested: 53 RequestScrub(shallow
2020-10-21T10:10:33.482+0000 7f6b90ed8700 10 osd.0 pg_epoch: 53 pg[1.1b( empty local-lis/les=50/52 n=0 ec=42/42 lis/c=50/50 les/c/f=52/52/0 sis=50) [0,1,2,7] r=0 lpr=50 crt=0'0 mlcod 0'0 active+clean MUST_SCRUB REQ_SCRUB] reg_next_scrub pg 1.1b register next scrub, scrub time 0.000000, must = 1
2020-10-21T10:10:33.482+0000 7f6b90ed8700 10 osd.0 pg_epoch: 53 pg[1.1b( empty local-lis/les=50/52 n=0 ec=42/42 lis/c=50/50 les/c/f=52/52/0 sis=50) [0,1,2,7] r=0 lpr=50 crt=0'0 mlcod 0'0 active+clean MUST_SCRUB REQ_SCRUB] state<Started/Primary>: marking for scrub

There is an acting set change that's initiated here

2020-10-21T10:11:11.326+0000 7f6b90ed8700 20 osd.0 pg_epoch: 54 pg[1.1b( empty local-lis/les=50/52 n=0 ec=42/42 lis/c=50/50 les/c/f=52/52/0 sis=50) [0,1,2,7] r=0 lpr=50 crt=0'0 mlcod 0'0 active+clean MUST_SCRUB REQ_SCRUB] new interval newup [5,1,2,7] newacting [5,1,2,7]

after which the MUST_SCRUB REQ_SCRUB flags get cleared as a part of "PrimaryLogPG::on_change"

2020-10-21T10:11:11.326+0000 7f6b90ed8700 10 osd.0 pg_epoch: 54 pg[1.1b( empty local-lis/les=50/52 n=0 ec=42/42 lis/c=50/50 les/c/f=52/52/0 sis=54 pruub=14.961838269s) [5,1,2,7] r=-1 lpr=54 pi=[50,54)/1 crt=0'0 mlcod 0'0 unknown pruub 134.277458496s@ MUST_SCRUB REQ_SCRUB mbc={}] cancel_manifest_ops
2020-10-21T10:11:11.326+0000 7f6b90ed8700 10 osd.0 pg_epoch: 54 pg[1.1b( empty local-lis/les=50/52 n=0 ec=42/42 lis/c=50/50 les/c/f=52/52/0 sis=54 pruub=14.961838269s) [5,1,2,7] r=-1 lpr=54 pi=[50,54)/1 crt=0'0 mlcod 0'0 unknown pruub 134.277458496s@ mbc={}] on_change_cleanup

Thus, the pg never gets scrubbed.

#5 Updated by David Zafman over 3 years ago

We only need 1 pool with 1 pg, if we orchestrate carefully. The existing test is more like a shotgun, sending lots of scrubs assuming 1 will hit the target.

We would benefit having more and maybe bigger objects in each pool to make sure each scrub runs while recovery is still going.

#6 Updated by David Zafman over 3 years ago

  • Assignee set to David Zafman

#7 Updated by David Zafman over 3 years ago

  • Status changed from Triaged to In Progress

#8 Updated by David Zafman over 3 years ago

  • Pull request ID set to 37908

#9 Updated by David Zafman over 3 years ago

Enhancement:

Possible test cases to replace existing test
1. Simple test for "not scheduling scrubs due to active recovery"
2. osd_scrub_during_recovery=true make sure scrub happens

Additional test cases see https://tracker.ceph.com/issues/48173
3. Remote reservation non-overlapping PGs start recovery on PG that has a replica
4. failed local (need sleep in OSD::sched_scrub)

#10 Updated by Neha Ojha over 3 years ago

/a/teuthology-2020-11-04_07:01:02-rados-master-distro-basic-smithi/5590019

#11 Updated by David Zafman over 3 years ago

  • Related to Bug #48173: Additional test cases for osd-recovery-scrub.sh added

#12 Updated by Neha Ojha over 3 years ago

  • Status changed from In Progress to Fix Under Review

#13 Updated by Kefu Chai over 3 years ago

  • Status changed from Fix Under Review to Resolved

Also available in: Atom PDF