Bug #44314
osd-backfill-stats.sh failing intermittently in TEST_backfill_sizeup_out() (degraded outside margin)
Description
2020-02-26T14:02:57.914 INFO:tasks.workunit.client.0.smithi061.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1276: get_is_making_recovery_progress: local progress=null
2020-02-26T14:02:57.915 INFO:tasks.workunit.client.0.smithi061.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1277: get_is_making_recovery_progress: test null '!=' null
2020-02-26T14:02:57.915 INFO:tasks.workunit.client.0.smithi061.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1601: wait_for_clean: (( 13 >= 13 ))
2020-02-26T14:02:57.915 INFO:tasks.workunit.client.0.smithi061.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1602: wait_for_clean: ceph report ...
2020-02-26T14:02:58.655 INFO:tasks.workunit.client.0.smithi061.stderr:report 2563389435
2020-02-26T14:02:58.655 INFO:tasks.workunit.client.0.smithi061.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1603: wait_for_clean: return 1
2020-02-26T14:02:58.655 INFO:tasks.workunit.client.0.smithi061.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:732: TEST_backfill_ec_down_out: return 1
2020-02-26T14:02:58.655 INFO:tasks.workunit.client.0.smithi061.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:37: run: return 1
/a/sage-2020-02-26_08:10:43-rados-wip-sage2-testing-2020-02-25-2110-distro-basic-smithi/4804098
Seems like roughly a 30% failure rate (very rough estimate).
Related issues
History
#1 Updated by Neha Ojha over 3 years ago
- Assignee set to David Zafman
#2 Updated by David Zafman over 3 years ago
The nobackfill flag was unset after an attempt to start backfill had already been initiated and deferred because of the flag. Once the flag was unset, backfilling should have been requeued.
2020-02-26T14:01:00.667+0000 7f6071666700 10 osd.1 44 do_recovery starting 1 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=43/44 n=200 ec=32/32 lis/c=43/32 les/c/f=44/33/0 sis=43) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=43 pi=[32,43)/2 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}]
2020-02-26T14:01:00.667+0000 7f6071666700 10 osd.1 pg_epoch: 44 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=43/44 n=200 ec=32/32 lis/c=43/32 les/c/f=44/33/0 sis=43) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=43 pi=[32,43)/2 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}] recover_replicas(1)
2020-02-26T14:01:00.667+0000 7f6071666700 10 osd.1 pg_epoch: 44 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=43/44 n=200 ec=32/32 lis/c=43/32 les/c/f=44/33/0 sis=43) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=43 pi=[32,43)/2 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}] recover_primary recovering 0 in pg, missing missing(0 may_include_deletes = 1)
2020-02-26T14:01:00.667+0000 7f6071666700 10 osd.1 pg_epoch: 44 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=43/44 n=200 ec=32/32 lis/c=43/32 les/c/f=44/33/0 sis=43) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=43 pi=[32,43)/2 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}] deferring backfill due to NOBACKFILL
2020-02-26T14:01:01.797 INFO:tasks.workunit.client.0.smithi061.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:728: TEST_backfill_ec_down_out: ceph osd unset nobackfill
2020-02-26T14:01:02.651 INFO:tasks.workunit.client.0.smithi061.stderr:nobackfill is unset
#3 Updated by David Zafman over 3 years ago
In the failed run, kick_recovery_wq did not get backfill restarted, or a (periodic?) recovery attempt was somehow never scheduled.
Test code:
ceph osd unset nobackfill
ceph tell osd.$primary debug kick_recovery_wq 0
sleep 2
There is no "do_recovery wake up ..." after the kick above in the failed run.
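For reference, one quick way to confirm this from the primary's log (a sketch only; the log path and $dir variable are assumptions about how the standalone tests name per-OSD log files, so adjust to the actual test directory):

# Hypothetical check: shortly after kick_recovery_wq the requeue message should
# appear on the primary (osd.1 here); no match means backfill was never requeued.
grep "do_recovery wake up" $dir/osd.1.log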
When run locally, the test passes because recovery is restarted like this:
2020-02-28T15:52:42.777-0800 7f4e355f3700 10 osd.1 pg_epoch: 42 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}] deferring backfill due to NOBACKFILL
2020-02-28T15:52:42.777-0800 7f4e355f3700 10 osd.1 42 do_recovery started 0/1 on pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}]
2020-02-28T15:52:44.489-0800 7f4e355f3700 10 osd.1 pg_epoch: 42 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}] handle_advance_map: 43
2020-02-28T15:52:44.489-0800 7f4e355f3700 10 osd.1 pg_epoch: 42 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}] handle_advance_map [1,3,2]/[1,2147483647,2] -- 1/1
2020-02-28T15:52:44.489-0800 7f4e355f3700 10 osd.1 pg_epoch: 43 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}] Active: kicking snap trim
2020-02-28T15:52:44.593-0800 7f4e4c621700 20 osd.1 43 do_recovery wake up at 2020-02-28T15:52:44.594206-0800, re-queuing recovery
2020-02-28T15:52:44.593-0800 7f4e355f3700 10 osd.1 43 do_recovery starting 1 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}]
2020-02-28T15:52:44.601-0800 7f4e355f3700 10 osd.1 43 do_recovery started 1/1 on pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling rops=1 mbc={}]
2020-02-28T15:52:44.653-0800 7f4e43630700 20 osd.1 43 OSD::ms_dispatch: command(tid 2: {"prefix": "debug kick_recovery_wq", "delay": 0}) v1
2020-02-28T15:52:44.653-0800 7f4e43630700 20 osd.1 43 _dispatch 0x55f1fc284aa0 command(tid 2: {"prefix": "debug kick_recovery_wq", "delay": 0}) v1
2020-02-28T15:52:44.705-0800 7f4e4c621700 20 osd.1 43 do_recovery wake up at 2020-02-28T15:52:44.707898-0800, re-queuing recovery
2020-02-28T15:52:44.705-0800 7f4e355f3700 10 osd.1 43 do_recovery starting 1 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}]
2020-02-28T15:52:44.709-0800 7f4e355f3700 10 osd.1 43 do_recovery started 1/1 on pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling rops=1 mbc={}]
#4 Updated by David Zafman over 3 years ago
- Related to Bug #44518: osd/osd-backfill-stats.sh TEST_backfill_out2: wait_for_clean timeout added
#5 Updated by David Zafman over 3 years ago
- Duplicated by Bug #44517: osd/osd-backfill-space.sh TEST_backfill_multi_partial: pgs didn't go active+clean added
#6 Updated by David Zafman over 3 years ago
- Duplicated by deleted (Bug #44517: osd/osd-backfill-space.sh TEST_backfill_multi_partial: pgs didn't go active+clean)
#7 Updated by David Zafman over 3 years ago
- Related to deleted (Bug #44518: osd/osd-backfill-stats.sh TEST_backfill_out2: wait_for_clean timeout)
#8 Updated by David Zafman over 3 years ago
- Related to Bug #44518: osd/osd-backfill-stats.sh TEST_backfill_out2: wait_for_clean timeout added
#9 Updated by Neha Ojha over 3 years ago
Re-purposing this ticket to track a new failure. Note that this is different from https://tracker.ceph.com/issues/44518, which was fixed by https://github.com/ceph/ceph/pull/34047.
/a/nojha-2020-03-22_20:02:33-rados:standalone-master-distro-basic-smithi/4880628/
2020-03-22T21:30:42.834 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:97: check: FIRST=375
2020-03-22T21:30:42.834 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:98: check: '[' -n 375 ']'
2020-03-22T21:30:42.835 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:99: check: below_margin 375 400
2020-03-22T21:30:42.835 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:43: below_margin: local -i check=375
2020-03-22T21:30:42.836 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:44: below_margin: shift
2020-03-22T21:30:42.837 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:45: below_margin: local -i target=400
2020-03-22T21:30:42.837 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:47: below_margin: return 1
2020-03-22T21:30:42.838 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:99: check: return 1
2020-03-22T21:30:42.839 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:212: TEST_backfill_sizeup_out: return 1
#10 Updated by David Zafman over 3 years ago
- Subject changed from osd-backfill-stats.sh failing intermittently to osd-backfill-stats.sh failing intermittently in TEST_backfill_sizeup_out() (degraded outside margin)
It would be helpful to see the OSD logs when this happens. We expect the following sequence to occur:
1. Backfilling starts.
2. update_calc_stats() logs the degraded and misplaced counts.
3. The test greps the log for the degraded and misplaced values at the start of backfill.
We added a margin of 10 because a fast machine could start moving objects before update_calc_stats() runs; a sketch of that margin check follows below.
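For illustration, a minimal sketch of the kind of margin check the test performs, assuming a below_margin helper and a global margin of 10 as suggested by the trace above (the actual helper in osd-backfill-stats.sh may differ in detail):

# Sketch only: succeed if the observed value is no more than $margin below the target.
margin=10

below_margin() {
    local -i check=$1
    shift
    local -i target=$1
    (( check <= target && check >= target - margin ))
}

# Example from the failing run: 375 is more than 10 below 400, so this fails.
below_margin 375 400 || echo "degraded outside margin"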
I'll see what happens if I use the norecover flag to give update_calc_stats() a chance to run. Alternatively, we could add an osd tell command that triggers an update_calc_stats().
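A rough sketch of the norecover idea follows; this is not the actual change in the pull request, just the approach described above, and the commented-out trigger step is left unspecified on purpose:

# Hypothetical test variant: hold off all object movement with norecover (in
# addition to nobackfill) so update_calc_stats() can log the initial
# degraded/misplaced counts before anything moves, then release both flags.
ceph osd set norecover
ceph osd set nobackfill
# ... trigger the size change / mark the OSD out here ...
sleep 5
ceph osd unset nobackfill
ceph osd unset norecover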
#11 Updated by David Zafman over 3 years ago
- Status changed from New to In Progress
- Pull request ID set to 35425
#12 Updated by David Zafman over 3 years ago
- Status changed from In Progress to Pending Backport
- Backport set to octopus
#13 Updated by Nathan Cutler over 3 years ago
- Copied to Backport #46016: octopus: osd-backfill-stats.sh failing intermittently in TEST_backfill_sizeup_out() (degraded outside margin) added
#14 Updated by Nathan Cutler about 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".