Project

General

Profile

Bug #44314

osd-backfill-stats.sh failing intermittently in TEST_backfill_sizeup_out() (degraded outside margin)

Added by Sage Weil about 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2020-02-26T14:02:57.914 INFO:tasks.workunit.client.0.smithi061.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1276: get_is_making_recovery_progress:  local progress=null
2020-02-26T14:02:57.915 INFO:tasks.workunit.client.0.smithi061.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1277: get_is_making_recovery_progress:  test null '!=' null
2020-02-26T14:02:57.915 INFO:tasks.workunit.client.0.smithi061.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1601: wait_for_clean:  ((  13 >= 13  ))
2020-02-26T14:02:57.915 INFO:tasks.workunit.client.0.smithi061.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1602: wait_for_clean:  ceph report
...
2020-02-26T14:02:58.655 INFO:tasks.workunit.client.0.smithi061.stderr:report 2563389435
2020-02-26T14:02:58.655 INFO:tasks.workunit.client.0.smithi061.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1603: wait_for_clean:  return 1
2020-02-26T14:02:58.655 INFO:tasks.workunit.client.0.smithi061.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:732: TEST_backfill_ec_down_out:  return 1
2020-02-26T14:02:58.655 INFO:tasks.workunit.client.0.smithi061.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:37: run:  return 1

/a/sage-2020-02-26_08:10:43-rados-wip-sage2-testing-2020-02-25-2110-distro-basic-smithi/4804098

Seems like about ~30% failure rate (very roughly)?


Related issues

Related to RADOS - Bug #44518: osd/osd-backfill-stats.sh TEST_backfill_out2: wait_for_clean timeout Resolved
Copied to RADOS - Backport #46016: octopus: osd-backfill-stats.sh failing intermittently in TEST_backfill_sizeup_out() (degraded outside margin) Resolved

History

#1 Updated by Neha Ojha about 4 years ago

  • Assignee set to David Zafman

#2 Updated by David Zafman about 4 years ago

The unset of nobackfill happened after an attempt to start backfill was initiated and it deferred due to the flag. After it was unset, backfilling should have been requeued.

2020-02-26T14:01:00.667+0000 7f6071666700 10 osd.1 44 do_recovery starting 1 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=43/44 n=200 ec=32/32 lis/c=43/32 les/c/f=44/33/0 sis=43) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=43 pi=[32,43)/2 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}]
2020-02-26T14:01:00.667+0000 7f6071666700 10 osd.1 pg_epoch: 44 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=43/44 n=200 ec=32/32 lis/c=43/32 les/c/f=44/33/0 sis=43) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=43 pi=[32,43)/2 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}] recover_replicas(1)
2020-02-26T14:01:00.667+0000 7f6071666700 10 osd.1 pg_epoch: 44 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=43/44 n=200 ec=32/32 lis/c=43/32 les/c/f=44/33/0 sis=43) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=43 pi=[32,43)/2 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}] recover_primary recovering 0 in pg, missing missing(0 may_include_deletes = 1)
2020-02-26T14:01:00.667+0000 7f6071666700 10 osd.1 pg_epoch: 44 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=43/44 n=200 ec=32/32 lis/c=43/32 les/c/f=44/33/0 sis=43) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=43 pi=[32,43)/2 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}] deferring backfill due to NOBACKFILL
2020-02-26T14:01:01.797 INFO:tasks.workunit.client.0.smithi061.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:728: TEST_backfill_ec_down_out:  ceph osd unset nobackfill
2020-02-26T14:01:02.651 INFO:tasks.workunit.client.0.smithi061.stderr:nobackfill is unset

#3 Updated by David Zafman about 4 years ago

The kick_recovery_wq didn't get backfill restarted on the failed run. Or a recovery attempt (periodic?) was somehow not scheduled.

Test code:

ceph osd unset nobackfill
ceph tell osd.$primary debug kick_recovery_wq 0
sleep 2

There is no "do_recovery wake up ..." after the kick above in the failed run.

Running the test locally passes because it restarts recovery like this:

2020-02-28T15:52:42.777-0800 7f4e355f3700 10 osd.1 pg_epoch: 42 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}] deferring backfill due to NOBACKFILL
2020-02-28T15:52:42.777-0800 7f4e355f3700 10 osd.1 42 do_recovery started 0/1 on pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}]
2020-02-28T15:52:44.489-0800 7f4e355f3700 10 osd.1 pg_epoch: 42 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}] handle_advance_map: 43
2020-02-28T15:52:44.489-0800 7f4e355f3700 10 osd.1 pg_epoch: 42 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}] handle_advance_map [1,3,2]/[1,2147483647,2] -- 1/1
2020-02-28T15:52:44.489-0800 7f4e355f3700 10 osd.1 pg_epoch: 43 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}] Active: kicking snap trim
2020-02-28T15:52:44.593-0800 7f4e4c621700 20 osd.1 43 do_recovery wake up at 2020-02-28T15:52:44.594206-0800, re-queuing recovery
2020-02-28T15:52:44.593-0800 7f4e355f3700 10 osd.1 43 do_recovery starting 1 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}]
2020-02-28T15:52:44.601-0800 7f4e355f3700 10 osd.1 43 do_recovery started 1/1 on pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling rops=1 mbc={}]
2020-02-28T15:52:44.653-0800 7f4e43630700 20 osd.1 43 OSD::ms_dispatch: command(tid 2: {"prefix": "debug kick_recovery_wq", "delay": 0}) v1
2020-02-28T15:52:44.653-0800 7f4e43630700 20 osd.1 43 _dispatch 0x55f1fc284aa0 command(tid 2: {"prefix": "debug kick_recovery_wq", "delay": 0}) v1
2020-02-28T15:52:44.705-0800 7f4e4c621700 20 osd.1 43 do_recovery wake up at 2020-02-28T15:52:44.707898-0800, re-queuing recovery
2020-02-28T15:52:44.705-0800 7f4e355f3700 10 osd.1 43 do_recovery starting 1 pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}]
2020-02-28T15:52:44.709-0800 7f4e355f3700 10 osd.1 43 do_recovery started 1/1 on pg[1.0s0( v 36'200 (36'100,36'200] local-lis/les=41/42 n=200 ec=33/33 lis/c=41/33 les/c/f=42/34/0 sis=41) [1,3,2]/[1,NONE,2]p1(0) backfill=[3(1)] r=0 lpr=41 pi=[33,41)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling rops=1 mbc={}]

#4 Updated by David Zafman about 4 years ago

  • Related to Bug #44518: osd/osd-backfill-stats.sh TEST_backfill_out2: wait_for_clean timeout added

#5 Updated by David Zafman about 4 years ago

  • Duplicated by Bug #44517: osd/osd-backfill-space.sh TEST_backfill_multi_partial: pgs didn't go active+clean added

#6 Updated by David Zafman about 4 years ago

  • Duplicated by deleted (Bug #44517: osd/osd-backfill-space.sh TEST_backfill_multi_partial: pgs didn't go active+clean)

#7 Updated by David Zafman about 4 years ago

  • Related to deleted (Bug #44518: osd/osd-backfill-stats.sh TEST_backfill_out2: wait_for_clean timeout)

#8 Updated by David Zafman about 4 years ago

  • Related to Bug #44518: osd/osd-backfill-stats.sh TEST_backfill_out2: wait_for_clean timeout added

#9 Updated by Neha Ojha about 4 years ago

Re-purposing this ticket to track a new failure, note that this is different from https://tracker.ceph.com/issues/44518 which got fixed by https://github.com/ceph/ceph/pull/34047.

/a/nojha-2020-03-22_20:02:33-rados:standalone-master-distro-basic-smithi/4880628/

2020-03-22T21:30:42.834 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:97: check:  FIRST=375
2020-03-22T21:30:42.834 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:98: check:  '[' -n 375 ']'
2020-03-22T21:30:42.835 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:99: check:  below_margin 375 400
2020-03-22T21:30:42.835 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:43: below_margin:  local -i check=375
2020-03-22T21:30:42.836 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:44: below_margin:  shift
2020-03-22T21:30:42.837 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:45: below_margin:  local -i target=400
2020-03-22T21:30:42.837 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:47: below_margin:  return 1
2020-03-22T21:30:42.838 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:99: check:  return 1
2020-03-22T21:30:42.839 INFO:tasks.workunit.client.0.smithi180.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:212: TEST_backfill_sizeup_out:  return 1
<pre>

#10 Updated by David Zafman almost 4 years ago

  • Subject changed from osd-backfill-stats.sh failing intermittently to osd-backfill-stats.sh failing intermittently in TEST_backfill_sizeup_out() (degraded outside margin)

It would be helpful to see the osd logs when this happens. We are expecting the following sequence to occur.

Start backfilling
update_calc_stats() logs degraded and misplaced
test greps log for degraded and misplaced at start of backfills

We added a margin of 10 because a fast machine could start moving objects before update_calc_stats() happens.

I'll see what happens if I use norecover flag to give a chance for update_calc_stats() to happen. Or we could add an osd tell command to initiate an update_calc_stats().

#11 Updated by David Zafman almost 4 years ago

  • Status changed from New to In Progress
  • Pull request ID set to 35425

#12 Updated by David Zafman almost 4 years ago

  • Status changed from In Progress to Pending Backport
  • Backport set to octopus

#13 Updated by Nathan Cutler almost 4 years ago

  • Copied to Backport #46016: octopus: osd-backfill-stats.sh failing intermittently in TEST_backfill_sizeup_out() (degraded outside margin) added

#14 Updated by Nathan Cutler over 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Also available in: Atom PDF