Bug #44518
osd/osd-backfill-stats.sh TEST_backfill_out2: wait_for_clean timeout
Description
2020-03-09T03:36:55.899 INFO:tasks.workunit.client.0.smithi037.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1276: get_is_making_recovery_progress: jq -r '.pgmap | .recovering_keys_per_sec + .recovering_bytes_per_sec + .recovering_objects_per_sec'
2020-03-09T03:36:56.374 INFO:tasks.workunit.client.0.smithi037.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1276: get_is_making_recovery_progress: local progress=null
2020-03-09T03:36:56.375 INFO:tasks.workunit.client.0.smithi037.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1277: get_is_making_recovery_progress: test null '!=' null
2020-03-09T03:36:56.375 INFO:tasks.workunit.client.0.smithi037.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1601: wait_for_clean: (( 13 >= 13 ))
2020-03-09T03:36:56.375 INFO:tasks.workunit.client.0.smithi037.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1602: wait_for_clean: ceph report
2020-03-09T03:36:56.843 INFO:tasks.workunit.client.0.smithi037.stdout:{
...
2020-03-09T03:36:56.846 INFO:tasks.workunit.client.0.smithi037.stdout: "message": "pg 1.0 is stuck undersized for 118s, current state active+undersized+degraded+remapped+backfilling, last acting [1,0]"
...
2020-03-09T03:36:57.073 INFO:tasks.workunit.client.0.smithi037.stdout:}
2020-03-09T03:36:57.077 INFO:tasks.workunit.client.0.smithi037.stderr:report 54303303
2020-03-09T03:36:57.077 INFO:tasks.workunit.client.0.smithi037.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1603: wait_for_clean: return 1
2020-03-09T03:36:57.077 INFO:tasks.workunit.client.0.smithi037.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:359: TEST_backfill_out2: return 1
2020-03-09T03:36:57.078 INFO:tasks.workunit.client.0.smithi037.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:37: run: return 1
/a/sage-2020-03-09_01:44:37-rados:standalone-wip-sage2-testing-2020-03-08-1456-distro-basic-smithi/4838491
/a/sage-2020-03-09_01:44:37-rados:standalone-wip-sage2-testing-2020-03-08-1456-distro-basic-smithi/4838500
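The probe that keeps returning null in the log above is just a jq expression over "ceph report". A rough stand-alone version of that check, as a sketch only (assuming the ceph CLI and jq are available; the real get_is_making_recovery_progress in ceph-helpers.sh may differ in details):

progress=$(ceph report 2>/dev/null | \
    jq -r '.pgmap | .recovering_keys_per_sec + .recovering_bytes_per_sec + .recovering_objects_per_sec')
if test "$progress" != null ; then
    echo "recovery is reporting progress ($progress)"
else
    echo "no recovery rates in pgmap -> wait_for_clean keeps waiting"
fi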
Related issues
History
#1 Updated by Sage Weil over 3 years ago
- Project changed from Ceph to RADOS
#2 Updated by Sage Weil over 3 years ago
#3 Updated by David Zafman over 3 years ago
- Related to Bug #44314: osd-backfill-stats.sh failing intermittently in TEST_backfill_sizeup_out() (degraded outside margin) added
#4 Updated by Sage Weil over 3 years ago
Slightly different version of this:
2020-03-10T20:11:07.201 INFO:tasks.workunit.client.0.smithi018.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1276: get_is_making_recovery_progress: jq -r '.pgmap | .recovering_keys_per_sec + .recovering_bytes_per_sec + .recovering_objects_per_sec'
2020-03-10T20:11:07.696 INFO:tasks.workunit.client.0.smithi018.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1276: get_is_making_recovery_progress: local progress=null
2020-03-10T20:11:07.696 INFO:tasks.workunit.client.0.smithi018.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1277: get_is_making_recovery_progress: test null '!=' null
2020-03-10T20:11:07.696 INFO:tasks.workunit.client.0.smithi018.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1601: wait_for_clean: (( 13 >= 13 ))
2020-03-10T20:11:07.696 INFO:tasks.workunit.client.0.smithi018.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1602: wait_for_clean: ceph report
...
2020-03-10T20:11:08.412 INFO:tasks.workunit.client.0.smithi018.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1603: wait_for_clean: return 1
2020-03-10T20:11:08.412 INFO:tasks.workunit.client.0.smithi018.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:304: TEST_backfill_down_out: return 1
2020-03-10T20:11:08.412 INFO:tasks.workunit.client.0.smithi018.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-backfill-stats.sh:37: run: return 1
/a/sage-2020-03-10_16:51:17-rados-wip-sage3-testing-2020-03-10-1037-distro-basic-smithi/4844181
#5 Updated by David Zafman over 3 years ago
- Related to deleted (Bug #44314: osd-backfill-stats.sh failing intermittently in TEST_backfill_sizeup_out() (degraded outside margin))
#6 Updated by David Zafman over 3 years ago
- Related to Bug #44314: osd-backfill-stats.sh failing intermittently in TEST_backfill_sizeup_out() (degraded outside margin) added
#7 Updated by Neha Ojha over 3 years ago
Comparing a passed test with a failed one:
PASSED - note the PG mapping [2,4,3]/[1,0] backfill=[2,3,4]
2020-03-13T19:34:30.763+0000 7f6a43ffa700 20 osd.1 43 _dispatch 0x5600f561c680 command(tid 2: {"prefix": "debug kick_recovery_wq", "delay": 0}) v1
2020-03-13T19:34:30.903+0000 7f6a35fbe700 10 osd.1 44 do_recovery starting 1 pg[1.0( v 36'200 (36'100,36'200] local-lis/les=42/43 n=200 ec=32/32 lis/c=42/35 les/c/f=43/36/0 sis=42) [2,4,3]/[1,0] backfill=[2,3,4] r=0 lpr=42 pi=[35,42)/1 crt=36'200 lcod 36'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}]
FAILED - note the PG mapping [1,2,4]/[1,0] backfill=[2,4]
2020-03-09T03:36:52.373+0000 7fe5ec156700 20 osd.1 44 OSD::ms_dispatch: command(tid 3: {"prefix": "debug kick_recovery_wq", "delay": 0}) v1
2020-03-09T03:36:56.819+0000 7fe5ddf62700 10 osd.1 pg_epoch: 44 pg[1.0( v 37'200 (37'100,37'200] local-lis/les=43/44 n=200 ec=32/32 lis/c=43/35 les/c/f=44/37/0 sis=43) [1,2,4]/[1,0] backfill=[2,4] r=0 lpr=43 pi=[35,43)/2 crt=37'200 lcod 37'199 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={}] do_peering_event: epoch_sent: 43 epoch_requested: 43 RenewLease
This is what the test expects
# [1, 0] -> [2, 3, 4]
# degraded 500 -> 0
# misplaced 1000 -> 0
# state: active+undersized+degraded+remapped+backfilling
# PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
# 1.0 500 0 500 1000 0 0 100 100 active+undersized+degraded+remapped+backfilling 2017-10-27 09:55:50.375722 23'500 27:553 [2,4,3] 2 [1,0] 1 0'0 2017-10-27 09:55:10.230919 0'0 2017-10-27 09:55:10.230919
function TEST_backfill_out2() {
So I think osd.1 being in the up set for the failed test is the problem.
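To double-check that kind of mapping on a live cluster, the up/acting sets can be pulled straight from the PG. A hedged sketch, not part of the test script, and exact JSON field names can vary slightly between releases:

ceph pg 1.0 query | jq '{state, up, acting}'
ceph pg dump pgs_brief | grep '^1\.0'   # UP/ACTING columns for the same PG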
#8 Updated by Neha Ojha over 3 years ago
/a/sage-2020-03-09_01:44:37-rados:standalone-wip-sage2-testing-2020-03-08-1456-distro-basic-smithi/4838500
queue_recovery() gets called in on_active_actmap() when
if (recovery_state.is_peered() &&
    !recovery_state.is_clean() &&
    !recovery_state.get_osdmap()->test_flag(CEPH_OSDMAP_NOBACKFILL) &&
    (!recovery_state.get_osdmap()->test_flag(CEPH_OSDMAP_NOREBALANCE) ||
     recovery_state.is_degraded())) {
  queue_recovery();
}
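Side note: the two osdmap flags referenced in that condition can be inspected from the CLI. A sketch only, using the standard osd dump output, and not something the failure analysis depends on:

ceph osd dump --format json | jq -r '.flags'   # look for nobackfill / norebalance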
On osd.1, the last osd_map epoch change seen was at epoch 44
2020-03-09T03:36:48.746+0000 7fe5f9009700 1 -- [v2:127.0.0.1:6812/484647,v1:127.0.0.1:6813/484647] --> [v2:127.0.0.1:6820/485002,v1:127.0.0.1:6821/485002] -- osd_map(44..44 src has 1..44) v4 -- 0x55e0201aa8c0 con 0x55e01ff8a000
From the mon logs, osd.1 was marked out at epoch 45
2020-03-09T03:36:48.944+0000 7fc997417700 5 --2- v2:127.0.0.1:7114/0 >> 127.0.0.1:0/3123918133 conn(0x55e4a6b94000 0x55e4a7450480 crc :-1 s=READY pgs=2 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x55e4a6d7ba00 seq=7 mon_command_ack([{"prefix": "osd out", "ids": ["osd.1"]}]=0 marked out osd.1. v45) v1
This map change never got propagated to the osd, which is why queue_recovery() never gets called.
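So osd.1 was still on epoch 44 while the "osd out" landed in epoch 45 and never reached it. A quick way to spot that kind of gap on a live cluster (a sketch assuming admin-socket access; "newest_map" comes from the OSD's status command, not from the test) is to compare the OSD's newest map with the cluster epoch:

ceph daemon osd.1 status | jq '.newest_map'   # newest osdmap this OSD has seen
ceph osd dump --format json | jq '.epoch'     # current cluster osdmap epoch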
#9 Updated by Neha Ojha over 3 years ago
- Status changed from New to In Progress
- Assignee set to Neha Ojha
#10 Updated by Neha Ojha over 3 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 34047
#11 Updated by Sage Weil over 3 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to octopus,nautilus
#12 Updated by Nathan Cutler over 3 years ago
- Copied to Backport #44685: octopus: osd/osd-backfill-stats.sh TEST_backfill_out2: wait_for_clean timeout added
#13 Updated by Nathan Cutler over 3 years ago
- Copied to Backport #44686: nautilus: osd/osd-backfill-stats.sh TEST_backfill_out2: wait_for_clean timeout added
#14 Updated by Nathan Cutler over 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".