Bug #9620
tests: qa/workunits/cephtool/test.sh race condition
% Done: 100%
Description
OSDs are marked down by the test, and a loop that checks no OSD remains down follows immediately, polling ceph osd dump. The following happened:
test_mon_osd: 600: ceph osd dump
test_mon_osd: 600: grep 'osd.0 up'
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
osd.0 up in weight 1 up_from 143 up_thru 143 down_at 140 last_clean_interval [6,142) 127.0.0.1:6800/17838 127.0.0.1:6815/1017838 127.0.0.1:6816/1017838 127.0.0.1:6817/1017838 exists,up 16d58ecc-f79f-43cd-ad7f-074cc384e12b
test_mon_osd: 602: ceph osd thrash 10
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
will thrash map for 10 epochs
test_mon_osd: 603: seq 0 31
test_mon_osd: 603: ceph osd down 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
marked down osd.0. osd.1 is already down. osd.2 is already down. osd.3 does not exist. osd.4 does not exist. osd.5 does not exist. osd.6 does not exist. osd.7 does not exist. osd.8 does not exist. osd.9 does not exist. osd.10 does not exist. osd.11 does not exist. osd.12 does not exist. osd.13 does not exist. osd.14 does not exist. osd.15 does not exist. osd.16 does not exist. osd.17 does not exist. osd.18 does not exist. osd.19 does not exist. osd.20 does not exist. osd.21 does not exist. osd.22 does not exist. osd.23 does not exist. osd.24 does not exist. osd.25 does not exist. osd.26 does not exist. osd.27 does not exist. osd.28 does not exist. osd.29 does not exist. osd.30 does not exist. osd.31 does not exist.
test_mon_osd: 604: wait_no_osd_down
wait_no_osd_down: 15: seq 1 300
wait_no_osd_down: 15: for i in '$(seq 1 300)'
wait_no_osd_down: 16: check_no_osd_down
check_no_osd_down: 10: ceph osd dump
check_no_osd_down: 10: grep ' down '
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
osd.0 down out weight 0 up_from 143 up_thru 145 down_at 147 last_clean_interval [6,142) 127.0.0.1:6800/17838 127.0.0.1:6815/1017838 127.0.0.1:6816/1017838 127.0.0.1:6817/1017838 exists 16d58ecc-f79f-43cd-ad7f-074cc384e12b
osd.2 down in weight 1 up_from 12 up_thru 143 down_at 146 last_clean_interval [0,0) 127.0.0.1:6810/18282 127.0.0.1:6811/18282 127.0.0.1:6812/18282 127.0.0.1:6813/18282 exists c9d035f4-f848-45fd-8f56-16d5935d2d49
wait_no_osd_down: 17: echo 'waiting for osd(s) to come back up'
waiting for osd(s) to come back up
wait_no_osd_down: 18: sleep 1
wait_no_osd_down: 15: for i in '$(seq 1 300)'
wait_no_osd_down: 16: check_no_osd_down
check_no_osd_down: 10: ceph osd dump
check_no_osd_down: 10: grep ' down '
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
osd.0 down out weight 0 up_from 143 up_thru 145 down_at 147 last_clean_interval [6,142) 127.0.0.1:6800/17838 127.0.0.1:6815/1017838 127.0.0.1:6816/1017838 127.0.0.1:6817/1017838 exists 16d58ecc-f79f-43cd-ad7f-074cc384e12b
osd.1 down in weight 1 up_from 148 up_thru 148 down_at 150 last_clean_interval [0,0) :/0 :/0 :/0 :/0 exists 4d383cb1-db68-4fa1-a94b-3f8a9931943c
osd.2 down out weight 0 up_from 149 up_thru 149 down_at 150 last_clean_interval [0,0) :/0 :/0 :/0 :/0 exists c9d035f4-f848-45fd-8f56-16d5935d2d49
wait_no_osd_down: 17: echo 'waiting for osd(s) to come back up'
waiting for osd(s) to come back up
wait_no_osd_down: 18: sleep 1
wait_no_osd_down: 15: for i in '$(seq 1 300)'
wait_no_osd_down: 16: check_no_osd_down
check_no_osd_down: 10: ceph osd dump
check_no_osd_down: 10: grep ' down '
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
wait_no_osd_down: 20: break
wait_no_osd_down: 23: check_no_osd_down
check_no_osd_down: 10: ceph osd dump
check_no_osd_down: 10: grep ' down '
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
osd.2 down in weight 1 up_from 151 up_thru 151 down_at 155 last_clean_interval [0,0) :/0 :/0 :/0 :/0 exists c9d035f4-f848-45fd-8f56-16d5935d2d49
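For orientation, a minimal sketch of the helpers the trace steps through, reconstructed from the xtrace output above (the real bodies in qa/workunits/cephtool/test.sh may differ):

function check_no_osd_down()
{
    # succeeds only when ceph osd dump reports no osd as ' down '
    ! ceph osd dump | grep ' down '
}

function wait_no_osd_down()
{
    # poll for up to 300 seconds until no osd is reported down
    for i in $(seq 1 300) ; do
        if ! check_no_osd_down ; then
            echo "waiting for osd(s) to come back up"
            sleep 1
        else
            break
        fi
    done
    # final assertion: fails if an osd is down again at this point
    check_no_osd_down
}

The race is visible at the end of the trace: one iteration sees no osd ' down ' and breaks out of the loop, but the final check_no_osd_down fails because ceph osd thrash 10 has marked osd.2 down again in the meantime.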
Associated revisions
qa/workunits/cephtool/test.sh: fix thrash (ultimate)
Keep the osd thrash test to ensure it is a valid command but make it a
noop by giving it a zero argument (meaning thrash 0 OSD maps).
Remove the loops that were added after the command in an attempt to wait
for the cluster to recover and not pollute the rest of the tests. Actual
testing of osd thrash would require a dedicated cluster because its
side effects are random and it is unnecessarily difficult to ensure they
are finished.
http://tracker.ceph.com/issues/9620 Fixes: #9620
Signed-off-by: Loic Dachary <loic-201408@dachary.org>
qa/workunits/cephtool/test.sh: fix thrash (ultimate)
Keep the osd thrash test to ensure it is a valid command but make it a
noop by giving it a zero argument (meaning thrash 0 OSD maps).
Remove the loops that were added after the command in an attempt to wait
for the cluster to recover and not pollute the rest of the tests. Actual
testing of osd thrash would require a dedicated cluster because its
side effects are random and it is unnecessarily difficult to ensure they
are finished.
http://tracker.ceph.com/issues/9620 Fixes: #9620
Signed-off-by: Loic Dachary <loic-201408@dachary.org>
(cherry picked from commit beade63a17db2e6fc68d1f55332d602f8f7cb93a)
Conflicts:
qa/workunits/cephtool/test.sh
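As a rough before/after illustration of the change those commits describe (hypothetical sketch, not the actual diff):

# before: thrash 10 epochs, then wait until ceph osd dump reports no osd down
ceph osd thrash 10
wait_no_osd_down

# after: keep the command coverage but thrash 0 osdmaps, so nothing flaps
# and no recovery wait loop is needed
ceph osd thrash 0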
History
#1 Updated by Loïc Dachary almost 9 years ago
The following sequence happens:
- ceph osd dump finds 3 osd "down"
- ceph osd dump finds no osd "down"
- ceph osd dump finds one osd "down"
Could it be a side effect of the ceph osd thrash 10 that happened a few lines above?
#2 Updated by Loïc Dachary almost 9 years ago
- Status changed from New to 12
- Assignee set to Loïc Dachary
The ceph osd thrash command randomly marks OSDs down and up, which explains the above.
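An illustrative way to observe that flapping on a throwaway dev cluster (not part of the test itself):

ceph osd thrash 10
for i in $(seq 1 5) ; do
    # the set of osds reported ' down ' can change between iterations
    ceph osd dump | grep ' down ' || echo "no osd down"
    sleep 1
done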
#3 Updated by Loïc Dachary almost 9 years ago
- Status changed from 12 to Fix Under Review
- % Done changed from 0 to 80
#4 Updated by Sage Weil almost 9 years ago
- Status changed from Fix Under Review to Pending Backport
#5 Updated by Loïc Dachary almost 9 years ago
- Status changed from Pending Backport to Fix Under Review
#6 Updated by Loïc Dachary almost 9 years ago
gitbuilder running
#7 Updated by Sage Weil almost 9 years ago
- Status changed from Fix Under Review to Resolved
I jumped the gun and merged, oops!
#8 Updated by Loïc Dachary almost 9 years ago
- Status changed from Resolved to 7
#9 Updated by Loïc Dachary almost 9 years ago
I will verify the results when they are ready, but I'm not too concerned ;-)
#10 Updated by Loïc Dachary almost 9 years ago
- Status changed from 7 to Resolved
- % Done changed from 80 to 100