Bug #9074 (closed): gitbuilder: make check does not complete, sometimes

Added by Loïc Dachary over 9 years ago. Updated over 9 years ago.

Status: Duplicate
Priority: High
% Done: 0%
Source: other
Severity: 3 - minor

Description

It looks like the i386 build fails because a timeout interrupts it before it gets a chance to complete.

It could be that the timeout is too short. If the i386 build machines are slower than the others, that would explain why it happens more often on this build.
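
A quick way to test this hypothesis locally (a sketch only; the 3-hour limit below is an assumed placeholder, not gitbuilder's actual setting) is to run make check against an explicit timeout on an i386 box and see whether it gets killed:

# Sketch: run "make check" against an assumed 3h limit (10800 seconds).
time timeout 10800 make check
echo "exit status: $?"   # 124 means the timeout fired before completion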

I've experienced this locally: on master, every now and then, test.sh gets stuck somewhere around the "# make sure everything gets back up+in" step.

http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-tarball-trusty-i386-basic/log.cgi?log=5808d6a6a514a7c7e9cd094a0e047585ac66a161


Files

osd.0.log (448 KB) GOOD - Loïc Dachary, 08/13/2014 01:02 AM
mon.a.log (3.78 MB) GOOD - Loïc Dachary, 08/13/2014 01:03 AM
osd.0.log (403 KB) BAD - Loïc Dachary, 08/13/2014 01:03 AM
mon.a.log (3.54 MB) BAD - Loïc Dachary, 08/13/2014 01:03 AM
Actions #2

Updated by Loïc Dachary over 9 years ago

  • File osd.0.log osd.0.log added
  • Status changed from New to 12
  • Assignee set to Loïc Dachary
  • Priority changed from Normal to High

test.sh fails to complete (~50% of the time) when testing "noup":https://github.com/ceph/ceph/blob/ea731ae14216bb479eff1f86ed6bd4a7cb71fb56/qa/workunits/cephtool/test.sh#L517 with the following trace:

....
pool 0 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 1 flags hashpspool stripe_width 0
max_osd 3
osd.0 down in  weight 1 up_from 4 up_thru 108 down_at 140 last_clean_interval [0,0) 127.0.0.1:6800/31456 127.0.0.1:6801/31456 127.0.0.1:6802/31456 127.0.0.1:6803/31456 exists 5141c944-afcb-42b8-90d3-e7344a6fb169
osd.1 up   in  weight 1 up_from 8 up_thru 140 down_at 0 last_clean_interval [0,0) 127.0.0.1:6805/31667 127.0.0.1:6806/31667 127.0.0.1:6807/31667 127.0.0.1:6808/31667 exists,up 30553181-6a93-466b-9372-08baf202abd5
osd.2 up   in  weight 1 up_from 13 up_thru 140 down_at 0 last_clean_interval [0,0) 127.0.0.1:6810/31901 127.0.0.1:6811/31901 127.0.0.1:6812/31901 127.0.0.1:6813/31901 exists,up 23ab6473-d56c-4b9e-91f0-4f237e2bb7d0
 test_mon_osd: 519: ceph osd dump
 test_mon_osd: 519: grep 'osd.0 down'
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
osd.0 down in  weight 1 up_from 4 up_thru 108 down_at 140 last_clean_interval [0,0) 127.0.0.1:6800/31456 127.0.0.1:6801/31456 127.0.0.1:6802/31456 127.0.0.1:6803/31456 exists 5141c944-afcb-42b8-90d3-e7344a6fb169
 test_mon_osd: 520: ceph osd unset noup
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
unset noup
 test_mon_osd: 521: (( i=0 ))
 test_mon_osd: 521: (( i < 100 ))
 test_mon_osd: 522: grep 'osd.0 up'
 test_mon_osd: 522: ceph osd dump
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
 test_mon_osd: 523: echo 'waiting for osd.0 to come back up'
waiting for osd.0 to come back up
 test_mon_osd: 524: sleep 10
 test_mon_osd: 521: (( i++ ))
 test_mon_osd: 521: (( i < 100 ))
 test_mon_osd: 522: ceph osd dump
 test_mon_osd: 522: grep 'osd.0 up'
...

Attached are the mon and osd.0 logs for a run that is OK and one that is not, for comparison.
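
For reference, the failing section of test.sh around lines 519-524 boils down to a loop of roughly this shape (a reconstruction from the xtrace above, not the verbatim script):

ceph osd dump | grep 'osd.0 down'
ceph osd unset noup
for ((i=0; i < 100; i++)); do
    if ceph osd dump | grep 'osd.0 up'; then
        break
    fi
    echo "waiting for osd.0 to come back up"
    sleep 10
done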

Actions #4

Updated by Loïc Dachary over 9 years ago

Wrong diagnosis: the error does not come from here. The test loops while waiting for the osds to come back up a few lines below; I was confused because the error messages are similar.

Actions #5

Updated by Loïc Dachary over 9 years ago

  • Status changed from 12 to Duplicate