Bug #15403
closed
rados/test.sh workunit timesout on OpenStack
% Done:
0%
Source:
other
Tags:
Backport:
hammer, jewel
Regression:
No
Severity:
3 - minor
Description
When the thrasher is running together with a validator (lockdep or valgrind), a single test may hang for more than 360 seconds, which is the hardcoded value in src/test/librados/test.h.
http://167.114.242.130:8081/ubuntu-2016-04-06_10:58:21-rados-hammer-backports---basic-openstack/
2016-04-06T11:54:21.968 INFO:tasks.workunit.client.0.target167114239009.stdout:[ OK ] LibRadosAio.TooBig (1158 ms)
2016-04-06T11:54:21.968 INFO:tasks.workunit.client.0.target167114239009.stdout:[ RUN ] LibRadosAio.TooBigPP
2016-04-06T11:54:22.352 INFO:tasks.thrashosds.thrasher:in_osds: [0, 1, 3, 4, 5, 2] out_osds: [] dead_osds: [] live_osds: [1, 0, 3, 2, 5, 4]
2016-04-06T11:54:22.355 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
...
2016-04-06T12:02:18.652 INFO:tasks.workunit.client.0.target167114239009.stderr:Alarm clock
2016-04-06T12:02:18.653 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2016-04-05T22:15:51.410 INFO:tasks.workunit.client.0.target167114225255.stdout:[ OK ] LibRadosAioEC.StatRemovePP (7299 ms)
2016-04-05T22:15:51.410 INFO:tasks.workunit.client.0.target167114225255.stdout:[ RUN ] LibRadosAioEC.OmapPP
2016-04-05T22:15:52.296 INFO:tasks.thrashosds.thrasher:in_osds: [4, 1, 0, 3, 5, 2] out_osds: [] dead_osds: [] live_osds: [1, 0, 3, 2, 5, 4]
2016-04-05T22:15:52.297 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2016-04-05T22:15:52.297 INFO:tasks.thrashosds.thrasher:Killing osd 5, live_osds are [1, 0, 3, 2, 5, 4]
2016-04-05T22:15:52.298 DEBUG:tasks.ceph.osd.5:waiting for process to exit
...
2016-04-05T22:21:54.536 INFO:tasks.workunit.client.0.target167114225255.stderr:Alarm clock
2016-04-05T22:21:54.538 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
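To check a log excerpt against the 360-second value, the gap between a test's "[ RUN ]" line and the following "Alarm clock" line can be computed from the timestamps. The following is a minimal sketch, not part of the Ceph or teuthology code: the helper names `log_time` and `seconds_between` are hypothetical, and the two log lines are copied from the LibRadosAioEC.OmapPP excerpt in this description.

```python
from datetime import datetime

# Two lines copied from the LibRadosAioEC.OmapPP excerpt in this report.
RUN_LINE = (
    "2016-04-05T22:15:51.410 INFO:tasks.workunit.client.0."
    "target167114225255.stdout:[ RUN ] LibRadosAioEC.OmapPP"
)
ALARM_LINE = (
    "2016-04-05T22:21:54.536 INFO:tasks.workunit.client.0."
    "target167114225255.stderr:Alarm clock"
)

# Value quoted in the description (src/test/librados/test.h).
HARDCODED_LIMIT_SECONDS = 360


def log_time(line):
    """Parse the leading timestamp of a log line (hypothetical helper)."""
    stamp = line.split(" ", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f")


def seconds_between(start_line, end_line):
    """Return the elapsed seconds between two log lines (hypothetical helper)."""
    return (log_time(end_line) - log_time(start_line)).total_seconds()


elapsed = seconds_between(RUN_LINE, ALARM_LINE)
print(f"{elapsed:.3f} s between test start and 'Alarm clock'")
print("exceeds hardcoded value:", elapsed > HARDCODED_LIMIT_SECONDS)
```

For this excerpt the gap is just over 363 seconds, which is consistent with the test being killed shortly after the 360-second value was reached.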
Updated by Loïc Dachary about 8 years ago
I don't think that is the problem anymore. With the hammer-backports branch, the underlying cluster blocks forever, with a PG stuck in creating.
ubuntu@target167114242135:~$ sudo ceph -w
2016-04-06 17:45:26.502561 7f9a7ea2b700 0 lockdep start
cluster 5079a5a5-26d9-4b08-bf6c-21206aff8bf6
health HEALTH_WARN
1 pgs stuck inactive
1 pgs stuck unclean
1 requests are blocked > 32 sec
pool rbd pg_num 74 > pgp_num 64
mon.a has mon_osd_down_out_interval set to 0
monmap e1: 3 mons at {a=167.114.242.135:6789/0,b=167.114.242.140:6789/0,c=167.114.242.135:6790/0}
election epoch 4, quorum 0,1,2 a,b,c
osdmap e423: 6 osds: 6 up, 6 in; 1 remapped pgs
pgmap v850: 82 pgs, 2 pools, 128 bytes data, 1 objects
19160 MB used, 207 GB / 236 GB avail
81 active+clean
1 creating
2016-04-06 17:45:23.165192 mon.0 [INF] pgmap v850: 82 pgs: 1 creating, 81 active+clean; 128 bytes data, 19160 MB used, 207 GB / 236 GB avail
^C2016-04-06 17:45:29.756575 7f9a7ea2b700 0 lockdep stop
(reverse-i-search)`': ^C
ubuntu@target167114242135:~$ ceph pg dump | grep create
2016-04-06 17:45:36.838845 7f85643e4700 0 lockdep start
dumped all in format plain
2016-04-06 17:45:36.950630 7f85643e4700 0 lockdep stop
ubuntu@target167114242135:~$ ceph pg dump | grep creatin
2016-04-06 17:45:38.356023 7f9089d6f700 0 lockdep start
dumped all in format plain
79.2 0 0 0 0 0 0 0 0 creating 0.000000 0'0 0:0 [0,1,2] 1 [0,NONE,2] 0 0'0 2016-04-06 14:56:54.957350 0'0 2016-04-06 14:56:54.957350
2016-04-06 17:45:38.469395 7f9089d6f700 0 lockdep stop
ubuntu@target167114242135:~$ ceph osd dump | grep 79
2016-04-06 17:45:56.631351 7f0314807700 0 lockdep start
fsid 5079a5a5-26d9-4b08-bf6c-21206aff8bf6
pool 79 'test-rados-api-target167114242135.teuthology-16553-3' erasure size 3 min_size 2 crush_ruleset 2 object_hash rjenkins pg_num 8 pgp_num 8 last_change 409 flags hashpspool stripe_width 4096
pg_temp 79.2 [0,2147483647,2]
2016-04-06 17:45:56.760952 7f0314807700 0 lockdep stop
ubuntu@target167114242135:~$ ceph osd map test-rados-api-target167114242135.teuthology-16553-3 79.2
2016-04-06 17:46:10.337018 7f72ec004700 0 lockdep start
osdmap e423 pool 'test-rados-api-target167114242135.teuthology-16553-3' (79) object '79.2' -> pg 79.66540d7d (79.5) -> up ([0,4,2], p0) acting ([0,4,2], p0)
2016-04-06 17:46:10.452568 7f72ec004700 0 lockdep stop
By contrast, a run on hammer passed.
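The output in this comment shows PG 79.2 with an acting set of [0,NONE,2] and a pg_temp entry of [0,2147483647,2]. The value 2147483647 is 0x7fffffff, which to my understanding is the placeholder Ceph uses for "no OSD" in a slot (CRUSH_ITEM_NONE), so both lines appear to describe the same missing shard in this size-3 erasure-coded pool. The sketch below is illustrative only and is not Ceph code: `parse_osd_set` and `missing_shards` are hypothetical helpers, and the input strings are copied from the output in this comment.

```python
# 0x7fffffff == 2147483647; assumed here to be Ceph's "no OSD" placeholder.
CRUSH_ITEM_NONE = 0x7FFFFFFF


def parse_osd_set(text):
    """Parse '[0,2147483647,2]' or '[0,NONE,2]' into a list.

    Empty slots are returned as None (hypothetical helper).
    """
    osds = []
    for item in text.strip().strip("[]").split(","):
        item = item.strip()
        if item == "NONE" or int(item) == CRUSH_ITEM_NONE:
            osds.append(None)
        else:
            osds.append(int(item))
    return osds


def missing_shards(osds):
    """Return the positions in the set that have no OSD assigned."""
    return [index for index, osd in enumerate(osds) if osd is None]


# Strings copied from the 'ceph pg dump' and 'ceph osd dump' output.
up_set = parse_osd_set("[0,1,2]")
acting_set = parse_osd_set("[0,NONE,2]")
pg_temp = parse_osd_set("[0,2147483647,2]")

print("up:", up_set, "missing:", missing_shards(up_set))
print("acting:", acting_set, "missing:", missing_shards(acting_set))
print("pg_temp:", pg_temp, "missing:", missing_shards(pg_temp))
```

Under that assumption, the acting set and the pg_temp entry both leave position 1 unassigned, while the up set is complete.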
Updated by Nathan Cutler almost 8 years ago
- Status changed from In Progress to Fix Under Review
master PR: https://github.com/ceph/ceph/pull/8469
Updated by Nathan Cutler almost 8 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport changed from hammer, infernalis to hammer, jewel
Updated by Nathan Cutler almost 8 years ago
- Copied to Backport #15699: hammer: rados/test.sh workunit timesout on OpenStack added
Updated by Nathan Cutler almost 8 years ago
- Copied to Backport #15700: jewel: rados/test.sh workunit timesout on OpenStack added
Updated by Nathan Cutler over 7 years ago
- Status changed from Pending Backport to Resolved