Bug #15403

rados/test.sh workunit times out on OpenStack

Added by Loic Dachary over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport: hammer, jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

When the thrasher is in action together with a validator (lockdep or valgrind), a single test may hang for more than 360 seconds, which is the hardcoded timeout in src/test/librados/test.h.

http://167.114.242.130:8081/ubuntu-2016-04-06_10:58:21-rados-hammer-backports---basic-openstack/

2016-04-06T11:54:21.968 INFO:tasks.workunit.client.0.target167114239009.stdout:[       OK ] LibRadosAio.TooBig (1158 ms)
2016-04-06T11:54:21.968 INFO:tasks.workunit.client.0.target167114239009.stdout:[ RUN      ] LibRadosAio.TooBigPP
2016-04-06T11:54:22.352 INFO:tasks.thrashosds.thrasher:in_osds:  [0, 1, 3, 4, 5, 2] out_osds:  [] dead_osds:  [] live_osds:  [1, 0, 3, 2, 5, 4]
2016-04-06T11:54:22.355 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
...
2016-04-06T12:02:18.652 INFO:tasks.workunit.client.0.target167114239009.stderr:Alarm clock
2016-04-06T12:02:18.653 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...

2016-04-05T22:15:51.410 INFO:tasks.workunit.client.0.target167114225255.stdout:[       OK ] LibRadosAioEC.StatRemovePP (7299 ms)
2016-04-05T22:15:51.410 INFO:tasks.workunit.client.0.target167114225255.stdout:[ RUN      ] LibRadosAioEC.OmapPP
2016-04-05T22:15:52.296 INFO:tasks.thrashosds.thrasher:in_osds:  [4, 1, 0, 3, 5, 2] out_osds:  [] dead_osds:  [] live_osds:  [1, 0, 3, 2, 5, 4]
2016-04-05T22:15:52.297 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2016-04-05T22:15:52.297 INFO:tasks.thrashosds.thrasher:Killing osd 5, live_osds are [1, 0, 3, 2, 5, 4]
2016-04-05T22:15:52.298 DEBUG:tasks.ceph.osd.5:waiting for process to exit
...
2016-04-05T22:21:54.536 INFO:tasks.workunit.client.0.target167114225255.stderr:Alarm clock
2016-04-05T22:21:54.538 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
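
To see how long the last started test had been running when "Alarm clock" was printed, the excerpts above can be measured with a small script. This is a minimal sketch in Python that assumes the log line format shown above; the function name is illustrative and not part of teuthology.

```python
import re
from datetime import datetime

# Timestamp that starts each teuthology log line, e.g. 2016-04-06T11:54:21.968
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) ")
# gtest marker printed when a test starts, followed by the test name
RUN = re.compile(r"\[ RUN\s+\] (\S+)")


def seconds_until_alarm(lines):
    """Return (test_name, seconds) from the last '[ RUN ]' line to 'Alarm clock'.

    Returns None if no 'Alarm clock' line follows a started test.
    """
    started = None
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue  # skip "..." and other lines without a timestamp
        when = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        run = RUN.search(line)
        if run:
            started = (run.group(1), when)
        elif line.rstrip().endswith("Alarm clock") and started:
            return started[0], (when - started[1]).total_seconds()
    return None
```

Applied to the two excerpts above, this gives roughly 476.7 seconds for LibRadosAio.TooBigPP and 363.1 seconds for LibRadosAioEC.OmapPP, both beyond the 360 second limit.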

Related issues

Copied to Ceph - Backport #15699: hammer: rados/test.sh workunit timesout on OpenStack Resolved
Copied to Ceph - Backport #15700: jewel: rados/test.sh workunit timesout on OpenStack Resolved

History

#1 Updated by Loic Dachary over 3 years ago

I no longer think the timeout is the problem. With the hammer-backports branch, the underlying cluster blocks forever, with a PG stuck in creating:

ubuntu@target167114242135:~$ sudo ceph -w
2016-04-06 17:45:26.502561 7f9a7ea2b700  0 lockdep start
    cluster 5079a5a5-26d9-4b08-bf6c-21206aff8bf6
     health HEALTH_WARN
            1 pgs stuck inactive
            1 pgs stuck unclean
            1 requests are blocked > 32 sec
            pool rbd pg_num 74 > pgp_num 64
            mon.a has mon_osd_down_out_interval set to 0
     monmap e1: 3 mons at {a=167.114.242.135:6789/0,b=167.114.242.140:6789/0,c=167.114.242.135:6790/0}
            election epoch 4, quorum 0,1,2 a,b,c
     osdmap e423: 6 osds: 6 up, 6 in; 1 remapped pgs
      pgmap v850: 82 pgs, 2 pools, 128 bytes data, 1 objects
            19160 MB used, 207 GB / 236 GB avail
                  81 active+clean
                   1 creating

2016-04-06 17:45:23.165192 mon.0 [INF] pgmap v850: 82 pgs: 1 creating, 81 active+clean; 128 bytes data, 19160 MB used, 207 GB / 236 GB avail
^C2016-04-06 17:45:29.756575 7f9a7ea2b700  0 lockdep stop
ubuntu@target167114242135:~$ ceph pg dump | grep create
2016-04-06 17:45:36.838845 7f85643e4700  0 lockdep start
dumped all in format plain
2016-04-06 17:45:36.950630 7f85643e4700  0 lockdep stop
ubuntu@target167114242135:~$ ceph pg dump | grep creatin
2016-04-06 17:45:38.356023 7f9089d6f700  0 lockdep start
dumped all in format plain
79.2    0    0    0    0    0    0    0    0    creating    0.000000    0'0    0:0    [0,1,2]    1    [0,NONE,2]    0    0'0    2016-04-06 14:56:54.957350    0'0    2016-04-06 14:56:54.957350
2016-04-06 17:45:38.469395 7f9089d6f700  0 lockdep stop
ubuntu@target167114242135:~$ ceph osd dump | grep 79
2016-04-06 17:45:56.631351 7f0314807700  0 lockdep start
fsid 5079a5a5-26d9-4b08-bf6c-21206aff8bf6
pool 79 'test-rados-api-target167114242135.teuthology-16553-3' erasure size 3 min_size 2 crush_ruleset 2 object_hash rjenkins pg_num 8 pgp_num 8 last_change 409 flags hashpspool stripe_width 4096
pg_temp 79.2 [0,2147483647,2]
2016-04-06 17:45:56.760952 7f0314807700  0 lockdep stop
ubuntu@target167114242135:~$ ceph osd map test-rados-api-target167114242135.teuthology-16553-3 79.2
2016-04-06 17:46:10.337018 7f72ec004700  0 lockdep start
osdmap e423 pool 'test-rados-api-target167114242135.teuthology-16553-3' (79) object '79.2' -> pg 79.66540d7d (79.5) -> up ([0,4,2], p0) acting ([0,4,2], p0)
2016-04-06 17:46:10.452568 7f72ec004700  0 lockdep stop

A run on hammer passed.
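
The pg_temp entry and the acting set in the session above describe the same state: 2147483647 is 0x7fffffff, the value CRUSH uses for an empty slot, which ceph pg dump prints as NONE. The following is a minimal sketch in Python for spotting such entries in ceph osd dump output; the helper names are illustrative and not part of any Ceph tool.

```python
import re

# CRUSH uses 0x7fffffff (2147483647) to mean "no OSD in this slot".
CRUSH_ITEM_NONE = 0x7FFFFFFF


def parse_pg_temp(line):
    """Parse a 'pg_temp <pgid> [a,b,c]' line as printed by `ceph osd dump`.

    Returns (pgid, osds) with empty slots represented as None,
    or None if the line is not a pg_temp entry.
    """
    m = re.match(r"pg_temp (\S+) \[([\d,]*)\]", line.strip())
    if not m:
        return None
    osds = [int(x) for x in m.group(2).split(",") if x]
    return m.group(1), [None if o == CRUSH_ITEM_NONE else o for o in osds]


def missing_slots(osds):
    """Return the indexes of slots that have no OSD assigned."""
    return [i for i, o in enumerate(osds) if o is None]
```

For the line pg_temp 79.2 [0,2147483647,2] this reports PG 79.2 with slot 1 empty, which matches the acting set [0,NONE,2] in the pg dump.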

#2 Updated by Nathan Cutler over 3 years ago

  • Status changed from In Progress to Fix Under Review

#3 Updated by Nathan Cutler over 3 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from hammer, infernalis to hammer, jewel

#4 Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #15699: hammer: rados/test.sh workunit timesout on OpenStack added

#5 Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #15700: jewel: rados/test.sh workunit timesout on OpenStack added

#6 Updated by Nathan Cutler over 3 years ago

  • Status changed from Pending Backport to Resolved
