Bug #20397
MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds from radosbench.yaml
Description
2017-06-22T22:00:06.007 INFO:tasks.ceph.osd.2.smithi092.stderr:2017-06-22 22:00:06.009668 7efed8bab700 -1 received signal: Hangup from PID: 84419 task name: /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 2 UID: 0
2017-06-22T22:00:06.011 INFO:teuthology.orchestra.run.smithi092.stderr:2017-06-22 22:00:06.009803 7ff9d7a58700 -1 WARNING: all dangerous and experimental features are enabled.
2017-06-22T22:00:06.033 INFO:teuthology.orchestra.run.smithi092.stderr:osd.0: osd_enable_op_tracker = 'false'
2017-06-22T22:00:06.046 INFO:teuthology.orchestra.run.smithi092.stderr:osd.1: osd_enable_op_tracker = 'false'
2017-06-22T22:00:06.055 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/task/full_sequential.py", line 37, in task
    mgr.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-testing2_2017_7_22/qa/tasks/radosbench.py", line 101, in task
    run.wait(radosbench.itervalues(), timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 432, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 132, in __call__
    raise MaxWhileTries(error_msg)
MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds
2017-06-22T22:00:06.255 ERROR:teuthology.run_tasks: Sentry event: http://sentry.ceph.com/sepia/teuthology/?q=a0e55438714f4b87b87d1ecb2de98d4e
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/task/full_sequential.py", line 37, in task
    mgr.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-testing2_2017_7_22/qa/tasks/radosbench.py", line 101, in task
    run.wait(radosbench.itervalues(), timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 432, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 132, in __call__
    raise MaxWhileTries(error_msg)
MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds
/a/yuriw-2017-06-22_20:51:40-rados-wip-yuri-testing2_2017_7_22-distro-basic-smithi/1317312
/a/yuriw-2017-06-22_20:51:40-rados-wip-yuri-testing2_2017_7_22-distro-basic-smithi/1317345
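For readers unfamiliar with the error text: the numbers in the message are consistent with a bounded polling loop that sleeps a fixed 6 seconds between tries (105 tries x 6 s = 630 s). The sketch below illustrates that pattern only. The names `wait_until` and `MaxTriesExceeded` are hypothetical, the fixed 6-second interval is an assumption inferred from the message, and this is not teuthology's actual implementation.

```python
import time


class MaxTriesExceeded(Exception):
    """Hypothetical stand-in for the MaxWhileTries error seen in the traceback."""


def wait_until(predicate, sleep=6, tries=105, sleep_fn=None):
    """Poll predicate() up to `tries` times, sleeping `sleep` seconds after each miss.

    Returns the 1-based attempt number on which predicate() first returned True.
    Raises MaxTriesExceeded once every try has been used up.
    The defaults (6 s, 105 tries) are assumptions chosen to match the message.
    """
    sleep_fn = time.sleep if sleep_fn is None else sleep_fn
    waited = 0
    for attempt in range(1, tries + 1):
        if predicate():
            return attempt
        sleep_fn(sleep)
        waited += sleep
    raise MaxTriesExceeded(
        "reached maximum tries (%d) after waiting for %d seconds" % (tries, waited)
    )
```

Passing a no-op `sleep_fn` makes the loop easy to exercise without actually waiting; a process that never finishes then produces the same "105 tries / 630 seconds" wording as the log.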
Related issues
History
#1 Updated by Sage Weil almost 7 years ago
/a/sage-2017-06-26_14:37:54-rados-wip-sage-testing2-distro-basic-smithi/1327079
rados/thrash/{0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml backoff/peering.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-require-luminous/at-mkfs.yaml msgr-failures/few.yaml msgr/random.yaml objectstore/bluestore.yaml rados.yaml rocksdb.yaml thrashers/mapgap.yaml workloads/radosbench.yaml}
Analyzing this one, I don't see anything wrong:
- rados bench I/Os block for ~10m because of a backoff
- the backoff happens because pg 7.c has only one osd and min_size is 2
- thrashosds keeps osd.0 (the other 7.c osd) down the whole time while it does a bunch of random stuff (moving pgs around, etc.)
The thrasher is being particularly mean, but I think the failure is just because the rados bench timeout is too short for it. We just doubled the timeout in https://github.com/ceph/ceph/pull/15885, so let's see if this goes away.
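The blocking condition described in the list above can be sketched as a simple check: a PG accepts client I/O only while at least min_size of its OSDs are up. This is a simplified model for illustration, not Ceph's peering logic. The function name `pg_accepts_io` and the placeholder `"osd.other"` (the surviving 7.c OSD is not named in this ticket) are hypothetical.

```python
def pg_accepts_io(up_osds, min_size):
    """Return True if enough of the PG's OSDs are up to serve client I/O.

    Simplified model: the only condition considered is len(up_osds) >= min_size.
    """
    return len(up_osds) >= min_size


# pg 7.c in this run: size 2, min_size 2, osd.0 held down by the thrasher.
# "osd.other" is a placeholder for the surviving OSD.
with_osd0_down = {"osd.other"}
with_both_up = {"osd.0", "osd.other"}

print(pg_accepts_io(with_osd0_down, min_size=2))  # False: I/O is backed off
print(pg_accepts_io(with_both_up, min_size=2))    # True: I/O proceeds
```

With size 2 and min_size 2 there is no slack: any single OSD outage blocks I/O on the PG for as long as it lasts, so a long enough thrash can outlive the bench timeout.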
#2 Updated by Sage Weil almost 7 years ago
/a/sage-2017-06-27_05:44:05-rados-wip-sage-testing-distro-basic-smithi/1331664
rados/thrash/{0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml backoff/peering.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-require-luminous/at-end.yaml msgr-failures/few.yaml msgr/simple.yaml objectstore/bluestore-comp.yaml rados.yaml rocksdb.yaml thrashers/pggrow.yaml workloads/radosbench.yaml}
#3 Updated by Sage Weil almost 7 years ago
- Status changed from 12 to 7
#4 Updated by Sage Weil almost 7 years ago
- Assignee set to Sage Weil
#5 Updated by Sage Weil over 6 years ago
http://pulpito.ceph.com/sage-2017-06-27_15:03:40-rados:thrash-master-distro-basic-smithi/
baseline on master... 5 failed out of 193 total (-s rados/thrash --filter radosbench).
#6 Updated by Sage Weil over 6 years ago
- Status changed from 7 to Resolved
The failure seems to be gone with the timeout change.
#7 Updated by Nathan Cutler over 6 years ago
- Status changed from Resolved to Pending Backport
- Backport set to kraken
#8 Updated by Nathan Cutler over 6 years ago
- Copied to Backport #20497: kraken: MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds from radosbench.yaml added
#9 Updated by Nathan Cutler over 6 years ago
- Status changed from Pending Backport to Resolved