Bug #20397

MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds from radosbench.yaml

Added by Sage Weil almost 7 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: kraken
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2017-06-22T22:00:06.007 INFO:tasks.ceph.osd.2.smithi092.stderr:2017-06-22 22:00:06.009668 7efed8bab700 -1 received  signal: Hangup from  PID: 84419 task name: /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 2  UID: 0
2017-06-22T22:00:06.011 INFO:teuthology.orchestra.run.smithi092.stderr:2017-06-22 22:00:06.009803 7ff9d7a58700 -1 WARNING: all dangerous and experimental features are enabled.
2017-06-22T22:00:06.033 INFO:teuthology.orchestra.run.smithi092.stderr:osd.0: osd_enable_op_tracker = 'false'
2017-06-22T22:00:06.046 INFO:teuthology.orchestra.run.smithi092.stderr:osd.1: osd_enable_op_tracker = 'false'
2017-06-22T22:00:06.055 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/task/full_sequential.py", line 37, in task
    mgr.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-testing2_2017_7_22/qa/tasks/radosbench.py", line 101, in task
    run.wait(radosbench.itervalues(), timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 432, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 132, in __call__
    raise MaxWhileTries(error_msg)
MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds
2017-06-22T22:00:06.255 ERROR:teuthology.run_tasks: Sentry event: http://sentry.ceph.com/sepia/teuthology/?q=a0e55438714f4b87b87d1ecb2de98d4e
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/task/full_sequential.py", line 37, in task
    mgr.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-testing2_2017_7_22/qa/tasks/radosbench.py", line 101, in task
    run.wait(radosbench.itervalues(), timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 432, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 132, in __call__
    raise MaxWhileTries(error_msg)
MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds

/a/yuriw-2017-06-22_20:51:40-rados-wip-yuri-testing2_2017_7_22-distro-basic-smithi/1317312
/a/yuriw-2017-06-22_20:51:40-rados-wip-yuri-testing2_2017_7_22-distro-basic-smithi/1317345


Related issues

Copied to RADOS - Backport #20497: kraken: MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds from radosbench.yaml Resolved

History

#1 Updated by Sage Weil almost 7 years ago

/a/sage-2017-06-26_14:37:54-rados-wip-sage-testing2-distro-basic-smithi/1327079
rados/thrash/{0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml backoff/peering.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-require-luminous/at-mkfs.yaml msgr-failures/few.yaml msgr/random.yaml objectstore/bluestore.yaml rados.yaml rocksdb.yaml thrashers/mapgap.yaml workloads/radosbench.yaml}

Analyzing this one, I don't see anything wrong:
- rados bench I/Os block for ~10m because of a backoff
- the backoff happens because pg 7.c is left with only one osd and min_size is 2
- thrashosds keeps osd.0 (the other 7.c osd) down the whole time while it does a bunch of other random stuff (moving pgs around, etc.).
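
To make the blocking condition in the list above concrete, here is a toy model of it. This is not Ceph code: the function name, the data shapes, and the choice of osd.1 as the second OSD for pg 7.c are all hypothetical, and the only assumption is that a PG serves I/O when at least min_size of its OSDs are up.

```python
def pg_accepts_io(pg_osds, up_osds, min_size):
    """Toy model: a PG serves I/O only if at least min_size of its OSDs are up."""
    live = [osd for osd in pg_osds if osd in up_osds]
    return len(live) >= min_size


# Hypothetical layout: pg 7.c lives on osd.0 plus one other OSD (osd.1 here).
pg_7c = [0, 1]

# Both OSDs up: two live replicas meet min_size=2.
print(pg_accepts_io(pg_7c, up_osds={0, 1, 2}, min_size=2))

# osd.0 held down by the thrasher: one live replica, so I/O waits.
print(pg_accepts_io(pg_7c, up_osds={1, 2}, min_size=2))
```

With size 2 and min_size 2 there is no slack, so in this model any single OSD being down for a long stretch stalls client I/O on that PG for the same stretch.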

The thrasher is being particularly mean, but I think the failure is just because the rados bench timeout is too short for it. We just doubled the timeout in https://github.com/ceph/ceph/pull/15885, so let's see if this goes away.
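
For reference, the numbers in the error message work out to 630 / 105 = 6 seconds per try. The sketch below shows how a bounded polling loop of that shape produces the message. It is a stand-in, not teuthology's implementation: `wait_until`, `tries_for`, and the local `MaxWhileTries` class are hypothetical names, and the 1260 s figure assumes the 630 s budget itself is what gets doubled while the 6 s interval stays the same.

```python
import time


class MaxWhileTries(Exception):
    """Stand-in for the exception named in the traceback (not teuthology's class)."""


def wait_until(done, tries, sleep_s, sleep=time.sleep):
    """Poll done() up to `tries` times, sleeping sleep_s seconds after each miss."""
    for attempt in range(tries):
        if done():
            return attempt
        sleep(sleep_s)
    raise MaxWhileTries(
        "reached maximum tries (%d) after waiting for %d seconds"
        % (tries, tries * sleep_s)
    )


def tries_for(timeout_s, sleep_s):
    """Number of polls that fit into a timeout at a fixed polling interval."""
    return timeout_s // sleep_s


print(tries_for(630, 6))   # 105, matching the error message
print(tries_for(1260, 6))  # 210, if the 630 s budget were doubled (assumption)
```

In this model a ~10 minute stall sits right at the edge of a 630 s budget, so any extra delay tips the run over, while a doubled budget leaves plenty of room.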

#2 Updated by Sage Weil almost 7 years ago

/a/sage-2017-06-27_05:44:05-rados-wip-sage-testing-distro-basic-smithi/1331664
rados/thrash/{0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml backoff/peering.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-require-luminous/at-end.yaml msgr-failures/few.yaml msgr/simple.yaml objectstore/bluestore-comp.yaml rados.yaml rocksdb.yaml thrashers/pggrow.yaml workloads/radosbench.yaml}

#3 Updated by Sage Weil almost 7 years ago

  • Status changed from 12 to 7

#4 Updated by Sage Weil almost 7 years ago

  • Assignee set to Sage Weil

#5 Updated by Sage Weil over 6 years ago

http://pulpito.ceph.com/sage-2017-06-27_15:03:40-rados:thrash-master-distro-basic-smithi/

baseline on master... 5 failed out of 193 total (-s rados/thrash --filter radosbench).

#6 Updated by Sage Weil over 6 years ago

  • Status changed from 7 to Resolved

failure seems to be gone with the timeout change.

#7 Updated by Nathan Cutler over 6 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to kraken

#8 Updated by Nathan Cutler over 6 years ago

  • Copied to Backport #20497: kraken: MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds from radosbench.yaml added

#9 Updated by Nathan Cutler over 6 years ago

  • Status changed from Pending Backport to Resolved
