Bug #10370


"MaxWhileTries: 'wait_until_healthy'reached maximum tries (150) after waiting for 900 seconds" in upgrade:dumpling-firefly-x:stress-split-next-distro-basic-vps run

Added by Yuri Weinstein over 9 years ago. Updated about 9 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Severity: 3 - minor

Description

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-12-17_17:25:01-upgrade:dumpling-firefly-x:stress-split-next-distro-basic-vps/666598/

2014-12-17T23:48:55.902 INFO:tasks.ceph.mon.c.vpm157.stderr:2014-12-18 02:48:55.901685 7f098fba4700 -1 mon.c@2(peon).mds e5 Missing health data for MDS 4112
2014-12-17T23:48:55.916 DEBUG:teuthology.misc:Ceph health: HEALTH_WARN 5 pgs peering; 7 pgs stuck inactive; 7 pgs stuck unclean
2014-12-17T23:49:02.917 INFO:teuthology.orchestra.run.vpm157:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph health'
2014-12-17T23:49:03.175 INFO:tasks.ceph.mon.a.vpm157.stderr:2014-12-18 02:49:03.175628 7fdca36ec700 -1 mon.a@0(leader).mds e5 Missing health data for MDS 4112
2014-12-17T23:49:03.187 DEBUG:teuthology.misc:Ceph health: HEALTH_WARN 5 pgs peering; 7 pgs stuck inactive; 7 pgs stuck unclean
2014-12-17T23:49:04.187 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 55, in run_tasks
    manager.__enter__()
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_next/tasks/ceph.py", line 1090, in restart
    healthy(ctx=ctx, config=None)
  File "/var/lib/teuthworker/src/ceph-qa-suite_next/tasks/ceph.py", line 995, in healthy
    remote=mon0_remote,
  File "/home/teuthworker/src/teuthology_master/teuthology/misc.py", line 853, in wait_until_healthy
    while proceed():
  File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
MaxWhileTries: 'wait_until_healthy'reached maximum tries (150) after waiting for 900 seconds
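
For context on where this exception comes from: wait_until_healthy (teuthology/misc.py) polls `ceph health` inside a bounded retry loop built on the helper in teuthology/contextutil.py, and any status other than HEALTH_OK, including the HEALTH_WARN about peering/stuck PGs seen above, keeps it retrying until the loop gives up. Below is a minimal sketch of that mechanism, not the actual teuthology code; the function names, the 6-second interval, and the message format are assumptions here (150 tries over 900 seconds implies roughly 6 seconds per try).

    import subprocess
    import time


    class MaxWhileTries(Exception):
        """Raised when a bounded retry loop runs out of tries."""


    def safe_while(sleep=6, tries=150, action='wait_until_healthy'):
        """Return a proceed() callable that sleeps between iterations
        and raises MaxWhileTries once `tries` iterations are used up."""
        state = {'count': 0}

        def proceed():
            state['count'] += 1
            if state['count'] > tries:
                # Reproduces the message quoted in this ticket (note the
                # missing space after the quoted action name).
                raise MaxWhileTries("'%s'reached maximum tries (%d) after "
                                    "waiting for %d seconds"
                                    % (action, tries, tries * sleep))
            if state['count'] > 1:
                time.sleep(sleep)
            return True

        return proceed


    def wait_until_healthy():
        """Poll `ceph health` until the cluster reports HEALTH_OK."""
        proceed = safe_while()
        while proceed():
            health = subprocess.check_output(['ceph', 'health']).decode().strip()
            if health.startswith('HEALTH_OK'):
                return
            # HEALTH_WARN (e.g. "5 pgs peering; 7 pgs stuck inactive")
            # counts as not-yet-healthy, so the loop keeps going.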

Related issues (0 open, 1 closed)

Related to Ceph - Bug #10908: "Crash: timed out waiting for admin_socket to appear after osd.11 restart" in upgrade:giant-x-wip-sam-testing-distro-basic-multi (Resolved, Samuel Just, 02/17/2015)

Actions #1

Updated by Samuel Just over 9 years ago

  • Project changed from Ceph to CephFS
Actions #2

Updated by Greg Farnum over 9 years ago

  • Project changed from CephFS to Ceph
  • Subject changed from "mds e5 Missing health data for MDS 4112" in upgrade:dumpling-firefly-x:stress-split-next-distro-basic-vps run to "MaxWhileTries: 'wait_until_healthy'reached maximum tries (150) after waiting for 900 seconds" in upgrade:dumpling-firefly-x:stress-split-next-distro-basic-vps run

Sure looks to me like the error here is the HEALTH_WARN on stuck PGs, rather than the MDS not reporting health (presumably because it's a Dumpling MDS with upgraded Giant monitors).

Actions #3

Updated by Samuel Just about 9 years ago

  • Priority changed from Normal to Urgent
Actions #4

Updated by Samuel Just about 9 years ago

  • Status changed from New to Can't reproduce

Marking this as can't reproduce for now.

Actions #5

Updated by Yuri Weinstein about 9 years ago

  • Status changed from Can't reproduce to New

Reopening, as this looks similar and we now have a core dump.

Run: http://pulpito.ceph.com/teuthology-2015-02-16_17:13:02-upgrade:firefly-x-hammer-distro-basic-multi/
Job: 761725
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-02-16_17:13:02-upgrade:firefly-x-hammer-distro-basic-multi/761725/

In the teuthology log:

2015-02-16T21:49:24.166 INFO:tasks.ceph.mon.a.plana80.stderr:2015-02-16 21:49:24.165894 7f9d53b30700 -1 mon.a@0(leader).mds e5 Missing health data for MDS 4112
2015-02-16T21:50:24.167 INFO:tasks.ceph.mon.a.plana80.stderr:2015-02-16 21:50:24.166225 7f9d53b30700 -1 mon.a@0(leader).mds e5 Missing health data for MDS 4112
2015-02-16T21:51:24.167 INFO:tasks.ceph.mon.a.plana80.stderr:2015-02-16 21:51:24.166554 7f9d53b30700 -1 mon.a@0(leader).mds e5 Missing health data for MDS 4112
2015-02-16T21:52:24.167 INFO:tasks.ceph.mon.a.plana80.stderr:2015-02-16 21:52:24.166884 7f9d53b30700 -1 mon.a@0(leader).mds e5 Missing health data for MDS 4112
2015-02-16T21:53:24.168 INFO:tasks.ceph.mon.a.plana80.stderr:2015-02-16 21:53:24.167222 7f9d53b30700 -1 mon.a@0(leader).mds e5 Missing health data for MDS 4112
2015-02-16T21:53:25.937 INFO:tasks.workunit.client.0.plana65.stderr:test_rbd.test_create_defaults ...
2015-02-16T21:53:25.937 INFO:tasks.workunit:Stopping ['rbd/test_librbd_python.sh'] on client.0...
2015-02-16T21:53:25.938 INFO:teuthology.orchestra.run.plana65:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list /home/ubuntu/cephtest/workunit.client.0'
2015-02-16T21:53:25.948 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 82, in __exit__
    for result in self:
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/var/lib/teuthworker/src/ceph-qa-suite_hammer/tasks/workunit.py", line 360, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/remote.py", line 137, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 378, in run
    r.wait()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 114, in wait
    label=self.label)
CommandFailedError: Command failed (workunit test rbd/test_librbd_python.sh) on plana65 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=firefly TESTDIR="/home/ubuntu/cephtest" CEPH_ID="0" adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/workunit.client.0/rbd/test_librbd_python.sh'
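
Status 124 here is the exit code GNU timeout(1) uses when the wrapped command exceeds its limit, so the rbd workunit ran past its 3-hour cap rather than failing on its own. A quick illustration of that convention (requires GNU coreutils; not part of the test run):

    import subprocess

    # GNU timeout(1) exits with status 124 when the command it wraps
    # runs past the given limit.
    rc = subprocess.call(['timeout', '2', 'sleep', '5'])
    assert rc == 124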

And then in /a/teuthology-2015-02-16_17:13:02-upgrade:firefly-x-hammer-distro-basic-multi/761725/remote/plana13/log/ceph-osd.12.log.gz:

2015-02-16 19:05:53.214124 7fe265cc0700 -1 *** Caught signal (Aborted) **
 in thread 7fe265cc0700

 ceph version 0.80.8-49-g9ef7743 (9ef77430f3d46789b0ba1a2afa42729627734500)
 1: ceph-osd() [0x99bc2a]
 2: (()+0xfcb0) [0x7fe27b6eccb0]
 3: (gsignal()+0x35) [0x7fe279fd7425]
 4: (abort()+0x17b) [0x7fe279fdab8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe27a92a69d]
 6: (()+0xb5846) [0x7fe27a928846]
 7: (()+0xb5873) [0x7fe27a928873]
 8: (()+0xb596e) [0x7fe27a92896e]
 9: (ObjectStore::Transaction::decode(ceph::buffer::list::iterator&)+0x219) [0x83af79]
 10: (ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0x5f3) [0x7d91f3]
 11: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x55c) [0x91dedc]
 12: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1ee) [0x7ba37e]
 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x34a) [0x614a9a]
 14: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x628) [0x630ca8]
 15: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x6772ac]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xa6ee86]
 17: (ThreadPool::WorkThread::entry()+0x10) [0xa70ea0]
 18: (()+0x7e9a) [0x7fe27b6e4e9a]
 19: (clone()+0x6d) [0x7fe27a0953fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #6

Updated by Samuel Just about 9 years ago

  • Status changed from New to Duplicate

That second one at least is #10908, which was fixed a few days ago.
