Bug #14615

ceph_manager task hanging indefinitely, instead of failing

Added by Zack Cerza about 8 years ago. Updated over 7 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seen in http://pulpito.front.sepia.ceph.com/haomai-2016-01-30_15:25:39-rados:thrash-wip-14566-distro-basic-smithi/52427/

The exception raised by ceph_manager should have caused the job to terminate.

Instead, the job kept running and logged another 72GB after this point:

2016-01-30T16:24:55.708 INFO:teuthology.orchestra.run.smithi020:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config set filestore_inject_stall 3'
2016-01-30T16:24:55.795 INFO:tasks.rados.rados.0.smithi020.stdout:1298: done (15 left)
2016-01-30T16:24:55.795 INFO:tasks.rados.rados.0.smithi020.stdout:1299: copy_from oid 402 from oid 281 current snap is 91
2016-01-30T16:24:55.795 INFO:tasks.rados.rados.0.smithi020.stdout:1297:  finishing rollback tid 0 to smithi02021960-329
2016-01-30T16:24:55.795 INFO:tasks.rados.rados.0.smithi020.stdout:1297:  finishing rollback tid 1 to smithi02021960-329
2016-01-30T16:24:55.796 INFO:tasks.rados.rados.0.smithi020.stdout:1297:  finishing rollback tid 2 to smithi02021960-329
2016-01-30T16:24:55.796 INFO:tasks.rados.rados.0.smithi020.stdout:update_object_version oid 329 v 1210 (ObjNum 651 snap 44 seq_num 651) dirty exists
2016-01-30T16:24:55.797 INFO:tasks.rados.rados.0.smithi020.stdout:1297: done (15 left)
2016-01-30T16:24:55.797 INFO:tasks.rados.rados.0.smithi020.stdout:1300: delete oid 81 current snap is 91
2016-01-30T16:24:55.816 INFO:tasks.rados.rados.0.smithi020.stdout:1299:  finishing copy_from to smithi02021960-402
2016-01-30T16:24:55.816 INFO:tasks.rados.rados.0.smithi020.stdout:update_object_version oid 402 v 1084 (ObjNum 420 snap 0 seq_num 420) dirty exists
2016-01-30T16:24:55.845 INFO:tasks.rados.rados.0.smithi020.stdout:1299:  finishing copy_from racing read to smithi02021960-402
2016-01-30T16:24:55.860 INFO:teuthology.orchestra.run.smithi020.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2016-01-30T16:24:55.867 INFO:tasks.ceph.ceph_manager:waiting on admin_socket for osd-1, ['config', 'set', 'filestore_inject_stall', '3']
2016-01-30T16:25:00.868 INFO:teuthology.orchestra.run.smithi020:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config set filestore_inject_stall 3'
2016-01-30T16:25:01.033 INFO:teuthology.orchestra.run.smithi020.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2016-01-30T16:25:01.043 INFO:tasks.ceph.ceph_manager:waiting on admin_socket for osd-1, ['config', 'set', 'filestore_inject_stall', '3']
2016-01-30T16:25:06.045 INFO:teuthology.orchestra.run.smithi020:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config set filestore_inject_stall 3'
2016-01-30T16:25:06.204 INFO:teuthology.orchestra.run.smithi020.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2016-01-30T16:25:06.212 INFO:tasks.ceph.ceph_manager:waiting on admin_socket for osd-1, ['config', 'set', 'filestore_inject_stall', '3']
2016-01-30T16:25:11.213 INFO:teuthology.orchestra.run.smithi020:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config set filestore_inject_stall 3'
2016-01-30T16:25:11.370 INFO:teuthology.orchestra.run.smithi020.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2016-01-30T16:25:11.378 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 650, in wrapper
    return func(self)
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 704, in do_thrash
    self.choose_action()()
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 628, in <lambda>
    False),
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 501, in inject_pause
    self.ceph_manager.set_config(the_one, **{conf_key: duration})
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 1078, in set_config
    ['config', 'set', str(k), str(v)])
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 1053, in wait_run_admin_socket
    id=service_id))
Exception: timed out waiting for admin_socket to appear after osd.1 restart

2016-01-30T16:25:11.379 CRITICAL:root:  File "gevent/core.pyx", line 348, in gevent.core.loop.handle_error (gevent/gevent.core.c:6053)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 300, in handle_error
    self.print_exception(context, type, value, tb)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 323, in print_exception
    traceback.print_exception(type, value, tb)
  File "/usr/lib/python2.7/traceback.py", line 124, in print_exception
    _print(file, 'Traceback (most recent call last):')
  File "/usr/lib/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator)

2016-01-30T16:25:11.379 CRITICAL:root:IOError
2016-01-30T18:24:33.394 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 02:24:33.380087 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 5774..6486) lease_expire from mon.0 172.21.15.20:6789/0 is 0.891053 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T18:48:07.511 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 02:48:07.490680 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 6778..7459) lease_expire from mon.0 172.21.15.20:6789/0 is 5.119297 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T18:49:14.766 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 02:49:14.763190 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 6778..7503) lease_expire from mon.0 172.21.15.20:6789/0 is 0.812451 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:09:01.670 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:09:01.610067 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 7782..8354) lease_expire from mon.0 172.21.15.20:6789/0 is 3.175789 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:13:36.036 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:13:36.033045 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 8033..8554) lease_expire from mon.0 172.21.15.20:6789/0 is 2.549258 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:37:38.148 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:37:38.134380 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 8786..9493) lease_expire from mon.0 172.21.15.20:6789/0 is 0.822134 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:38:11.403 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:38:11.400640 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 8786..9518) lease_expire from mon.0 172.21.15.20:6789/0 is 2.541366 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:38:45.790 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:38:45.786853 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 9037..9543) lease_expire from mon.0 172.21.15.20:6789/0 is 4.030940 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:47:55.001 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:47:54.989453 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 9288..9914) lease_expire from mon.0 172.21.15.20:6789/0 is 2.506908 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:54:07.022 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:54:07.020086 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 9539..10158) lease_expire from mon.0 172.21.15.20:6789/0 is 2.266840 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:57:22.848 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:57:22.845987 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 9539..10276) lease_expire from mon.0 172.21.15.20:6789/0 is 2.696128 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T20:37:49.765 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 04:37:49.750290 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 11296..11840) lease_expire from mon.0 172.21.15.20:6789/0 is 0.806196 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T20:46:16.034 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 04:46:16.032108 7f69cef42700 -1 mon.c@2(peon).data_health(104) reached critical levels of available space on local monitor storage -- shutdown!
2016-01-30T20:46:16.059 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 04:46:16.057651 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 11547..12172) lease_expire from mon.0 172.21.15.20:6789/0 is 0.175679 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T20:46:16.072 INFO:tasks.ceph.mon.a.smithi020.stderr:2016-01-31 04:46:16.070062 7f20811a9700 -1 mon.a@0(leader).data_health(104) reached critical levels of available space on local monitor storage -- shutdown!
2016-01-30T20:46:16.084 INFO:tasks.ceph.mon.a.smithi020.stderr:2016-01-31 04:46:16.082600 7f20809a8700 -1 mon.a@0(leader) e1 *** Got Signal Interrupt ***
2016-01-30T20:46:16.085 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 04:46:16.082632 7f69ce741700 -1 mon.c@2(peon) e1 *** Got Signal Interrupt ***
2016-01-30T20:58:44.493 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.493 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.493 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.493 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
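
The behaviour above is consistent with how gevent handles greenlet failures: an exception in a spawned greenlet only kills that greenlet, and the rest of the job keeps running unless something explicitly joins or links it. Below is a minimal, illustrative sketch of that pattern (the names background_thrasher and main_job are made up; this is not the teuthology code):

import gevent


def background_thrasher():
    # Stand-in for the thrasher loop that raised
    # "timed out waiting for admin_socket to appear after osd.1 restart".
    raise Exception('timed out waiting for admin_socket')


def main_job():
    thrasher = gevent.spawn(background_thrasher)

    # The main job keeps doing work; nothing here notices that the
    # thrasher greenlet has already died with an exception.
    for i in range(3):
        gevent.sleep(0.1)
        print('main job still running, iteration %d' % i)

    # Only an explicit join()/get() surfaces the failure. If this point is
    # never reached, the job appears to hang instead of failing.
    thrasher.get()  # re-raises the greenlet's exception here


if __name__ == '__main__':
    main_job()

When a greenlet dies this way, gevent's hub also tries to print the traceback itself (hub.print_exception in the CRITICAL lines above); in this run that printing is what raised the IOError shown right after the thrasher traceback.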

History

#1 Updated by Zack Cerza about 8 years ago

Filed #14616 against ceph because I believe the logging is excessive.

#2 Updated by Yuri Weinstein about 8 years ago

  • Project changed from ceph-qa-suite to Ceph

#3 Updated by Sage Weil about 8 years ago

  • Status changed from New to Resolved

#4 Updated by Zack Cerza about 8 years ago

How was this resolved?

#5 Updated by Zack Cerza almost 8 years ago

  • Status changed from Resolved to 12

Nope.
http://pulpito.ceph.com/vasu-2016-03-30_18:27:23-rados-jewel---basic-mira/99495

2016-03-30T23:16:26.591 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/ceph_manager.py", line 639, in wrapper
    return func(self)
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/ceph_manager.py", line 686, in do_thrash
    timeout=self.config.get('timeout')
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/ceph_manager.py", line 1628, in wait_for_recovery
    'failed to recover before timeout expired'
AssertionError: failed to recover before timeout expired

2016-03-30T23:16:26.592 CRITICAL:root:  File "gevent/corecext.pyx", line 360, in gevent.corecext.loop.handle_error (gevent/gevent.corecext.c:6344)
  File "/home/teuthworker/src/teuthology_wip-fix-nuke/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 563, in handle_error
    self.print_exception(context, type, value, tb)
  File "/home/teuthworker/src/teuthology_wip-fix-nuke/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 591, in print_exception
    traceback.print_exception(type, value, tb)
  File "/usr/lib/python2.7/traceback.py", line 124, in print_exception
    _print(file, 'Traceback (most recent call last):')
  File "/usr/lib/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator)

2016-03-30T23:16:26.592 CRITICAL:root:IOError
2016-03-30T23:52:02.648 INFO:tasks.ceph.mon.a.mira012.stderr:2016-03-31 06:52:02.647746 7f4de6d13700 -1 mon.a@0(leader).osd e80 no osd or pg stats from osd.0 since 2016-03-31 06:37:01.272357, 901.375342 seconds ago.  marking down
2016-03-31T00:08:12.696 INFO:tasks.ceph.mon.a.mira012.stderr:2016-03-31 07:08:12.696370 7f4de6d13700 -1 mon.a@0(leader).osd e84 no osd or pg stats from osd.0 since 2016-03-31 06:53:11.857475, 900.838848 seconds ago.  marking down
2016-03-31T00:35:02.771 INFO:tasks.ceph.mon.a.mira012.stderr:2016-03-31 07:35:02.771620 7f4de6d13700 -1 mon.a@0(leader).osd e89 no osd or pg stats from osd.0 since 2016-03-31 07:20:02.632755, 900.138820 seconds ago.  marking down
2016-03-31T01:10:47.878 INFO:tasks.ceph.mon.a.mira012.stderr:2016-03-31 08:10:47.875917 7f4de6d13700 -1 mon.a@0(leader).osd e92 no osd or pg stats from osd.0 since 2016-03-31 07:55:42.931157, 904.944716 seconds ago.  marking down
2016-03-31T01:46:22.989 INFO:tasks.ceph.mon.a.mira012.stderr:2016-03-31 08:46:22.985420 7f4de6d13700 -1 mon.a@0(leader).osd e96 no osd or pg stats from osd.0 since 2016-03-31 08:31:22.229274, 900.756101 seconds ago.  marking down
2016-03-31T02:32:13.123 INFO:tasks.ceph.mon.a.mira012.stderr:2016-03-31 09:32:13.117813 7f4de6d13700 -1 mon.a@0(leader).osd e100 no osd or pg stats from osd.0 since 2016-03-31 09:17:09.613880, 903.503889 seconds ago.  marking down
2016-03-31T02:58:43.203 INFO:tasks.ceph.mon.a.mira012.stderr:2016-03-31 09:58:43.197724 7f4de6d13700 -1 mon.a@0(leader).osd e107 no osd or pg stats from osd.0 since 2016-03-31 09:43:42.833180, 900.364494 seconds ago.  marking down
2016-03-31T03:14:53.250 INFO:tasks.ceph.mon.a.mira012.stderr:2016-03-31 10:14:53.244934 7f4de6d13700 -1 mon.a@0(leader).osd e110 no osd or pg stats from osd.0 since 2016-03-31 09:59:51.962128, 901.282764 seconds ago.  marking down
2016-03-31T03:51:33.356 INFO:tasks.ceph.mon.a.mira012.stderr:2016-03-31 10:51:33.351102 7f4de6d13700 -1 mon.a@0(leader).osd e115 no osd or pg stats from osd.0 since 2016-03-31 10:36:32.289194, 901.061862 seconds ago.  marking down
2016-03-31T04:07:53.404 INFO:tasks.ceph.mon.a.mira012.stderr:2016-03-31 11:07:53.399632 7f4de6d13700 -1 mon.a@0(leader).osd e118 no osd or pg stats from osd.1 since 2016-03-31 10:52:52.831942, 900.567637 seconds ago.  marking down

#6 Updated by Loïc Dachary almost 8 years ago

Could it be related to http://tracker.ceph.com/issues/15300? I don't think so. It is starting to look like this gevent traceback is actually hiding the real cause of the error. Worse, in that case it fails the job without collecting the logs.
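
For illustration, one way to keep the real cause visible is to record the first exception raised inside the thrasher greenlet and re-raise it when the task joins the thrasher, rather than relying on gevent's hub to print it (the failure of that printer is where the CRITICAL:root:IOError lines above come from). This is only a simplified sketch with made-up names (Thrasher, do_thrash, join), not the ceph-qa-suite code:

import gevent


class Thrasher(object):
    def __init__(self):
        self.exception = None
        self.greenlet = gevent.spawn(self._run)

    def _run(self):
        try:
            self.do_thrash()
        except Exception as e:
            # Remember the original failure instead of letting it die with
            # the greenlet and depending on gevent's error printer.
            self.exception = e

    def do_thrash(self):
        # Stand-in for the real thrashing loop.
        raise AssertionError('failed to recover before timeout expired')

    def join(self):
        self.greenlet.join()
        if self.exception is not None:
            # Surface the real cause to the caller so the job fails promptly
            # and log collection can still run.
            raise self.exception


if __name__ == '__main__':
    Thrasher().join()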

#7 Updated by Samuel Just over 7 years ago

  • Status changed from 12 to Can't reproduce
