Bug #14616

When logging to a file fails, ceph logs excessively to stderr

Added by Zack Cerza about 8 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport: hammer
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seen in http://pulpito.front.sepia.ceph.com/haomai-2016-01-30_15:25:39-rados:thrash-wip-14566-distro-basic-smithi/52427/

The exception raised by ceph_manager should have caused the job to terminate (filed as #14615).

Instead, the job continued to run and logged another 72GB after this point:

2016-01-30T16:24:55.708 INFO:teuthology.orchestra.run.smithi020:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config set filestore_inject_stall 3'
2016-01-30T16:24:55.795 INFO:tasks.rados.rados.0.smithi020.stdout:1298: done (15 left)
2016-01-30T16:24:55.795 INFO:tasks.rados.rados.0.smithi020.stdout:1299: copy_from oid 402 from oid 281 current snap is 91
2016-01-30T16:24:55.795 INFO:tasks.rados.rados.0.smithi020.stdout:1297:  finishing rollback tid 0 to smithi02021960-329
2016-01-30T16:24:55.795 INFO:tasks.rados.rados.0.smithi020.stdout:1297:  finishing rollback tid 1 to smithi02021960-329
2016-01-30T16:24:55.796 INFO:tasks.rados.rados.0.smithi020.stdout:1297:  finishing rollback tid 2 to smithi02021960-329
2016-01-30T16:24:55.796 INFO:tasks.rados.rados.0.smithi020.stdout:update_object_version oid 329 v 1210 (ObjNum 651 snap 44 seq_num 651) dirty exists
2016-01-30T16:24:55.797 INFO:tasks.rados.rados.0.smithi020.stdout:1297: done (15 left)
2016-01-30T16:24:55.797 INFO:tasks.rados.rados.0.smithi020.stdout:1300: delete oid 81 current snap is 91
2016-01-30T16:24:55.816 INFO:tasks.rados.rados.0.smithi020.stdout:1299:  finishing copy_from to smithi02021960-402
2016-01-30T16:24:55.816 INFO:tasks.rados.rados.0.smithi020.stdout:update_object_version oid 402 v 1084 (ObjNum 420 snap 0 seq_num 420) dirty exists
2016-01-30T16:24:55.845 INFO:tasks.rados.rados.0.smithi020.stdout:1299:  finishing copy_from racing read to smithi02021960-402
2016-01-30T16:24:55.860 INFO:teuthology.orchestra.run.smithi020.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2016-01-30T16:24:55.867 INFO:tasks.ceph.ceph_manager:waiting on admin_socket for osd-1, ['config', 'set', 'filestore_inject_stall', '3']
2016-01-30T16:25:00.868 INFO:teuthology.orchestra.run.smithi020:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config set filestore_inject_stall 3'
2016-01-30T16:25:01.033 INFO:teuthology.orchestra.run.smithi020.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2016-01-30T16:25:01.043 INFO:tasks.ceph.ceph_manager:waiting on admin_socket for osd-1, ['config', 'set', 'filestore_inject_stall', '3']
2016-01-30T16:25:06.045 INFO:teuthology.orchestra.run.smithi020:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config set filestore_inject_stall 3'
2016-01-30T16:25:06.204 INFO:teuthology.orchestra.run.smithi020.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2016-01-30T16:25:06.212 INFO:tasks.ceph.ceph_manager:waiting on admin_socket for osd-1, ['config', 'set', 'filestore_inject_stall', '3']
2016-01-30T16:25:11.213 INFO:teuthology.orchestra.run.smithi020:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config set filestore_inject_stall 3'
2016-01-30T16:25:11.370 INFO:teuthology.orchestra.run.smithi020.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2016-01-30T16:25:11.378 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 650, in wrapper
    return func(self)
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 704, in do_thrash
    self.choose_action()()
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 628, in <lambda>
    False),
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 501, in inject_pause
    self.ceph_manager.set_config(the_one, **{conf_key: duration})
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 1078, in set_config
    ['config', 'set', str(k), str(v)])
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 1053, in wait_run_admin_socket
    id=service_id))
Exception: timed out waiting for admin_socket to appear after osd.1 restart

2016-01-30T16:25:11.379 CRITICAL:root:  File "gevent/core.pyx", line 348, in gevent.core.loop.handle_error (gevent/gevent.core.c:6053)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 300, in handle_error
    self.print_exception(context, type, value, tb)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 323, in print_exception
    traceback.print_exception(type, value, tb)
  File "/usr/lib/python2.7/traceback.py", line 124, in print_exception
    _print(file, 'Traceback (most recent call last):')
  File "/usr/lib/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator)

2016-01-30T16:25:11.379 CRITICAL:root:IOError
2016-01-30T18:24:33.394 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 02:24:33.380087 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 5774..6486) lease_expire from mon.0 172.21.15.20:6789/0 is 0.891053 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T18:48:07.511 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 02:48:07.490680 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 6778..7459) lease_expire from mon.0 172.21.15.20:6789/0 is 5.119297 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T18:49:14.766 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 02:49:14.763190 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 6778..7503) lease_expire from mon.0 172.21.15.20:6789/0 is 0.812451 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:09:01.670 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:09:01.610067 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 7782..8354) lease_expire from mon.0 172.21.15.20:6789/0 is 3.175789 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:13:36.036 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:13:36.033045 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 8033..8554) lease_expire from mon.0 172.21.15.20:6789/0 is 2.549258 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:37:38.148 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:37:38.134380 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 8786..9493) lease_expire from mon.0 172.21.15.20:6789/0 is 0.822134 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:38:11.403 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:38:11.400640 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 8786..9518) lease_expire from mon.0 172.21.15.20:6789/0 is 2.541366 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:38:45.790 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:38:45.786853 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 9037..9543) lease_expire from mon.0 172.21.15.20:6789/0 is 4.030940 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:47:55.001 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:47:54.989453 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 9288..9914) lease_expire from mon.0 172.21.15.20:6789/0 is 2.506908 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:54:07.022 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:54:07.020086 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 9539..10158) lease_expire from mon.0 172.21.15.20:6789/0 is 2.266840 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:57:22.848 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:57:22.845987 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 9539..10276) lease_expire from mon.0 172.21.15.20:6789/0 is 2.696128 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T20:37:49.765 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 04:37:49.750290 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 11296..11840) lease_expire from mon.0 172.21.15.20:6789/0 is 0.806196 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T20:46:16.034 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 04:46:16.032108 7f69cef42700 -1 mon.c@2(peon).data_health(104) reached critical levels of available space on local monitor storage -- shutdown!
2016-01-30T20:46:16.059 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 04:46:16.057651 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 11547..12172) lease_expire from mon.0 172.21.15.20:6789/0 is 0.175679 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T20:46:16.072 INFO:tasks.ceph.mon.a.smithi020.stderr:2016-01-31 04:46:16.070062 7f20811a9700 -1 mon.a@0(leader).data_health(104) reached critical levels of available space on local monitor storage -- shutdown!
2016-01-30T20:46:16.084 INFO:tasks.ceph.mon.a.smithi020.stderr:2016-01-31 04:46:16.082600 7f20809a8700 -1 mon.a@0(leader) e1 *** Got Signal Interrupt ***
2016-01-30T20:46:16.085 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 04:46:16.082632 7f69ce741700 -1 mon.c@2(peon) e1 *** Got Signal Interrupt ***
2016-01-30T20:58:44.493 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.493 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.493 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.493 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device

I'm filing this against ceph because this seems like an excessive amount of logging.
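Ceph's own logging lives in its C++ code, and this ticket tracks changing that behaviour; purely to illustrate what is being asked for, here is a minimal Python sketch (the class name, parameters, and fallback policy are hypothetical, not Ceph's implementation) of a file logger that throttles its stderr complaints when the log file becomes unwritable, instead of emitting one "problem writing to ..." line per dropped message:

#!/usr/bin/env python3
"""Conceptual sketch only: rate-limit stderr complaints when writing to the
log file fails (e.g. ENOSPC), rather than complaining once per message."""

import sys
import time


class QuietFallbackFileLogger:
    """Hypothetical logger: appends lines to a file; if the write fails,
    it reports the failure to stderr at most once per complaint_interval
    seconds and counts the messages dropped in between."""

    def __init__(self, path, complaint_interval=30.0):
        self.path = path
        self.complaint_interval = complaint_interval
        self._last_complaint = None  # monotonic time of the last stderr complaint
        self._dropped = 0            # messages lost since the last complaint

    def log(self, message):
        try:
            # Re-open on every call so logging recovers once space frees up.
            with open(self.path, "a") as f:
                f.write(message + "\n")
        except OSError as e:
            self._dropped += 1
            now = time.monotonic()
            if (self._last_complaint is None
                    or now - self._last_complaint >= self.complaint_interval):
                sys.stderr.write("problem writing to %s: %s (%d message(s) dropped)\n"
                                 % (self.path, e, self._dropped))
                self._last_complaint = now
                self._dropped = 0


if __name__ == "__main__":
    # Pointing at an unwritable path simulates the full-disk case: only one
    # stderr complaint appears even though every write fails.
    logger = QuietFallbackFileLogger("/proc/nonexistent/ceph-client.log")
    for i in range(10000):
        logger.log("op %d" % i)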


Related issues: 1 (0 open, 1 closed)

Copied to Ceph - Backport #15081: hammer: When logging to a file fails, ceph logs excessively to stderr (Resolved, Nathan Cutler)
#1

Updated by Sage Weil about 8 years ago

  • Assignee set to Sage Weil
  • Priority changed from Normal to Urgent
#2

Updated by Sage Weil about 8 years ago

  • Status changed from New to Fix Under Review
#3

Updated by Nathan Cutler about 8 years ago

  • Backport set to hammer
#4

Updated by Sage Weil about 8 years ago

  • Status changed from Fix Under Review to Pending Backport
#5

Updated by Nathan Cutler about 8 years ago

  • Copied to Backport #15081: hammer: When logging to a file fails, ceph logs excessively to stderr added
#6

Updated by Nathan Cutler over 7 years ago

  • Status changed from Pending Backport to Resolved