Bug #14616

When logging to a file fails, ceph logs excessively to stderr

Added by Zack Cerza about 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
Start date: 02/02/2016
Due date:
% Done: 0%
Source: other
Tags:
Backport: hammer
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Seen in http://pulpito.front.sepia.ceph.com/haomai-2016-01-30_15:25:39-rados:thrash-wip-14566-distro-basic-smithi/52427/

The exception raised by ceph_manager should have caused the job to terminate (filed as #14615).

Instead, the job continued running and logged another 72GB after this point:

2016-01-30T16:24:55.708 INFO:teuthology.orchestra.run.smithi020:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config set filestore_inject_stall 3'
2016-01-30T16:24:55.795 INFO:tasks.rados.rados.0.smithi020.stdout:1298: done (15 left)
2016-01-30T16:24:55.795 INFO:tasks.rados.rados.0.smithi020.stdout:1299: copy_from oid 402 from oid 281 current snap is 91
2016-01-30T16:24:55.795 INFO:tasks.rados.rados.0.smithi020.stdout:1297:  finishing rollback tid 0 to smithi02021960-329
2016-01-30T16:24:55.795 INFO:tasks.rados.rados.0.smithi020.stdout:1297:  finishing rollback tid 1 to smithi02021960-329
2016-01-30T16:24:55.796 INFO:tasks.rados.rados.0.smithi020.stdout:1297:  finishing rollback tid 2 to smithi02021960-329
2016-01-30T16:24:55.796 INFO:tasks.rados.rados.0.smithi020.stdout:update_object_version oid 329 v 1210 (ObjNum 651 snap 44 seq_num 651) dirty exists
2016-01-30T16:24:55.797 INFO:tasks.rados.rados.0.smithi020.stdout:1297: done (15 left)
2016-01-30T16:24:55.797 INFO:tasks.rados.rados.0.smithi020.stdout:1300: delete oid 81 current snap is 91
2016-01-30T16:24:55.816 INFO:tasks.rados.rados.0.smithi020.stdout:1299:  finishing copy_from to smithi02021960-402
2016-01-30T16:24:55.816 INFO:tasks.rados.rados.0.smithi020.stdout:update_object_version oid 402 v 1084 (ObjNum 420 snap 0 seq_num 420) dirty exists
2016-01-30T16:24:55.845 INFO:tasks.rados.rados.0.smithi020.stdout:1299:  finishing copy_from racing read to smithi02021960-402
2016-01-30T16:24:55.860 INFO:teuthology.orchestra.run.smithi020.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2016-01-30T16:24:55.867 INFO:tasks.ceph.ceph_manager:waiting on admin_socket for osd-1, ['config', 'set', 'filestore_inject_stall', '3']
2016-01-30T16:25:00.868 INFO:teuthology.orchestra.run.smithi020:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config set filestore_inject_stall 3'
2016-01-30T16:25:01.033 INFO:teuthology.orchestra.run.smithi020.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2016-01-30T16:25:01.043 INFO:tasks.ceph.ceph_manager:waiting on admin_socket for osd-1, ['config', 'set', 'filestore_inject_stall', '3']
2016-01-30T16:25:06.045 INFO:teuthology.orchestra.run.smithi020:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config set filestore_inject_stall 3'
2016-01-30T16:25:06.204 INFO:teuthology.orchestra.run.smithi020.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2016-01-30T16:25:06.212 INFO:tasks.ceph.ceph_manager:waiting on admin_socket for osd-1, ['config', 'set', 'filestore_inject_stall', '3']
2016-01-30T16:25:11.213 INFO:teuthology.orchestra.run.smithi020:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config set filestore_inject_stall 3'
2016-01-30T16:25:11.370 INFO:teuthology.orchestra.run.smithi020.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2016-01-30T16:25:11.378 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 650, in wrapper
    return func(self)
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 704, in do_thrash
    self.choose_action()()
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 628, in <lambda>
    False),
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 501, in inject_pause
    self.ceph_manager.set_config(the_one, **{conf_key: duration})
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 1078, in set_config
    ['config', 'set', str(k), str(v)])
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip-async/tasks/ceph_manager.py", line 1053, in wait_run_admin_socket
    id=service_id))
Exception: timed out waiting for admin_socket to appear after osd.1 restart

2016-01-30T16:25:11.379 CRITICAL:root:  File "gevent/core.pyx", line 348, in gevent.core.loop.handle_error (gevent/gevent.core.c:6053)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 300, in handle_error
    self.print_exception(context, type, value, tb)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 323, in print_exception
    traceback.print_exception(type, value, tb)
  File "/usr/lib/python2.7/traceback.py", line 124, in print_exception
    _print(file, 'Traceback (most recent call last):')
  File "/usr/lib/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator)

2016-01-30T16:25:11.379 CRITICAL:root:IOError
2016-01-30T18:24:33.394 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 02:24:33.380087 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 5774..6486) lease_expire from mon.0 172.21.15.20:6789/0 is 0.891053 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T18:48:07.511 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 02:48:07.490680 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 6778..7459) lease_expire from mon.0 172.21.15.20:6789/0 is 5.119297 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T18:49:14.766 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 02:49:14.763190 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 6778..7503) lease_expire from mon.0 172.21.15.20:6789/0 is 0.812451 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:09:01.670 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:09:01.610067 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 7782..8354) lease_expire from mon.0 172.21.15.20:6789/0 is 3.175789 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:13:36.036 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:13:36.033045 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 8033..8554) lease_expire from mon.0 172.21.15.20:6789/0 is 2.549258 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:37:38.148 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:37:38.134380 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 8786..9493) lease_expire from mon.0 172.21.15.20:6789/0 is 0.822134 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:38:11.403 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:38:11.400640 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 8786..9518) lease_expire from mon.0 172.21.15.20:6789/0 is 2.541366 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:38:45.790 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:38:45.786853 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 9037..9543) lease_expire from mon.0 172.21.15.20:6789/0 is 4.030940 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:47:55.001 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:47:54.989453 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 9288..9914) lease_expire from mon.0 172.21.15.20:6789/0 is 2.506908 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:54:07.022 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:54:07.020086 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 9539..10158) lease_expire from mon.0 172.21.15.20:6789/0 is 2.266840 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T19:57:22.848 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 03:57:22.845987 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 9539..10276) lease_expire from mon.0 172.21.15.20:6789/0 is 2.696128 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T20:37:49.765 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 04:37:49.750290 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 11296..11840) lease_expire from mon.0 172.21.15.20:6789/0 is 0.806196 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T20:46:16.034 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 04:46:16.032108 7f69cef42700 -1 mon.c@2(peon).data_health(104) reached critical levels of available space on local monitor storage -- shutdown!
2016-01-30T20:46:16.059 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 04:46:16.057651 7f69cff44700 -1 mon.c@2(peon).paxos(paxos updating c 11547..12172) lease_expire from mon.0 172.21.15.20:6789/0 is 0.175679 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2016-01-30T20:46:16.072 INFO:tasks.ceph.mon.a.smithi020.stderr:2016-01-31 04:46:16.070062 7f20811a9700 -1 mon.a@0(leader).data_health(104) reached critical levels of available space on local monitor storage -- shutdown!
2016-01-30T20:46:16.084 INFO:tasks.ceph.mon.a.smithi020.stderr:2016-01-31 04:46:16.082600 7f20809a8700 -1 mon.a@0(leader) e1 *** Got Signal Interrupt ***
2016-01-30T20:46:16.085 INFO:tasks.ceph.mon.c.smithi020.stderr:2016-01-31 04:46:16.082632 7f69ce741700 -1 mon.c@2(peon) e1 *** Got Signal Interrupt ***
2016-01-30T20:58:44.493 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.493 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.493 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.493 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.494 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device
2016-01-30T20:58:44.495 INFO:tasks.rados.rados.0.smithi020.stderr:problem writing to /var/log/ceph/ceph-client.0.21960.log: (28) No space left on device

I'm filing this against ceph because this seems like an excessive amount of logging.
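The flood happens because teuthology captures each process's stderr into teuthology.log, and the pre-fix logging code emits a fresh stderr complaint for every log entry it fails to write to the file. A minimal sketch of that shape (illustrative only, not the actual src/log/Log.cc code; the function and parameter names are made up):

    #include <cerrno>
    #include <cstdio>
    #include <cstring>
    #include <unistd.h>

    // Illustrative pre-fix shape: every failed write to the log file produces
    // its own stderr complaint, so with /var/log full each log entry a busy
    // client generates becomes one more "problem writing to ..." line, which
    // teuthology then records into teuthology.log.
    void write_log_entry(int fd, const char *path, const char *buf, size_t len)
    {
      if (::write(fd, buf, len) < 0) {
        fprintf(stderr, "problem writing to %s: (%d) %s\n",
                path, errno, strerror(errno));
      }
    }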


Related issues

Copied to Ceph - Backport #15081: hammer: When logging to a file fails, ceph logs excessively to stderr (Resolved)

Associated revisions

Revision d9ac0474 (diff)
Added by Sage Weil about 3 years ago

log: do not repeat errors to stderr

If we get an error writing to the log, log it only once to stderr.
This avoids generating, say, 72 GB of ENOSPC errors in
teuthology.log when /var/log fills up.

Fixes: #14616
Signed-off-by: Sage Weil <>
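A minimal sketch of the approach the commit message describes: remember the last error that was reported and only write to stderr when an error first appears or changes. The class and member names below are illustrative assumptions, not the actual Log.cc implementation:

    #include <cerrno>
    #include <cstdio>
    #include <cstring>
    #include <string>
    #include <unistd.h>
    #include <utility>

    // Sketch of the "report the write error only once" idea (hypothetical
    // FileLogger/m_last_error names; not Ceph's actual code).
    class FileLogger {
      int m_fd;
      std::string m_path;
      int m_last_error = 0;   // last errno reported to stderr (0 = none)

    public:
      FileLogger(int fd, std::string path) : m_fd(fd), m_path(std::move(path)) {}

      void write_entry(const std::string &line) {
        if (::write(m_fd, line.data(), line.size()) < 0) {
          int err = errno;
          if (err != m_last_error) {
            // report once per distinct error, not once per failed log entry
            fprintf(stderr, "problem writing to %s: (%d) %s\n",
                    m_path.c_str(), err, strerror(err));
            m_last_error = err;
          }
        } else {
          m_last_error = 0;   // writes succeed again; a future failure is reported
        }
      }
    };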

Revision 04989695 (diff)
Added by Sage Weil over 2 years ago

log: do not repeat errors to stderr

If we get an error writing to the log, log it only once to stderr.
This avoids generating, say, 72 GB of ENOSPC errors in
teuthology.log when /var/log fills up.

Fixes: #14616
Signed-off-by: Sage Weil <>
(cherry picked from commit d9ac0474b864afda58a44b9012cca4bbc6aaf509)

Conflicts:
src/log/Log.cc (drop m_uid and m_gid which are not used in hammer;
order of do_stderr, do_syslog, do_fd conditional blocks is reversed in
hammer; drop irrelevant speed optimization code from
5bfe05aebfefdff9022f0eb990805758e0edb1dc)

History

#1 Updated by Sage Weil about 3 years ago

  • Assignee set to Sage Weil
  • Priority changed from Normal to Urgent

#2 Updated by Sage Weil about 3 years ago

  • Status changed from New to Need Review

#3 Updated by Nathan Cutler about 3 years ago

  • Backport set to hammer

#4 Updated by Sage Weil about 3 years ago

  • Status changed from Need Review to Pending Backport

#5 Updated by Nathan Cutler about 3 years ago

  • Copied to Backport #15081: hammer: When logging to a file fails, ceph logs excessively to stderr added

#6 Updated by Nathan Cutler over 2 years ago

  • Status changed from Pending Backport to Resolved
