Documentation #44354

cephadm: Log messages are missing

Added by Nathan Cutler about 4 years ago. Updated about 3 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: cephadm
Target version: -
% Done: 0%
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

There is a bug in master/octopus which causes OSDs to crash in msgr V2 without any load on the cluster. But that's not what this bug is about.

While investigating that bug, I found that logging is broken.

The first indication of a crash was a health warning:

[WRN] RECENT_CRASH: 2 daemons have recently crashed
    osd.0 crashed on host node1 at 2020-02-28T13:21:42.954520Z
    osd.0 crashed on host node1 at 2020-02-28T14:31:24.008213Z
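
Those crash reports can also be inspected directly via the crash module; <crash-id> below is a placeholder for one of the IDs printed by "ceph crash ls":

admin:~ # ceph crash ls
admin:~ # ceph crash info <crash-id>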

So I set out to find those two crashes in the osd.0 log, to see what immediately preceded them.

According to https://docs.ceph.com/docs/master/cephadm/administration/#gathering-log-files I should do:

cephadm logs --name <name-of-daemon>

So I tried that on the "admin node" and got:

admin:~ # cephadm logs --name osd.0
ERROR: must pass --fsid to specify cluster

OK, so I did:

admin:~ # cephadm logs --fsid 2757319a-5a1f-11ea-8132-525400fc4b2d --name osd.0
-- Logs begin at Fri 2020-02-28 12:27:51 CET, end at Fri 2020-02-28 17:35:50 CET. --
-- No entries --
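
For reference, the fsid passed here can be obtained with "ceph fsid" on any node with a working ceph CLI (or read from the output of "cephadm ls", which lists each daemon together with its cluster fsid):

admin:~ # ceph fsid
2757319a-5a1f-11ea-8132-525400fc4b2d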

This corresponds to what I get when I run the following command on the admin node:

admin:~ # journalctl -u ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service
-- Logs begin at Fri 2020-02-28 12:27:51 CET, end at Fri 2020-02-28 17:35:50 CET. --
-- No entries --

However, if I go to the node where osd.0 is running, the command does provide meaningful output:

node1:~ # cephadm logs --name osd.0
...

And indeed I found the first crash in it:

Feb 28 14:21:42 node1 bash[17133]: *** Caught signal (Segmentation fault) **
Feb 28 14:21:42 node1 bash[17133]:  in thread 7eff6e156700 thread_name:msgr-worker-1
Feb 28 14:21:42 node1 bash[17133]:  ceph version 15.1.0-1372-gd5b155df9b (d5b155df9bdcf2ac2e364075f3fc8b1bf1d8cf22) octop>
Feb 28 14:21:42 node1 bash[17133]:  1: (()+0x132f0) [0x7eff71e9a2f0]
Feb 28 14:21:42 node1 bash[17133]:  2: (AsyncConnection::_stop()+0x9c) [0x55dc7385b20c]
Feb 28 14:21:42 node1 bash[17133]:  3: (ProtocolV2::stop()+0x8b) [0x55dc7388508b]
Feb 28 14:21:42 node1 bash[17133]:  4: (ProtocolV2::handle_existing_connection(boost::intrusive_ptr<AsyncConnection> cons>
Feb 28 14:21:42 node1 bash[17133]:  5: (ProtocolV2::handle_client_ident(ceph::buffer::v14_2_0::list&)+0xce1) [0x55dc7389f>
Feb 28 14:21:42 node1 bash[17133]:  6: (ProtocolV2::handle_frame_payload()+0x12b) [0x55dc7389f7fb]
Feb 28 14:21:42 node1 bash[17133]:  7: (ProtocolV2::handle_read_frame_dispatch()+0x150) [0x55dc7389fb40]
Feb 28 14:21:42 node1 bash[17133]:  8: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0>
Feb 28 14:21:42 node1 bash[17133]:  9: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x55dc73885d64]
Feb 28 14:21:42 node1 bash[17133]:  10: (AsyncConnection::process()+0x5fc) [0x55dc7385e2dc]
Feb 28 14:21:42 node1 bash[17133]:  11: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, s>
Feb 28 14:21:42 node1 bash[17133]:  12: (()+0xd1d968) [0x55dc736b7968]
Feb 28 14:21:42 node1 bash[17133]:  13: (()+0xd6bb0) [0x7eff7151cbb0]
Feb 28 14:21:42 node1 bash[17133]:  14: (()+0x84f9) [0x7eff71e8f4f9]
Feb 28 14:21:42 node1 bash[17133]:  15: (clone()+0x3f) [0x7eff70c36eef]

(Of course, it's not easy to correlate this with the timestamp from "ceph crash ls", because one is in UTC and the other in CET, but let's leave that aside for now.)
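
If one does want to correlate them, GNU date can convert the UTC timestamp from "ceph crash ls" into the journal's local time (Europe/Berlin is assumed here, to match the CET timestamps above):

admin:~ # TZ=Europe/Berlin date -d '2020-02-28T13:21:42Z' '+%b %d %H:%M:%S'
Feb 28 14:21:42

This matches the "Feb 28 14:21:42" lines in the journal excerpt above.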

The above is background. Here is the "meat" of the bug report:

After the backtrace, the log says:

--- begin dump of recent events ---
debug  -9999> 2020-02-28T13:21:13.192+0000 7eff6e156700 10 monclient: get_auth_request con 0x55dc81befc00 auth_method 0
...

This is expected to be 10000 lines of recent events, right? But no, it only gets as far as:

debug  -7334> 2020-02-28T13:21:21.104+0000 7eff6e156700 10 monclient: get_auth_request con 0x55dc8153ac00 auth_method 0

At that point the crash dump is cut off (maybe it doesn't fit into some buffer somewhere?):

Feb 28 14:21:43 node1 bash[17133]: debug  -7334> 2020-02-28T13:21:21.104+0000 7eff6e156700 10 monclient: get_auth_request con 0x55dc8153ac00 auth_method 0
Feb 28 14:21:44 node1 systemd[1]: ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service: Main process exited, code=exited, status=1/FAILURE
Feb 28 14:21:45 node1 systemd[1]: ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service: Unit entered failed state.
Feb 28 14:21:45 node1 systemd[1]: ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service: Failed with result 'exit-code'.
Feb 28 14:21:55 node1 systemd[1]: ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service: Service RestartSec=10s expired, scheduling restart.
Feb 28 14:21:55 node1 systemd[1]: Stopped Ceph osd.0 for 2757319a-5a1f-11ea-8132-525400fc4b2d.
Feb 28 14:21:55 node1 systemd[1]: Starting Ceph osd.0 for 2757319a-5a1f-11ea-8132-525400fc4b2d...
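
A hedged guess at the "buffer somewhere": journald rate-limits bursty services via RateLimitIntervalSec= and RateLimitBurst= in /etc/systemd/journald.conf and silently drops the excess, which would explain a recent-events dump that stops partway through. This can be checked by counting the dumped lines and looking for journald's own suppression notices:

node1:~ # journalctl -u ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service | grep -c 'debug  -'
node1:~ # journalctl -u systemd-journald | grep -i suppress

If rate limiting is the culprit, the second command shows lines like "Suppressed <N> messages from <unit>"; setting either option to 0 disables the rate limiting.
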
#1

Updated by Nathan Cutler about 4 years ago

https://github.com/ceph/ceph/pull/33627 is an attempt to clarify the cephadm log-gathering docs; it is related to, but does not fix, this bug.

#2

Updated by Sebastian Wagner about 4 years ago

  • Subject changed from Log messages are missing to cephadm: Log messages are missing
  • Category set to cephadm
#4

Updated by Sebastian Wagner almost 4 years ago

  • Tracker changed from Bug to Documentation
#5

Updated by Sebastian Wagner almost 4 years ago

We need this in the docs:

Create the /var/log/journal directory, so journal logs will be persisted.
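
A minimal sketch of what those docs could say, using systemd's own tooling so that the directory ends up with the expected ownership and ACLs:

node1:~ # mkdir -p /var/log/journal
node1:~ # systemd-tmpfiles --create --prefix /var/log/journal
node1:~ # systemctl restart systemd-journald

With the default Storage=auto, journald persists logs only if /var/log/journal exists, which is exactly the failure mode observed above.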

#6

Updated by Nathan Cutler almost 4 years ago

Indeed, this was most probably caused by "Storage=auto" (the default) combined with a non-existent /var/log/journal.

Now, it probably makes sense to let systemd create this directory (if we told users to create it themselves, who knows what permissions/ownership it would end up with?), so we might consider recommending that users do something like the following (see the snippet after this list):

1. edit /etc/systemd/journald.conf: change #Storage=auto to Storage=persistent
2. systemctl restart systemd-journald
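
That is, the relevant part of /etc/systemd/journald.conf would end up looking like:

[Journal]
Storage=persistent

With Storage=persistent, journald creates /var/log/journal itself if it is missing, which sidesteps the permissions/ownership question above.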

#7

Updated by Sebastian Wagner about 3 years ago

  • Status changed from New to Duplicate

Fixed by using journald as the log driver.
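
For context, this means the daemon containers are now run with the journald log driver, so their stdout/stderr lands directly in the journal. A minimal illustration with podman (assuming podman as the container runtime; the container name "demo" is made up):

node1:~ # podman run --rm --log-driver journald --name demo docker.io/library/alpine echo hello
node1:~ # journalctl CONTAINER_NAME=demo

The second command works because podman's journald driver tags each journal entry with a CONTAINER_NAME field.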
