Documentation #44354 (closed)

cephadm: Log messages are missing

Added by Nathan Cutler about 4 years ago. Updated about 3 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: cephadm
Target version: -
% Done: 0%
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

There is a bug in master/octopus which causes OSDs to crash in msgr V2 without any load on the cluster. But that's not what this bug is about.

While investigating that bug, I found that logging is broken.

The first indication of a crash was a health warning:

[WRN] RECENT_CRASH: 2 daemons have recently crashed
    osd.0 crashed on host node1 at 2020-02-28T13:21:42.954520Z
    osd.0 crashed on host node1 at 2020-02-28T14:31:24.008213Z
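
(For reference, the same crash records can also be inspected directly via the crash module, for example:

admin:~ # ceph crash ls
admin:~ # ceph crash info <crash-id>

where <crash-id> is whatever ID "ceph crash ls" reports for the crash in question.)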

So I set out to find those two crashes in the osd.0 log, in order to see what immediately preceded them.

According to https://docs.ceph.com/docs/master/cephadm/administration/#gathering-log-files I should do:

cephadm logs --name <name-of-daemon>

So I tried that on the "admin node" and got:

admin:~ # cephadm logs --name osd.0
ERROR: must pass --fsid to specify cluster
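
(For reference, the fsid being asked for here is the cluster fsid; it can be looked up, for example, with:

admin:~ # ceph fsid
2757319a-5a1f-11ea-8132-525400fc4b2d

or read from the output of "cephadm ls" on any host of the cluster.)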

OK, so I did:

admin:~ # cephadm logs --fsid 2757319a-5a1f-11ea-8132-525400fc4b2d --name osd.0
-- Logs begin at Fri 2020-02-28 12:27:51 CET, end at Fri 2020-02-28 17:35:50 CET. --
-- No entries --

This corresponds to what I get when I run the following command on the admin node:

admin:~ # journalctl -u ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service
-- Logs begin at Fri 2020-02-28 12:27:51 CET, end at Fri 2020-02-28 17:35:50 CET. --
-- No entries --

However, if I go to the node where osd.0 is running, the command does provide meaningful output:

node1:~ # cephadm logs --name osd.0
...
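
(In other words, "cephadm logs" only reads the journal of the host it is run on, so you first have to find out which host the daemon lives on. One way to do that, for example, is via the orchestrator:

admin:~ # ceph orch ps | grep osd.0
osd.0  node1  running  ...

The exact columns vary by version; the point is only that the listing includes the host name.)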

And indeed I found the first crash in it:

Feb 28 14:21:42 node1 bash[17133]: *** Caught signal (Segmentation fault) **
Feb 28 14:21:42 node1 bash[17133]:  in thread 7eff6e156700 thread_name:msgr-worker-1
Feb 28 14:21:42 node1 bash[17133]:  ceph version 15.1.0-1372-gd5b155df9b (d5b155df9bdcf2ac2e364075f3fc8b1bf1d8cf22) octop>
Feb 28 14:21:42 node1 bash[17133]:  1: (()+0x132f0) [0x7eff71e9a2f0]
Feb 28 14:21:42 node1 bash[17133]:  2: (AsyncConnection::_stop()+0x9c) [0x55dc7385b20c]
Feb 28 14:21:42 node1 bash[17133]:  3: (ProtocolV2::stop()+0x8b) [0x55dc7388508b]
Feb 28 14:21:42 node1 bash[17133]:  4: (ProtocolV2::handle_existing_connection(boost::intrusive_ptr<AsyncConnection> cons>
Feb 28 14:21:42 node1 bash[17133]:  5: (ProtocolV2::handle_client_ident(ceph::buffer::v14_2_0::list&)+0xce1) [0x55dc7389f>
Feb 28 14:21:42 node1 bash[17133]:  6: (ProtocolV2::handle_frame_payload()+0x12b) [0x55dc7389f7fb]
Feb 28 14:21:42 node1 bash[17133]:  7: (ProtocolV2::handle_read_frame_dispatch()+0x150) [0x55dc7389fb40]
Feb 28 14:21:42 node1 bash[17133]:  8: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0>
Feb 28 14:21:42 node1 bash[17133]:  9: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x55dc73885d64]
Feb 28 14:21:42 node1 bash[17133]:  10: (AsyncConnection::process()+0x5fc) [0x55dc7385e2dc]
Feb 28 14:21:42 node1 bash[17133]:  11: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, s>
Feb 28 14:21:42 node1 bash[17133]:  12: (()+0xd1d968) [0x55dc736b7968]
Feb 28 14:21:42 node1 bash[17133]:  13: (()+0xd6bb0) [0x7eff7151cbb0]
Feb 28 14:21:42 node1 bash[17133]:  14: (()+0x84f9) [0x7eff71e8f4f9]
Feb 28 14:21:42 node1 bash[17133]:  15: (clone()+0x3f) [0x7eff70c36eef]

(Of course, it's not easy to correlate this with the timestamp from "ceph crash ls", because one is in UTC and the other in CET, but let's leave that aside for now.)
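
(A side note on that: journalctl can be asked to print UTC timestamps, which then line up directly with the "ceph crash ls" output, e.g.:

node1:~ # journalctl --utc -u ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service

This doesn't change anything about the missing-log problem itself.)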

The above is background. Here is the "meat" of the bug report:

After the backtrace, the log says:

--- begin dump of recent events ---
debug  -9999> 2020-02-28T13:21:13.192+0000 7eff6e156700 10 monclient: get_auth_request con 0x55dc81befc00 auth_method 0
...

This is expected to be 10000 lines of recent events, right? But no, it only gets as far as:

debug  -7334> 2020-02-28T13:21:21.104+0000 7eff6e156700 10 monclient: get_auth_request con 0x55dc8153ac00 auth_method 0

At that point the crash dump is cut off (maybe it doesn't fit into some buffer somewhere?):

Feb 28 14:21:43 node1 bash[17133]: debug  -7334> 2020-02-28T13:21:21.104+0000 7eff6e156700 10 monclient: get_auth_request con 0x55dc8153ac00 auth_method 0
Feb 28 14:21:44 node1 systemd[1]: ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service: Main process exited, code=exited, status=1/FAILURE
Feb 28 14:21:45 node1 systemd[1]: ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service: Unit entered failed state.
Feb 28 14:21:45 node1 systemd[1]: ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service: Failed with result 'exit-code'.
Feb 28 14:21:55 node1 systemd[1]: ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service: Service RestartSec=10s expired, scheduling restart.
Feb 28 14:21:55 node1 systemd[1]: Stopped Ceph osd.0 for 2757319a-5a1f-11ea-8132-525400fc4b2d.
Feb 28 14:21:55 node1 systemd[1]: Starting Ceph osd.0 for 2757319a-5a1f-11ea-8132-525400fc4b2d...
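
(Pure speculation on why the dump is truncated: one plausible culprit is systemd-journald's rate limiting, which silently drops messages when a unit logs a large burst in a short interval; another is simply that the process dies before the dump is fully flushed. Neither has been verified here, but the journald side can be checked roughly like this:

node1:~ # journalctl -u systemd-journald | grep -i suppressed
node1:~ # grep -i ratelimit /etc/systemd/journald.conf
#RateLimitIntervalSec=30s
#RateLimitBurst=10000

The two RateLimit* settings shown are the usual commented-out defaults; raising RateLimitBurst, or setting it to 0 to disable rate limiting, and then restarting systemd-journald would rule this in or out.)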