Documentation #44354
cephadm: Log messages are missing (Closed)
Description
There is a bug in master/octopus which causes OSDs to crash in msgr V2 without any load on the cluster. But that's not what this bug is about.
While investigating that bug, I found that logging is broken.
The first indication of a crash was a health warning:
[WRN] RECENT_CRASH: 2 daemons have recently crashed
    osd.0 crashed on host node1 at 2020-02-28T13:21:42.954520Z
    osd.0 crashed on host node1 at 2020-02-28T14:31:24.008213Z
So I set about finding those two crashes in the osd.0 log, to see what immediately preceded them.
According to https://docs.ceph.com/docs/master/cephadm/administration/#gathering-log-files I should do:
cephadm logs --name <name-of-daemon>
So I tried that on the "admin node" and got:
admin:~ # cephadm logs --name osd.0
ERROR: must pass --fsid to specify cluster
OK, so I did:
admin:~ # cephadm logs --fsid 2757319a-5a1f-11ea-8132-525400fc4b2d --name osd.0
-- Logs begin at Fri 2020-02-28 12:27:51 CET, end at Fri 2020-02-28 17:35:50 CET. --
-- No entries --
This corresponds to what I get when I run the following command on the admin node:
admin:~ # journalctl -u ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service
-- Logs begin at Fri 2020-02-28 12:27:51 CET, end at Fri 2020-02-28 17:35:50 CET. --
-- No entries --
However, if I go to the node where osd.0 is running, the command does provide meaningful output:
node1:~ # cephadm logs --name osd.0
...
And indeed I found the first crash in it:
Feb 28 14:21:42 node1 bash[17133]: *** Caught signal (Segmentation fault) **
Feb 28 14:21:42 node1 bash[17133]: in thread 7eff6e156700 thread_name:msgr-worker-1
Feb 28 14:21:42 node1 bash[17133]: ceph version 15.1.0-1372-gd5b155df9b (d5b155df9bdcf2ac2e364075f3fc8b1bf1d8cf22) octop>
Feb 28 14:21:42 node1 bash[17133]: 1: (()+0x132f0) [0x7eff71e9a2f0]
Feb 28 14:21:42 node1 bash[17133]: 2: (AsyncConnection::_stop()+0x9c) [0x55dc7385b20c]
Feb 28 14:21:42 node1 bash[17133]: 3: (ProtocolV2::stop()+0x8b) [0x55dc7388508b]
Feb 28 14:21:42 node1 bash[17133]: 4: (ProtocolV2::handle_existing_connection(boost::intrusive_ptr<AsyncConnection> cons>
Feb 28 14:21:42 node1 bash[17133]: 5: (ProtocolV2::handle_client_ident(ceph::buffer::v14_2_0::list&)+0xce1) [0x55dc7389f>
Feb 28 14:21:42 node1 bash[17133]: 6: (ProtocolV2::handle_frame_payload()+0x12b) [0x55dc7389f7fb]
Feb 28 14:21:42 node1 bash[17133]: 7: (ProtocolV2::handle_read_frame_dispatch()+0x150) [0x55dc7389fb40]
Feb 28 14:21:42 node1 bash[17133]: 8: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0>
Feb 28 14:21:42 node1 bash[17133]: 9: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x55dc73885d64]
Feb 28 14:21:42 node1 bash[17133]: 10: (AsyncConnection::process()+0x5fc) [0x55dc7385e2dc]
Feb 28 14:21:42 node1 bash[17133]: 11: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, s>
Feb 28 14:21:42 node1 bash[17133]: 12: (()+0xd1d968) [0x55dc736b7968]
Feb 28 14:21:42 node1 bash[17133]: 13: (()+0xd6bb0) [0x7eff7151cbb0]
Feb 28 14:21:42 node1 bash[17133]: 14: (()+0x84f9) [0x7eff71e8f4f9]
Feb 28 14:21:42 node1 bash[17133]: 15: (clone()+0x3f) [0x7eff70c36eef]
(Of course, it's not easy to correlate this with the timestamp from "ceph crash ls", because one is in UTC and the other in CET, but let's leave that aside for now.)
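For the record, the two timestamps can be reconciled with a one-liner; this is a sketch assuming GNU date and the Europe/Berlin zone for CET:

```shell
# Convert the UTC timestamp reported by "ceph crash ls" into CET,
# so it can be matched against the journal's local-time entries.
TZ=Europe/Berlin date -d '2020-02-28T13:21:42Z' '+%b %d %H:%M:%S %Z'
# -> Feb 28 14:21:42 CET, i.e. the first backtrace above
```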
The above is background. Here is the "meat" of the bug report:
After the backtrace, the log says:
--- begin dump of recent events ---
debug  -9999> 2020-02-28T13:21:13.192+0000 7eff6e156700 10 monclient: get_auth_request con 0x55dc81befc00 auth_method 0
...
One would expect this to be 10000 lines of recent events, right? But no, it only gets up to
debug -7334> 2020-02-28T13:21:21.104+0000 7eff6e156700 10 monclient: get_auth_request con 0x55dc8153ac00 auth_method 0
At that point the crash dump is cut off (maybe it doesn't fit into some buffer somewhere?):
Feb 28 14:21:43 node1 bash[17133]: debug  -7334> 2020-02-28T13:21:21.104+0000 7eff6e156700 10 monclient: get_auth_request con 0x55dc8153ac00 auth_method 0
Feb 28 14:21:44 node1 systemd[1]: ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service: Main process exited, code=exited, status=1/FAILURE
Feb 28 14:21:45 node1 systemd[1]: ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service: Unit entered failed state.
Feb 28 14:21:45 node1 systemd[1]: ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service: Failed with result 'exit-code'.
Feb 28 14:21:55 node1 systemd[1]: ceph-2757319a-5a1f-11ea-8132-525400fc4b2d@osd.0.service: Service RestartSec=10s expired, scheduling restart.
Feb 28 14:21:55 node1 systemd[1]: Stopped Ceph osd.0 for 2757319a-5a1f-11ea-8132-525400fc4b2d.
Feb 28 14:21:55 node1 systemd[1]: Starting Ceph osd.0 for 2757319a-5a1f-11ea-8132-525400fc4b2d...
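One guess at the "buffer somewhere": journald rate-limits messages per service by default and silently drops anything beyond the burst limit, and a 10000-line crash dump arriving at once is exactly such a burst. If that is the cause (an assumption, not verified here), raising the limits in /etc/systemd/journald.conf would be a sketch of a workaround (the values below are illustrative, not recommendations):

```ini
[Journal]
# Allow larger bursts before journald starts dropping messages;
# defaults are distribution-dependent.
RateLimitIntervalSec=30s
RateLimitBurst=20000
```

followed by "systemctl restart systemd-journald".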
Updated by Nathan Cutler about 4 years ago
https://github.com/ceph/ceph/pull/33627 is an attempt to clarify the cephadm log gathering docs; it is related to (but does not fix) this bug.
Updated by Sebastian Wagner about 4 years ago
- Subject changed from Log messages are missing to cephadm: Log messages are missing
- Category set to cephadm
Updated by Sebastian Wagner almost 4 years ago
- Tracker changed from Bug to Documentation
Updated by Sebastian Wagner almost 4 years ago
We need this in the docs:
Create the /var/log/journal directory, so journal logs will be persisted.
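Concretely, something along these lines (a sketch; systemd-tmpfiles applies the ownership/ACLs that a plain mkdir would not, and --flush moves the volatile journal from /run/log/journal onto disk):

```shell
# Make the persistent journal directory and let systemd set it up properly
mkdir -p /var/log/journal
systemd-tmpfiles --create --prefix /var/log/journal
# Move the current in-memory journal to /var/log/journal
journalctl --flush
```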
Updated by Nathan Cutler almost 4 years ago
Indeed, this was most probably caused by "Storage=auto" (the default) and non-existent /var/log/journal
Now, it probably makes sense to let systemd create this directory itself (if we told users to create it by hand, who knows what permissions/ownership it would end up with?), so we might instead recommend that users do something like:
1. edit /etc/systemd/journald.conf: change #Storage=auto to Storage=persistent
2. systemctl restart systemd-journald
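The two steps above as a sketch (the sed invocation assumes the stock commented-out #Storage=auto line is present in the file):

```shell
# 1. Switch journald to persistent storage
sed -i 's/^#\?Storage=.*/Storage=persistent/' /etc/systemd/journald.conf
# 2. Restart journald so the change takes effect
systemctl restart systemd-journald
```

With Storage=persistent, journald creates /var/log/journal itself with the correct ownership.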
Updated by Sebastian Wagner about 3 years ago
- Status changed from New to Duplicate
Fixed by using journald as the log driver.