Bug #45331
Segmentation fault (status: open)
Description
I've recently set up a Ceph cluster using cephadm, but noticed that manager daemons sometimes die 2-5 times a day due to a segmentation fault:
Apr 29 06:04:33 beta dockerd-current[7064]: *** Caught signal (Segmentation fault) **
Apr 29 06:04:33 beta dockerd-current[7064]: in thread 7fb54bb0a700 thread_name:mgr-fin
Apr 29 06:04:33 beta dockerd-current[7064]: ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)
Apr 29 06:04:33 beta dockerd-current[7064]: 1: (()+0x12dc0) [0x7fb5acc02dc0]
Apr 29 06:04:33 beta dockerd-current[7064]: 2: (PyDict_SetItem()+0x2be) [0x7fb5ade9195e]
Apr 29 06:04:33 beta dockerd-current[7064]: 3: (PyFormatter::dump_pyobject(std::basic_string_view<char, std::char_traits<char> >, _object*)+0x61) [0x56491338ebc1]
Apr 29 06:04:33 beta dockerd-current[7064]: 4: (LogEntry::dump(ceph::Formatter*) const+0x5a0) [0x7fb5ae5ea6c0]
Apr 29 06:04:33 beta dockerd-current[7064]: 5: (ActivePyModule::notify_clog(LogEntry const&)+0x2f0) [0x5649132e5780]
Apr 29 06:04:33 beta dockerd-current[7064]: 6: (Context::complete(int)+0xd) [0x5649132f3d9d]
Apr 29 06:04:33 beta dockerd-current[7064]: 7: (Finisher::finisher_thread_entry()+0x1a5) [0x7fb5ae5bb7d5]
Apr 29 06:04:33 beta dockerd-current[7064]: 8: (()+0x82de) [0x7fb5acbf82de]
Apr 29 06:04:33 beta dockerd-current[7064]: 9: (clone()+0x43) [0x7fb5ab78b133]
Apr 29 06:04:33 beta dockerd-current[7064]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Josh Durgin almost 4 years ago
Could you attach a coredump of this crash? That would help us isolate where it is happening.
Updated by Fedor Gusev almost 4 years ago
Is there anything specific I need to do to get a coredump under docker? I've already set the core file size to unlimited for this process (in the host system) and changed /proc/sys/kernel/core_pattern to "/var/lib/ceph/crash/core.%e.%p.%h.%t" so that the file ends up in the host filesystem.
Updated by Josh Durgin almost 4 years ago
I'm not sure if any further setup is required for docker. You can test by killing the process with SIGABRT; that should generate a coredump.
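The SIGABRT check can be verified even before hunting for the core file, since a process terminated by signal 6 exits with status 128 + 6 = 134. A minimal, hypothetical sketch (the `/var/lib/ceph/crash` path is taken from the comment above; the throwaway child shell stands in for ceph-mgr):

```shell
# Hypothetical sanity check: a process terminated by SIGABRT (signal 6) exits
# with status 128 + 6 = 134, so signal delivery can be confirmed before looking
# for a core file under /var/lib/ceph/crash.
sh -c 'kill -ABRT $$'
echo "exit status: $?"    # 134 when SIGABRT terminated the child shell
```

If the exit status is 134 but no core file appears, the problem is the core_pattern path resolution rather than signal handling.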
Updated by Fedor Gusev almost 4 years ago
When I run a bash shell as the ceph user inside the mgr container and kill it with SIGABRT, the core file is created as expected. However, when I try to kill the ceph-mgr process, no core file is created.
Updated by Josh Durgin almost 4 years ago
It looks like you'd need to map the path from the host into the container, e.g. with the argument to docker:
-v /var/lib/ceph/crash:/var/lib/ceph/crash
This is because the path is resolved within the container's filesystem: https://github.com/moby/moby/issues/11740#issuecomment-618223314
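As an illustration only (the container name, image tag, and daemon arguments below are assumptions, not taken from this cluster), the bind mount would sit in the docker invocation like this:

```shell
# Hypothetical sketch: bind-mount the host crash directory into the container so
# the kernel's core_pattern path (/var/lib/ceph/crash/core.%e.%p.%h.%t) resolves
# to the same directory inside the container's filesystem.
docker run -d \
  --name ceph-mgr-beta \
  -v /var/lib/ceph/crash:/var/lib/ceph/crash \
  ceph/ceph:v15.2.1 \
  ceph-mgr -f --id beta
```

The key point is only the `-v` line; everything else depends on how cephadm launched the daemon.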
Updated by Fedor Gusev almost 4 years ago
I think I managed to get a coredump for one of the segfaults. The file is 1.4 GB.
Updated by Josh Durgin almost 4 years ago
Can you upload it with the ceph-post-file command? IIRC it works with large files like this (the large size for a coredump is expected).
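For reference, the upload is a single command; this is a hedged sketch where `<corefile>` is a placeholder for the actual dump name (not filled in here) and the `-d` description flag follows the tool's usual usage:

```shell
# Hypothetical invocation; <corefile> is a placeholder for the real dump name.
# ceph-post-file prints the generated file ID to paste back into the tracker.
ceph-post-file -d "mgr segfault coredump, bug #45331" /var/lib/ceph/crash/<corefile>
```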
Updated by Fedor Gusev almost 4 years ago
Done, the file ID is 84f2c338-ceaa-4ebb-ad86-e6a12b9e5d15
Updated by Patrick Donnelly almost 4 years ago
- Related to Bug #46216: mon: log entry with garbage generated by bad memory access added
Updated by Patrick Donnelly almost 4 years ago
Hi Fedor, I believe this problem is the same as #46216. Can you check whether the cluster log shows a gibberish MDS name in this message:
2020-06-25T23:52:34.307+0000 7f60f50e5700 0 log_channel(cluster) log [INF] : MDS daemon mds.a.senta03.wtslvp is removed because it is dead or otherwise unavailable.
Updated by Fedor Gusev almost 4 years ago
Yes, there are some messages like this:
Jun 20 09:37:28 alpha bash[15635]: mma.vuphmc is removed because it is dead or otherwise unavailable.
Jun 20 09:37:42 alpha bash[15635]: mma.vuphmc is removed because it is dead or otherwise unavailable.
Jun 20 12:07:04 alpha bash[15635]: ta.wdjisf is removed because it is dead or otherwise unavailable.
Jun 20 12:07:12 alpha bash[15635]: ta.wdjisf is removed because it is dead or otherwise unavailable.
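Since the daemon name is the mangled part but the tail of the message is fixed text, a fixed-string grep will find every occurrence. A hypothetical sketch, run here against inlined sample lines (on a real host the input would come from the journal instead):

```shell
# Hypothetical filter: the message tail is fixed, so grep -F isolates every
# occurrence no matter how badly the daemon name was mangled.
# On the host this would read from journalctl rather than printf.
needle='is removed because it is dead or otherwise unavailable.'
printf '%s\n' \
  'Jun 20 09:37:28 alpha bash[15635]: mma.vuphmc is removed because it is dead or otherwise unavailable.' \
  'Jun 20 09:37:30 alpha bash[15635]: unrelated log line' \
  | grep -cF "$needle"    # prints 1: only the first sample line matches
```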