Bug #45331
Segmentation fault (status: open)
Description
I've recently set up a Ceph cluster using cephadm, but noticed that manager daemons sometimes die 2-5 times a day due to a segmentation fault:
Apr 29 06:04:33 beta dockerd-current[7064]: *** Caught signal (Segmentation fault) **
Apr 29 06:04:33 beta dockerd-current[7064]: in thread 7fb54bb0a700 thread_name:mgr-fin
Apr 29 06:04:33 beta dockerd-current[7064]: ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)
Apr 29 06:04:33 beta dockerd-current[7064]: 1: (()+0x12dc0) [0x7fb5acc02dc0]
Apr 29 06:04:33 beta dockerd-current[7064]: 2: (PyDict_SetItem()+0x2be) [0x7fb5ade9195e]
Apr 29 06:04:33 beta dockerd-current[7064]: 3: (PyFormatter::dump_pyobject(std::basic_string_view<char, std::char_traits<char> >, _object*)+0x61) [0x56491338ebc1]
Apr 29 06:04:33 beta dockerd-current[7064]: 4: (LogEntry::dump(ceph::Formatter*) const+0x5a0) [0x7fb5ae5ea6c0]
Apr 29 06:04:33 beta dockerd-current[7064]: 5: (ActivePyModule::notify_clog(LogEntry const&)+0x2f0) [0x5649132e5780]
Apr 29 06:04:33 beta dockerd-current[7064]: 6: (Context::complete(int)+0xd) [0x5649132f3d9d]
Apr 29 06:04:33 beta dockerd-current[7064]: 7: (Finisher::finisher_thread_entry()+0x1a5) [0x7fb5ae5bb7d5]
Apr 29 06:04:33 beta dockerd-current[7064]: 8: (()+0x82de) [0x7fb5acbf82de]
Apr 29 06:04:33 beta dockerd-current[7064]: 9: (clone()+0x43) [0x7fb5ab78b133]
Apr 29 06:04:33 beta dockerd-current[7064]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Josh Durgin almost 4 years ago
Could you attach a coredump of this crash? That would help us isolate where it is happening.
Updated by Fedor Gusev almost 4 years ago
Is there anything specific I need to do to get a coredump under docker? I've already set the core file size to unlimited for this process (in the host system) and changed /proc/sys/kernel/core_pattern to "/var/lib/ceph/crash/core.%e.%p.%h.%t" so that the file ends up in the host filesystem.
Updated by Josh Durgin almost 4 years ago
I'm not sure if any further setup is required for docker. You can test by killing the process with SIGABRT; that should generate a coredump.
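The SIGABRT check can be verified even before hunting for the core file, since a process terminated by signal 6 exits with status 128 + 6 = 134. A minimal, hypothetical sketch (the `/var/lib/ceph/crash` path is taken from the comment above; the throwaway child shell stands in for ceph-mgr):

```shell
# Hypothetical sanity check: a process terminated by SIGABRT (signal 6) exits
# with status 128 + 6 = 134, so signal delivery can be confirmed before looking
# for a core file under /var/lib/ceph/crash.
sh -c 'kill -ABRT $$'
echo "exit status: $?"    # 134 when SIGABRT terminated the child shell
```

If the exit status is 134 but no core file appears, the problem is the core_pattern path resolution rather than signal handling.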
Updated by Fedor Gusev almost 4 years ago
When I run a bash shell as the ceph user inside the mgr container and kill it with SIGABRT, the core file is created as expected. However, when I try to kill the ceph-mgr process, no core file is created.
Updated by Josh Durgin almost 4 years ago
It looks like you'd need to map the path from the host into the container, e.g. with the argument to docker:
-v /var/lib/ceph/crash:/var/lib/ceph/crash
This is because the path is resolved within the container's filesystem: https://github.com/moby/moby/issues/11740#issuecomment-618223314
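As an illustration only (the container name, image tag, and daemon arguments below are assumptions, not taken from this cluster), the bind mount would sit in the docker invocation like this:

```shell
# Hypothetical sketch: bind-mount the host crash directory into the container so
# the kernel's core_pattern path (/var/lib/ceph/crash/core.%e.%p.%h.%t) resolves
# to the same directory inside the container's filesystem.
docker run -d \
  --name ceph-mgr-beta \
  -v /var/lib/ceph/crash:/var/lib/ceph/crash \
  ceph/ceph:v15.2.1 \
  ceph-mgr -f --id beta
```

The key point is only the `-v` line; everything else depends on how cephadm launched the daemon.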
Updated by Fedor Gusev almost 4 years ago
I think I managed to get a coredump for one of the segfaults. The file is 1.4 GB.
Updated by Josh Durgin almost 4 years ago
Can you upload it with the ceph-post-file command? IIRC it works with large files like this (the large size for a coredump is expected).
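For reference, the upload is a single command; this is a hedged sketch where `<corefile>` is a placeholder for the actual dump name (not filled in here) and the `-d` description flag follows the tool's usual usage:

```shell
# Hypothetical invocation; <corefile> is a placeholder for the real dump name.
# ceph-post-file prints the generated file ID to paste back into the tracker.
ceph-post-file -d "mgr segfault coredump, bug #45331" /var/lib/ceph/crash/<corefile>
```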
Updated by Fedor Gusev almost 4 years ago
Done, the file ID is 84f2c338-ceaa-4ebb-ad86-e6a12b9e5d15
Updated by Patrick Donnelly almost 4 years ago
- Related to Bug #46216: mon: log entry with garbage generated by bad memory access added
Updated by Patrick Donnelly almost 4 years ago
Hi Fedor, I believe this problem is the same as #46216. Can you check whether the cluster log shows a gibberish MDS name in this message:
2020-06-25T23:52:34.307+0000 7f60f50e5700 0 log_channel(cluster) log [INF] : MDS daemon mds.a.senta03.wtslvp is removed because it is dead or otherwise unavailable.
Updated by Fedor Gusev almost 4 years ago
Yes, there are some messages like this:
Jun 20 09:37:28 alpha bash[15635]: mma.vuphmc is removed because it is dead or otherwise unavailable.
Jun 20 09:37:42 alpha bash[15635]: mma.vuphmc is removed because it is dead or otherwise unavailable.
Jun 20 12:07:04 alpha bash[15635]: ta.wdjisf is removed because it is dead or otherwise unavailable.
Jun 20 12:07:12 alpha bash[15635]: ta.wdjisf is removed because it is dead or otherwise unavailable.
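Since the daemon name is the mangled part but the tail of the message is fixed text, a fixed-string grep will find every occurrence. A hypothetical sketch, run here against inlined sample lines (on a real host the input would come from the journal instead):

```shell
# Hypothetical filter: the message tail is fixed, so grep -F isolates every
# occurrence no matter how badly the daemon name was mangled.
# On the host this would read from journalctl rather than printf.
needle='is removed because it is dead or otherwise unavailable.'
printf '%s\n' \
  'Jun 20 09:37:28 alpha bash[15635]: mma.vuphmc is removed because it is dead or otherwise unavailable.' \
  'Jun 20 09:37:30 alpha bash[15635]: unrelated log line' \
  | grep -cF "$needle"    # prints 1: only the first sample line matches
```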