Bug #51816

monitor segfault on startup in container

Added by Dimitri Savineau 2 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The ceph-container project runs a demo container to validate the container build; it starts a few daemons and checks that everything works.

Since yesterday (though the issue may be older), starting the ceph-mon process generates a segfault.

See the attached log and the crash metadata below:

{
    "crash_id": "2021-07-22T21:46:40.157690Z_a755c1b6-5a2e-400c-89d7-b395e4f3ea64",
    "timestamp": "2021-07-22T21:46:40.157690Z",
    "process_name": "ceph-mon",
    "entity_name": "mon.e2bd9d0a1c37",
    "ceph_version": "17.0.0-6242-gd09a0461",
    "utsname_hostname": "e2bd9d0a1c37",
    "utsname_sysname": "Linux",
    "utsname_release": "5.10.0-0.bpo.7-amd64",
    "utsname_version": "#1 SMP Debian 5.10.40-1~bpo10+1 (2021-06-04)",
    "utsname_machine": "x86_64",
    "os_name": "CentOS Linux",
    "os_id": "centos",
    "os_version_id": "8",
    "os_version": "8",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7f8870fdcb20]",
        "/lib64/libc.so.6(+0x160805) [0x7f886fd66805]",
        "(StackStringBuf<4096ul>::xsputn(char const*, long)+0x2d8) [0x5576ebcd8ec8]",
        "(std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)+0x154) [0x7f8870680da4]",
        "(LogMonitor::log_external(LogEntry const&)+0xe90) [0x5576ebd64610]",
        "(LogMonitor::update_from_paxos(bool*)+0x19b9) [0x5576ebd6fb39]",
        "(PaxosService::refresh(bool*)+0x10e) [0x5576ebe4dd0e]",
        "(Monitor::refresh_from_paxos(bool*)+0x18c) [0x5576ebce684c]",
        "(Paxos::do_refresh()+0x57) [0x5576ebe40257]",
        "(Paxos::commit_finish()+0x753) [0x5576ebe494a3]",
        "(C_Committed::finish(int)+0x45) [0x5576ebe4d245]",
        "(Context::complete(int)+0xd) [0x5576ebd23efd]",
        "(MonitorDBStore::C_DoTransaction::finish(int)+0x98) [0x5576ebe4cf68]",
        "(Context::complete(int)+0xd) [0x5576ebd23efd]",
        "(Finisher::finisher_thread_entry()+0x18c) [0x7f8873593bdc]",
        "(Thread::_entry_func(void*)+0xd) [0x7f88735e7d4d]",
        "/lib64/libpthread.so.0(+0x814a) [0x7f8870fd214a]",
        "clone()" 
    ]
}
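The backtrace array in the metadata mixes raw addresses with demangled C++ frames; pulling out just the function names makes the failure path (LogMonitor::log_external down into StackStringBuf::xsputn) easier to read. A minimal sketch, assuming the metadata has been saved locally as JSON (the helper name and filtering heuristic are illustrative, not part of Ceph's tooling):

```python
import json
import re

def function_frames(meta_json):
    """Return the demangled function names from a Ceph crash meta blob.

    Frames with a symbol look like "(LogMonitor::log_external(...)+0xe90) [0x...]";
    frames without one are bare "lib.so(+0xOFFSET) [0x...]" entries and are skipped.
    """
    meta = json.loads(meta_json)
    names = []
    for frame in meta.get("backtrace", []):
        # Capture the text between the opening "(" and the "+0x..." offset.
        m = re.match(r"\((.+)\+0x[0-9a-f]+\)", frame)
        if m:
            names.append(m.group(1))
    return names

sample = ('{"backtrace": ["/lib64/libpthread.so.0(+0x12b20) [0x7f88]", '
          '"(LogMonitor::log_external(LogEntry const&)+0xe90) [0x5576]"]}')
print(function_frames(sample))
# -> ['LogMonitor::log_external(LogEntry const&)']
```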

I can't determine exactly when the issue started, but I was able to test a few container images.

  • Failure
    ceph version 17.0.0-6242-gd09a0461 (d09a04617e50c96691fe379f34c1786212ae59ac) quincy (dev)
    ceph version 17.0.0-6216-g2c528248 (2c528248dfd933ff6011841ac1e2993789244521) quincy (dev)
  • Last successful version (~10 days ago)
    ceph version 17.0.0-5893-g3e2c8e94 (3e2c8e94fb9fb8421a08ca425b14833a981565a6) quincy (dev)

The ceph container is based on the CentOS 8.4 distribution.

log - crash log (61.2 KB) Dimitri Savineau, 07/22/2021 10:18 PM

History

#1 Updated by Neha Ojha 2 months ago

  • Assignee set to Sage Weil

This is related to https://github.com/ceph/ceph/pull/42014. I know there have been a few follow-on fixes for this PR; are you testing with the latest master?

#2 Updated by Dimitri Savineau 2 months ago

I tested yesterday with the latest master build available on shaman: "ceph version 17.0.0-6285-gc011af69 (c011af69030be50af1f5b23ecedb670f6cde2d7c) quincy (dev)", without success.

I will test again next Monday.

#3 Updated by Dimitri Savineau 2 months ago

Still the same issue with the latest shaman build [1]

ceph version 17.0.0-6387-gf0027c05 (f0027c05c6d9de386fe963c04791f8b64dcaf290) quincy (dev)

[1] https://shaman.ceph.com/builds/ceph/master/f0027c05c6d9de386fe963c04791f8b64dcaf290/

#4 Updated by Neha Ojha about 2 months ago

  • Priority changed from Normal to Urgent

#5 Updated by Yaarit Hatuka about 2 months ago

"stack_sig" key is missing from the crash metadata; do you see it in any other similar crashes?

#6 Updated by Dimitri Savineau about 2 months ago

I assume the "stack_sig" key is only available from the ceph crash info command, right?

The issue here is that ceph-mon crashes at startup, so the crash dump isn't uploaded.

That's why the information comes only from the /var/lib/ceph/crash/<crash timestamp>/{meta,log} files.
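When the "stack_sig" key is absent from the meta file, a signature can still be derived offline from the backtrace. The idea is to strip the run-specific address tokens before hashing, so two crashes with the same frames map to the same signature. A rough sketch of that idea (Ceph's own normalization may differ in detail):

```python
import hashlib
import re

def stack_signature(backtrace):
    """Hash a backtrace into a stable signature.

    Load addresses change between runs, so remove every "0x..." token
    (both the "+0xOFFSET" in a frame and the "[0xADDR]" suffix) before
    hashing; the surviving function names determine the signature.
    Illustrative only, not Ceph's exact algorithm.
    """
    h = hashlib.sha256()
    for frame in backtrace:
        normalized = re.sub(r"0x[0-9a-f]+", "", frame)
        h.update(normalized.encode())
    return h.hexdigest()

# The two backtraces in this ticket have the same frames but different
# offsets/addresses; after normalization they hash identically.
bt1 = ["(LogMonitor::log_external(LogEntry const&)+0xe90) [0x5576ebd64610]"]
bt2 = ["(LogMonitor::log_external(LogEntry const&)+0xe37) [0x564b10818607]"]
print(stack_signature(bt1) == stack_signature(bt2))  # True
```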

#7 Updated by Sage Weil about 2 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 42528

#8 Updated by Sridhar Seshasayee about 2 months ago

I am observing this very early when running CBT tests and when running qa/standalone tests. Bringing up ceph-mon generates the segfault:

$ sudo /usr/local/bin/ceph-run sudo sh -c "ulimit -n 16384 && ulimit -c unlimited && exec /usr/local/bin/ceph-mon -c /tmp/cbt/ceph/ceph.conf -i a --keyring=/tmp/cbt/ceph/client.admin/keyring --pid-file=/tmp/cbt/ceph/pid/sseshasa@incerta06.pid" 
*** Caught signal (Segmentation fault) **
 in thread 7f340958a700 thread_name:ceph-mon
 ceph version 17.0.0-6387-gf0027c05c6d (f0027c05c6d9de386fe963c04791f8b64dcaf290) quincy (dev)
 1: /lib64/libpthread.so.0(+0x12dd0) [0x7f340644bdd0]
 2: /lib64/libc.so.6(+0x15ddb5) [0x7f34051d5db5]
 3: (StackStringBuf<4096ul>::xsputn(char const*, long)+0x2d8) [0x564b10795278]
 4: (std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)+0x154) [0x7f3405aefda4]
 5: (LogMonitor::log_external(LogEntry const&)+0xe37) [0x564b10818607]
 6: (LogMonitor::update_from_paxos(bool*)+0x1711) [0x564b108227d1]
 7: (PaxosService::refresh(bool*)+0x10a) [0x564b108f716a]
 8: (Monitor::refresh_from_paxos(bool*)+0x17c) [0x564b107a151c]
 9: (Monitor::init_paxos()+0xfc) [0x564b107a17cc]
 10: (Monitor::preinit()+0xbb9) [0x564b107cae79]
 11: main()
 12: __libc_start_main()
 13: _start()
2021-07-27T15:25:50.630+0000 7f340958a700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f340958a700 thread_name:ceph-mon

#9 Updated by Kefu Chai about 2 months ago

  • Status changed from Fix Under Review to Resolved
