Bug #51816
monitor segfault on startup in container
Description
The ceph-container project runs a demo container to validate the container build; it starts a few daemons and tests that everything is OK.
Since yesterday (though the issue seems older), starting the ceph-mon process generates a segfault.
See the log in the attachment and the crash metadata below:
{ "crash_id": "2021-07-22T21:46:40.157690Z_a755c1b6-5a2e-400c-89d7-b395e4f3ea64", "timestamp": "2021-07-22T21:46:40.157690Z", "process_name": "ceph-mon", "entity_name": "mon.e2bd9d0a1c37", "ceph_version": "17.0.0-6242-gd09a0461", "utsname_hostname": "e2bd9d0a1c37", "utsname_sysname": "Linux", "utsname_release": "5.10.0-0.bpo.7-amd64", "utsname_version": "#1 SMP Debian 5.10.40-1~bpo10+1 (2021-06-04)", "utsname_machine": "x86_64", "os_name": "CentOS Linux", "os_id": "centos", "os_version_id": "8", "os_version": "8", "backtrace": [ "/lib64/libpthread.so.0(+0x12b20) [0x7f8870fdcb20]", "/lib64/libc.so.6(+0x160805) [0x7f886fd66805]", "(StackStringBuf<4096ul>::xsputn(char const*, long)+0x2d8) [0x5576ebcd8ec8]", "(std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)+0x154) [0x7f8870680da4]", "(LogMonitor::log_external(LogEntry const&)+0xe90) [0x5576ebd64610]", "(LogMonitor::update_from_paxos(bool*)+0x19b9) [0x5576ebd6fb39]", "(PaxosService::refresh(bool*)+0x10e) [0x5576ebe4dd0e]", "(Monitor::refresh_from_paxos(bool*)+0x18c) [0x5576ebce684c]", "(Paxos::do_refresh()+0x57) [0x5576ebe40257]", "(Paxos::commit_finish()+0x753) [0x5576ebe494a3]", "(C_Committed::finish(int)+0x45) [0x5576ebe4d245]", "(Context::complete(int)+0xd) [0x5576ebd23efd]", "(MonitorDBStore::C_DoTransaction::finish(int)+0x98) [0x5576ebe4cf68]", "(Context::complete(int)+0xd) [0x5576ebd23efd]", "(Finisher::finisher_thread_entry()+0x18c) [0x7f8873593bdc]", "(Thread::_entry_func(void*)+0xd) [0x7f88735e7d4d]", "/lib64/libpthread.so.0(+0x814a) [0x7f8870fd214a]", "clone()" ] }
I can't really determine the exact moment the issue started to occur, but I was able to test a few container images.
- Failing versions
ceph version 17.0.0-6242-gd09a0461 (d09a04617e50c96691fe379f34c1786212ae59ac) quincy (dev)
ceph version 17.0.0-6216-g2c528248 (2c528248dfd933ff6011841ac1e2993789244521) quincy (dev)
- Last successful version (~10 days ago)
ceph version 17.0.0-5893-g3e2c8e94 (3e2c8e94fb9fb8421a08ca425b14833a981565a6) quincy (dev)
The ceph container is based on the CentOS 8.4 distro.
History
#1 Updated by Neha Ojha over 2 years ago
- Assignee set to Sage Weil
This is related to https://github.com/ceph/ceph/pull/42014. I know there have been a few follow-on fixes for this PR; are you testing with the latest master?
#2 Updated by Dimitri Savineau over 2 years ago
I tested yesterday with the latest master build available on shaman: "ceph version 17.0.0-6285-gc011af69 (c011af69030be50af1f5b23ecedb670f6cde2d7c) quincy (dev)", without success.
I will test again next Monday.
#3 Updated by Dimitri Savineau over 2 years ago
Still the same issue with the latest shaman build [1]
ceph version 17.0.0-6387-gf0027c05 (f0027c05c6d9de386fe963c04791f8b64dcaf290) quincy (dev)
[1] https://shaman.ceph.com/builds/ceph/master/f0027c05c6d9de386fe963c04791f8b64dcaf290/
#4 Updated by Neha Ojha over 2 years ago
- Priority changed from Normal to Urgent
#5 Updated by Yaarit Hatuka over 2 years ago
"stack_sig" key is missing from the crash metadata; do you see it in any other similar crashes?
#6 Updated by Dimitri Savineau over 2 years ago
I assume the "stack_sig" key is only available from the ceph crash info command, right?
The issue here is that ceph-mon starts but then crashes, so the crash dump isn't uploaded.
That's why the information comes only from the /var/lib/ceph/crash/<crash timestamp>/{meta,log} files.
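As an aside, the crash metadata is plain JSON on disk, so the presence or absence of "stack_sig" can be checked straight from the meta file without the ceph crash info command. Here is a hypothetical helper (not part of Ceph) that does just that; it assumes the nlohmann/json library is available.

#include <fstream>
#include <iostream>
#include <nlohmann/json.hpp>  // assumption: nlohmann/json is available

// Hypothetical helper, not part of Ceph: print the fields discussed in this
// thread from a /var/lib/ceph/crash/<crash timestamp>/meta file.
int main(int argc, char** argv) {
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " <path-to-meta-file>\n";
    return 1;
  }
  std::ifstream in(argv[1]);
  nlohmann::json meta = nlohmann::json::parse(in);
  std::cout << "crash_id:     " << meta.value("crash_id", "<missing>") << '\n'
            << "ceph_version: " << meta.value("ceph_version", "<missing>") << '\n'
            // absent in the meta above, as noted in comments #5 and #6
            << "stack_sig:    " << meta.value("stack_sig", "<missing>") << '\n';
}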
#7 Updated by Sage Weil over 2 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 42528
#8 Updated by Sridhar Seshasayee over 2 years ago
I am observing this very early when running CBT tests and qa/standalone tests. Bringing up ceph-mon generates the segfault:
$ sudo /usr/local/bin/ceph-run sudo sh -c "ulimit -n 16384 && ulimit -c unlimited && exec /usr/local/bin/ceph-mon -c /tmp/cbt/ceph/ceph.conf -i a --keyring=/tmp/cbt/ceph/client.admin/keyring --pid-file=/tmp/cbt/ceph/pid/sseshasa@incerta06.pid"
*** Caught signal (Segmentation fault) **
 in thread 7f340958a700 thread_name:ceph-mon
 ceph version 17.0.0-6387-gf0027c05c6d (f0027c05c6d9de386fe963c04791f8b64dcaf290) quincy (dev)
 1: /lib64/libpthread.so.0(+0x12dd0) [0x7f340644bdd0]
 2: /lib64/libc.so.6(+0x15ddb5) [0x7f34051d5db5]
 3: (StackStringBuf<4096ul>::xsputn(char const*, long)+0x2d8) [0x564b10795278]
 4: (std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)+0x154) [0x7f3405aefda4]
 5: (LogMonitor::log_external(LogEntry const&)+0xe37) [0x564b10818607]
 6: (LogMonitor::update_from_paxos(bool*)+0x1711) [0x564b108227d1]
 7: (PaxosService::refresh(bool*)+0x10a) [0x564b108f716a]
 8: (Monitor::refresh_from_paxos(bool*)+0x17c) [0x564b107a151c]
 9: (Monitor::init_paxos()+0xfc) [0x564b107a17cc]
 10: (Monitor::preinit()+0xbb9) [0x564b107cae79]
 11: main()
 12: __libc_start_main()
 13: _start()
2021-07-27T15:25:50.630+0000 7f340958a700 -1 *** Caught signal (Segmentation fault) ** in thread 7f340958a700 thread_name:ceph-mon
#9 Updated by Kefu Chai over 2 years ago
- Status changed from Fix Under Review to Resolved