Bug #65039
mds: standby-replay segmentation fault in md_log_replay
Status: Triaged
Priority: Urgent
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: squid,reef
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2024-03-21T03:15:55.310 INFO:journalctl@ceph.mds.h.smithi060.stdout:Mar 21 03:15:55 smithi060 ceph-87dd0fc6-e72e-11ee-95c9-87774f69a715-mds-h[71557]: *** Caught signal (Segmentation fault) **
2024-03-21T03:15:55.310 INFO:journalctl@ceph.mds.h.smithi060.stdout:Mar 21 03:15:55 smithi060 ceph-87dd0fc6-e72e-11ee-95c9-87774f69a715-mds-h[71557]: in thread 7f7135d7c700 thread_name:md_log_replay
From: /teuthology/pdonnell-2024-03-21_02:37:43-fs:workload-main-distro-default-smithi/7614435/teuthology.log
I logged into the machine and collected a gdb stack trace (attached). Initially I was looking for a deadlock, not a segmentation fault. The SIGSEGV signal handler itself deadlocked (predictably) because it calls malloc, which is not async-signal-safe:
Thread 26 (Thread 0x7f7135d7c700 (LWP 72204)):
#0 0x00007f7148e163d0 in base::internal::SpinLockDelay(int volatile*, int, int) () from /lib64/libtcmalloc.so.4
#1 0x00007f7148e162d3 in SpinLock::SlowLock() () from /lib64/libtcmalloc.so.4
#2 0x00007f7148e05a55 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /lib64/libtcmalloc.so.4
#3 0x00007f7148e093e3 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long)) () from /lib64/libtcmalloc.so.4
#4 0x00007f71484409b3 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char const*> () from /usr/lib64/ceph/libceph-common.so.2
#5 0x00007f7148440aa9 in ceph::ClibBackTrace::demangle[abi:cxx11](char const*) () from /usr/lib64/ceph/libceph-common.so.2
#6 0x00007f7148441025 in ceph::ClibBackTrace::print(std::ostream&) const () from /usr/lib64/ceph/libceph-common.so.2
#7 0x000055c9ae7266dd in handle_oneshot_fatal_signal (signum=11) at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/global/signal_handler.cc:331
#8 <signal handler called>
#9 0x00007f7148e05603 in tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**) () from /lib64/libtcmalloc.so.4
#10 0x00007f7148e058ae in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) () from /lib64/libtcmalloc.so.4
#11 0x00007f7148e05971 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /lib64/libtcmalloc.so.4
#12 0x00007f7148e093e3 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long)) () from /lib64/libtcmalloc.so.4
#13 0x000055c9ae311e17 in EMetaBlob::fullbit::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&) () at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/include/compact_map.h:27
#14 0x000055c9ae31429d in EMetaBlob::dirlump::_decode_bits (this=0x55c9b25c9770) at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/mds/events/EMetaBlob.h:609
#15 0x000055c9ae31c397 in EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*) () at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/mds/events/EMetaBlob.h:296
#16 0x000055c9ae322551 in EUpdate::replay(MDSRank*) () at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/mds/journal.cc:2252
#17 0x000055c9ae64dd97 in MDLog::_replay_thread (this=0x55c9b18e6000) at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/unique_ptr.h:421
#18 0x000055c9ae6543b1 in MDLog::ReplayThread::entry (this=<optimized out>) at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/mds/MDLog.h:181
#19 0x00007f71471331ca in start_thread () from /lib64/libpthread.so.0
#20 0x00007f71456308d3 in clone () from /lib64/libc.so.6
Unfortunately, I didn't get a chance to dig into frame #13 to see why it segfaulted.
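The hang in frames #0 through #8 looks like the usual hazard of allocating inside a signal handler. The thread took SIGSEGV inside tcmalloc's CentralFreeList (frames #9 to #12), apparently while holding its spinlock, and the handler then re-entered the allocator through ClibBackTrace::demangle. As a minimal sketch of the alternative, assuming a POSIX system, the handler below formats its message into a stack buffer and emits it with write(2), so it never touches the heap. The names format_signal_message, safe_fatal_handler and install_safe_fatal_handler are hypothetical. This is not the code in Ceph's signal_handler.cc and not a proposed patch.

```cpp
#include <csignal>
#include <cstddef>
#include <cstring>
#include <unistd.h>

// Hypothetical helper: writes "*** Caught signal <N> ***\n" into a
// caller-provided buffer without allocating. Output is truncated at cap.
// Returns the number of bytes written.
std::size_t format_signal_message(int signum, char* buf, std::size_t cap) {
  static const char prefix[] = "*** Caught signal ";
  static const char suffix[] = " ***\n";
  std::size_t n = 0;
  auto append = [&](const char* s, std::size_t len) {
    for (std::size_t i = 0; i < len && n < cap; ++i) buf[n++] = s[i];
  };
  append(prefix, sizeof(prefix) - 1);

  // Convert the signal number to decimal by hand so that no library
  // formatting routine (which might allocate or lock) is involved.
  char digits[12];
  std::size_t d = 0;
  unsigned v = signum < 0 ? 0u : static_cast<unsigned>(signum);
  do {
    digits[d++] = static_cast<char>('0' + v % 10);
    v /= 10;
  } while (v != 0);
  while (d > 0 && n < cap) buf[n++] = digits[--d];

  append(suffix, sizeof(suffix) - 1);
  return n;
}

// Hypothetical handler: uses only a stack buffer plus write(2) and raise(3),
// both of which POSIX lists as async-signal-safe.
extern "C" void safe_fatal_handler(int signum) {
  char buf[64];
  std::size_t n = format_signal_message(signum, buf, sizeof(buf));
  ssize_t ignored = write(STDERR_FILENO, buf, n);
  (void)ignored;
  // SA_RESETHAND has already restored the default disposition, so
  // re-raising terminates the process (and dumps core if enabled).
  raise(signum);
}

// Hypothetical installer: returns true if the handler was registered.
bool install_safe_fatal_handler(int signum) {
  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_handler = safe_fatal_handler;
  sigemptyset(&sa.sa_mask);
  // One-shot handler; do not block the signal while handling it, so the
  // raise() above is delivered immediately.
  sa.sa_flags = SA_RESETHAND | SA_NODEFER;
  return sigaction(signum, &sa, nullptr) == 0;
}
```

Calling install_safe_fatal_handler(SIGSEGV) early in main would route a crash through this handler. The trade-off is that a demangled backtrace is no longer printed in-process; under this approach it would have to come from a core dump or an external tool, since demangling allocates.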
Updated by Venky Shankar about 1 month ago
- Status changed from New to Triaged
- Assignee set to Venky Shankar