Bug #65039

open

mds: standby-replay segmentation fault in md_log_replay

Added by Patrick Donnelly about 1 month ago. Updated about 1 month ago.

Status: Triaged
Priority: Urgent
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: squid,reef
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2024-03-21T03:15:55.310 INFO:journalctl@ceph.mds.h.smithi060.stdout:Mar 21 03:15:55 smithi060 ceph-87dd0fc6-e72e-11ee-95c9-87774f69a715-mds-h[71557]: *** Caught signal (Segmentation fault) **
2024-03-21T03:15:55.310 INFO:journalctl@ceph.mds.h.smithi060.stdout:Mar 21 03:15:55 smithi060 ceph-87dd0fc6-e72e-11ee-95c9-87774f69a715-mds-h[71557]:  in thread 7f7135d7c700 thread_name:md_log_replay

From: /teuthology/pdonnell-2024-03-21_02:37:43-fs:workload-main-distro-default-smithi/7614435/teuthology.log

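I logged into the machine and collected a gdb stack trace (attached). Initially I was looking for a deadlock, not a segmentation fault. The SIGSEGV signal handler deadlocked (predictably) because it calls malloc: the fault was raised inside tcmalloc (frames #9 to #12), and the handler's backtrace printing (frames #4 to #6) then re-entered tcmalloc and got stuck on a spinlock in CentralFreeList::RemoveRange (frames #0 to #2), apparently the lock the interrupted allocation still held: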

Thread 26 (Thread 0x7f7135d7c700 (LWP 72204)):
#0  0x00007f7148e163d0 in base::internal::SpinLockDelay(int volatile*, int, int) () from /lib64/libtcmalloc.so.4
#1  0x00007f7148e162d3 in SpinLock::SlowLock() () from /lib64/libtcmalloc.so.4
#2  0x00007f7148e05a55 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /lib64/libtcmalloc.so.4
#3  0x00007f7148e093e3 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long)) () from /lib64/libtcmalloc.so.4
#4  0x00007f71484409b3 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char const*> () from /usr/lib64/ceph/libceph-common.so.2
#5  0x00007f7148440aa9 in ceph::ClibBackTrace::demangle[abi:cxx11](char const*) () from /usr/lib64/ceph/libceph-common.so.2
#6  0x00007f7148441025 in ceph::ClibBackTrace::print(std::ostream&) const () from /usr/lib64/ceph/libceph-common.so.2
#7  0x000055c9ae7266dd in handle_oneshot_fatal_signal (signum=11) at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/global/signal_handler.cc:331
#8  <signal handler called>
#9  0x00007f7148e05603 in tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**) () from /lib64/libtcmalloc.so.4
#10 0x00007f7148e058ae in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) () from /lib64/libtcmalloc.so.4
#11 0x00007f7148e05971 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /lib64/libtcmalloc.so.4
#12 0x00007f7148e093e3 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long)) () from /lib64/libtcmalloc.so.4
#13 0x000055c9ae311e17 in EMetaBlob::fullbit::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&) () at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/include/compact_map.h:27
#14 0x000055c9ae31429d in EMetaBlob::dirlump::_decode_bits (this=0x55c9b25c9770) at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/mds/events/EMetaBlob.h:609
#15 0x000055c9ae31c397 in EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*) () at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/mds/events/EMetaBlob.h:296
#16 0x000055c9ae322551 in EUpdate::replay(MDSRank*) () at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/mds/journal.cc:2252
#17 0x000055c9ae64dd97 in MDLog::_replay_thread (this=0x55c9b18e6000) at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/unique_ptr.h:421
#18 0x000055c9ae6543b1 in MDLog::ReplayThread::entry (this=<optimized out>) at /usr/src/debug/ceph-19.0.0-2244.gcab8141b.el8.x86_64/src/mds/MDLog.h:181
#19 0x00007f71471331ca in start_thread () from /lib64/libpthread.so.0
#20 0x00007f71456308d3 in clone () from /lib64/libc.so.6

Unfortunately, I didn't get a chance to dig into frame #13 to see why it segfaulted.


Files

gdb.log (33.8 KB), Patrick Donnelly, 03/21/2024 02:16 PM

#1

Updated by Venky Shankar about 1 month ago

  • Status changed from New to Triaged
  • Assignee set to Venky Shankar