Bug #62962
mds: standby-replay daemon crashes on replay
Status:
Duplicate
Priority:
Urgent
Assignee:
Category:
Correctness/Safety
Target version:
% Done:
0%
Source:
Q/A
Tags:
Backport:
reef,quincy
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Standby-replay daemon crashes during replay when accessing inode map.
Ref: BZ2218759
[root@49cd6ae8516b working]# gdb -q /usr/bin/ceph-mds core.ceph-mds.167.9df534a325934dd2b50fa68d8b8aee29.2006407.1694647922000000
Core was generated by `ceph-mds --fsid=724e0358-2cfc-4a0f-9a99-419999493584 --keyring=/etc/ceph/keyrin'.
Program terminated with signal SIGSEGV, Segmentation fault.
(gdb) bt
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=11, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007f2ad51495b3 in __pthread_kill_internal (signo=11, threadid=<optimized out>) at pthread_kill.c:78
#2 0x00007f2ad50fcd46 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#3 0x000055fb3bc25f6a in reraise_fatal (signum=11) at /usr/src/debug/ceph-17.2.6-138.el9cp.x86_64/src/global/signal_handler.cc:88
#4 handle_oneshot_fatal_signal (signum=11) at /usr/src/debug/ceph-17.2.6-138.el9cp.x86_64/src/global/signal_handler.cc:363
#5 <signal handler called>
#6 0x00007f2ad546f0d3 in std::_Rb_tree_decrement(std::_Rb_tree_node_base*) () from /lib64/libstdc++.so.6
#7 0x000055fb3b8f99d5 in std::_Rb_tree_iterator<std::pair<int const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::operator-- (this=<synthetic pointer>) at /usr/include/c++/11/bits/stl_tree.h:302
#8 std::_Rb_tree<int, std::pair<int const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::_Select1st<std::pair<int const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::less<int>, std::allocator<std::pair<int const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::_M_get_insert_unique_pos (this=<optimized out>, __k=@0x55fb7b594140: 1) at /usr/include/c++/11/bits/stl_tree.h:2080
#9 0x000055fb3bcc1522 in std::_Rb_tree<int, std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > >, std::_Select1st<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > >, std::less<int>, std::allocator<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > > >::_M_get_insert_hint_unique_pos (__k=@0x55fb7b594140: 1, __position={first = 774547952, second = std::unordered_set with 94540820119553 elements}, this=0x55fc2e2aaa68) at /usr/include/c++/11/bits/stl_tree.h:2209
#10 std::_Rb_tree<int, std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > >, std::_Select1st<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > >, std::less<int>, std::allocator<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<int const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > >, std::piecewise_construct_t const&, std::tuple<int const&>&&, std::tuple<>&&) [clone .constprop.0] [clone .isra.0] (this=0x55fc2e2aaa68, __pos={first = 774547952, second = std::unordered_set with 94540820119553 elements}) at /usr/include/c++/11/bits/stl_tree.h:2435
#11 0x000055fb3bb7214c in std::map<int, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> >, std::less<int>, std::allocator<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > > >::operator[] (__k=@0x55fb3f330490: 1, this=<optimized out>) at /usr/include/c++/11/bits/stl_tree.h:350
#12 MDSTableClient::got_journaled_ack (this=0x55fb3f330480, tid=<optimized out>) at /usr/src/debug/ceph-17.2.6-138.el9cp.x86_64/src/mds/MDSTableClient.cc:222
#13 0x000055fb3bbbd003 in MDLog::_replay_thread (this=0x55fb3f378300) at /usr/src/debug/ceph-17.2.6-138.el9cp.x86_64/src/mds/MDLog.cc:1436
#14 0x000055fb3b932861 in MDLog::ReplayThread::entry (this=<optimized out>) at /usr/src/debug/ceph-17.2.6-138.el9cp.x86_64/src/mds/MDLog.h:192
#15 0x00007f2ad5147802 in start_thread (arg=<optimized out>) at pthread_create.c:443
#16 0x00007f2ad50e7450 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) print mds->whoami
$2 = 0
(gdb) p mds->state
$3 = MDSMap::STATE_STANDBY_REPLAY
(gdb) f 12
#12 MDSTableClient::got_journaled_ack (this=0x55fb3f330480, tid=<optimized out>) at /usr/src/debug/ceph-17.2.6-138.el9cp.x86_64/src/mds/MDSTableClient.cc:222
222 pending_commit[tid]->pending_commit_tids[table].erase(tid);
(gdb) p table
$4 = 1
(gdb) p tid
$5 = <optimized out>
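The statement at MDSTableClient.cc:222 (frame #12) indexes a map that is reached through a pointer stored in pending_commit, and the implausible __position contents in frames #9 and #10 would be consistent with that pointer no longer referring to a valid object. This is an assumption, not a confirmed root cause. The sketch below uses hypothetical stand-in types (Segment, TableClient, track, forget_segment), none of which are Ceph's actual classes or API, to show the general hazard: a tid-to-segment map holding raw pointers stays safe only if entries are dropped before the segment they point to is destroyed.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <unordered_set>

// Hypothetical stand-in for a log segment; not the Ceph LogSegment class.
struct Segment {
    // table id -> tids still awaiting commit in this segment
    std::map<int, std::unordered_set<std::uint64_t>> pending_commit_tids;
};

// Hypothetical stand-in for a table client; not MDSTableClient.
class TableClient {
public:
    explicit TableClient(int table) : table_(table) {}

    // Record that tid is pending and belongs to seg.
    void track(std::uint64_t tid, Segment* seg) {
        seg->pending_commit_tids[table_].insert(tid);
        pending_commit_[tid] = seg;
    }

    // Must be called before seg is destroyed, so that no map entry
    // is left holding a dangling pointer to it.
    void forget_segment(const Segment* seg) {
        for (auto it = pending_commit_.begin(); it != pending_commit_.end();) {
            if (it->second == seg)
                it = pending_commit_.erase(it);
            else
                ++it;
        }
    }

    // Returns true if tid was tracked and has now been cleared.
    bool got_journaled_ack(std::uint64_t tid) {
        auto it = pending_commit_.find(tid);  // find() never inserts
        if (it == pending_commit_.end())
            return false;
        it->second->pending_commit_tids[table_].erase(tid);
        pending_commit_.erase(it);
        return true;
    }

    std::size_t pending() const { return pending_commit_.size(); }

private:
    int table_;
    std::map<std::uint64_t, Segment*> pending_commit_;
};
```

If forget_segment() were skipped before the segment is deleted, a later got_journaled_ack() for one of its tids would dereference freed memory, which is undefined behaviour and can surface as a SIGSEGV inside the std::map internals.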
Related issues
History
#1 Updated by Venky Shankar 2 months ago
Milind, please update the description with the crash backtrace and debug status as much as possible.
#2 Updated by Venky Shankar 2 months ago
A couple of crash backtraces from an internal channel:
{
  "backtrace": [
    "/lib64/libc.so.6(+0x54df0) [0x7fbdba989df0]",
    "ceph-mds(+0x4facbf) [0x5632e26c3cbf]",
    "(EMetaBlob::fullbit::update_inode(MDSRank*, CInode*)+0x51) [0x5632e25d6991]",
    "(EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x7e3) [0x5632e25dbe33]",
    "(EOpen::replay(MDSRank*)+0x4f) [0x5632e25e702f]",
    "(MDLog::_replay_thread()+0x753) [0x5632e2594623]",
    "ceph-mds(+0x1416d1) [0x5632e230a6d1]",
    "/lib64/libc.so.6(+0x9f802) [0x7fbdba9d4802]",
    "/lib64/libc.so.6(+0x3f450) [0x7fbdba974450]"
  ],
and
{
  "backtrace": [
    "/lib64/libc.so.6(+0x54df0) [0x7f0d21ccddf0]",
    "ceph-mds(+0x4d03a6) [0x563b865033a6]",
    "(MDSTableClient::got_journaled_ack(unsigned long)+0x16b) [0x563b863b415b]",
    "(MDLog::_replay_thread()+0x753) [0x563b863ff003]",
    "ceph-mds(+0x141861) [0x563b86174861]",
    "/lib64/libc.so.6(+0x9f802) [0x7f0d21d18802]",
    "/lib64/libc.so.6(+0x3f450) [0x7f0d21cb8450]"
  ],
It does seem like it's the standby-replay (s-r) MDS that's crashing and not the active MDS. Milind, please take a look and verify.
#3 Updated by Venky Shankar 2 months ago
- Assignee set to Milind Changire
- Priority changed from Normal to Urgent
- Target version set to v19.0.0
- Source changed from other to Q/A
- Backport set to reef,quincy
- Severity changed from 3 - minor to 1 - critical
- Component(FS) MDS added
#4 Updated by Milind Changire 2 months ago
- Description updated (diff)
#5 Updated by Venky Shankar 11 days ago
- Related to Bug #54741: crash: MDSTableClient::got_journaled_ack(unsigned long) added
#6 Updated by Venky Shankar 11 days ago
- Status changed from New to Duplicate
Duplicate of https://tracker.ceph.com/issues/54741