Bug #62962

closed

mds: standby-replay daemon crashes on replay

Added by Milind Changire 8 months ago. Updated 5 months ago.

Status: Duplicate
Priority: Urgent
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: reef,quincy
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Standby-replay daemon crashes during replay when accessing inode map.
Ref: BZ2218759

[root@49cd6ae8516b working]# gdb -q /usr/bin/ceph-mds core.ceph-mds.167.9df534a325934dd2b50fa68d8b8aee29.2006407.1694647922000000
Core was generated by `ceph-mds --fsid=724e0358-2cfc-4a0f-9a99-419999493584 --keyring=/etc/ceph/keyrin'.
Program terminated with signal SIGSEGV, Segmentation fault.

(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=11, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007f2ad51495b3 in __pthread_kill_internal (signo=11, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007f2ad50fcd46 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#3  0x000055fb3bc25f6a in reraise_fatal (signum=11) at /usr/src/debug/ceph-17.2.6-138.el9cp.x86_64/src/global/signal_handler.cc:88
#4  handle_oneshot_fatal_signal (signum=11) at /usr/src/debug/ceph-17.2.6-138.el9cp.x86_64/src/global/signal_handler.cc:363
#5  <signal handler called>
#6  0x00007f2ad546f0d3 in std::_Rb_tree_decrement(std::_Rb_tree_node_base*) () from /lib64/libstdc++.so.6
#7  0x000055fb3b8f99d5 in std::_Rb_tree_iterator<std::pair<int const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::operator-- (this=<synthetic pointer>)
    at /usr/include/c++/11/bits/stl_tree.h:302
#8  std::_Rb_tree<int, std::pair<int const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::_Select1st<std::pair<int const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::less<int>, std::allocator<std::pair<int const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::_M_get_insert_unique_pos (this=<optimized out>, __k=@0x55fb7b594140: 1) at /usr/include/c++/11/bits/stl_tree.h:2080
#9  0x000055fb3bcc1522 in std::_Rb_tree<int, std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > >, std::_Select1st<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > >, std::less<int>, std::allocator<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > > >::_M_get_insert_hint_unique_pos (
    __k=@0x55fb7b594140: 1, __position={first = 774547952, second = std::unordered_set with 94540820119553 elements}, this=0x55fc2e2aaa68) at /usr/include/c++/11/bits/stl_tree.h:2209
#10 std::_Rb_tree<int, std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > >, std::_Select1st<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > >, std::less<int>, std::allocator<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<int const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > >, std::piecewise_construct_t const&, std::tuple<int const&>&&, std::tuple<>&&) [clone .constprop.0] [clone .isra.0] (this=0x55fc2e2aaa68, __pos=
  {first = 774547952, second = std::unordered_set with 94540820119553 elements}) at /usr/include/c++/11/bits/stl_tree.h:2435
#11 0x000055fb3bb7214c in std::map<int, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> >, std::less<int>, std::allocator<std::pair<int const, std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > > > >::operator[] (__k=@0x55fb3f330490: 1, 
    this=<optimized out>) at /usr/include/c++/11/bits/stl_tree.h:350
#12 MDSTableClient::got_journaled_ack (this=0x55fb3f330480, tid=<optimized out>) at /usr/src/debug/ceph-17.2.6-138.el9cp.x86_64/src/mds/MDSTableClient.cc:222
#13 0x000055fb3bbbd003 in MDLog::_replay_thread (this=0x55fb3f378300) at /usr/src/debug/ceph-17.2.6-138.el9cp.x86_64/src/mds/MDLog.cc:1436
#14 0x000055fb3b932861 in MDLog::ReplayThread::entry (this=<optimized out>) at /usr/src/debug/ceph-17.2.6-138.el9cp.x86_64/src/mds/MDLog.h:192
#15 0x00007f2ad5147802 in start_thread (arg=<optimized out>) at pthread_create.c:443
#16 0x00007f2ad50e7450 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

(gdb) print mds->whoami 
$2 = 0

(gdb) p mds->state
$3 = MDSMap::STATE_STANDBY_REPLAY

(gdb) f 12
#12 MDSTableClient::got_journaled_ack (this=0x55fb3f330480, tid=<optimized out>) at /usr/src/debug/ceph-17.2.6-138.el9cp.x86_64/src/mds/MDSTableClient.cc:222
222        pending_commit[tid]->pending_commit_tids[table].erase(tid);
(gdb) p table
$4 = 1
(gdb) p tid
$5 = <optimized out>
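
Frames #9 to #11 show `std::map::operator[]` on its insert path (`_M_emplace_hint_unique`), reached from the `pending_commit_tids[table]` expression on line 222. The following is a minimal standalone sketch of that standard-library behaviour only; it is not Ceph code and does not attempt to reproduce the crash or identify its cause. The names `PendingTids`, `erase_with_subscript` and `erase_with_find` are hypothetical, and the container type is an assumption copied from the template arguments in frame #11.

```cpp
#include <cstddef>
#include <map>
#include <unordered_set>

// Hypothetical stand-in for the container type printed in frame #11:
// std::map<int, std::unordered_set<unsigned long>>.
using PendingTids = std::map<int, std::unordered_set<unsigned long>>;

// Same shape as the statement at MDSTableClient.cc:222.
// operator[] default-constructs and inserts an empty set when `table`
// has no entry, so this call can modify the map before erasing.
inline std::size_t erase_with_subscript(PendingTids& m, int table, unsigned long tid) {
    return m[table].erase(tid);
}

// Lookup-only variant: find() never inserts, so the map is left
// untouched when `table` has no entry.
inline std::size_t erase_with_find(PendingTids& m, int table, unsigned long tid) {
    auto it = m.find(table);
    if (it == m.end()) {
        return 0;  // nothing recorded for this table
    }
    return it->second.erase(tid);
}
```

Calling `erase_with_subscript` on an empty map leaves one (empty) entry behind, while `erase_with_find` leaves the map empty. Both return the number of elements erased.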


Related issues 1 (1 open, 0 closed)

Related to CephFS - Bug #54741: crash: MDSTableClient::got_journaled_ack(unsigned long) | New | Venky Shankar

Actions #1

Updated by Venky Shankar 8 months ago

Milind, please update the description with the crash backtrace and as much debug state as possible.

Actions #2

Updated by Venky Shankar 7 months ago

A couple of crash backtraces from an internal channel:

{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7fbdba989df0]",
        "ceph-mds(+0x4facbf) [0x5632e26c3cbf]",
        "(EMetaBlob::fullbit::update_inode(MDSRank*, CInode*)+0x51) [0x5632e25d6991]",
        "(EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x7e3) [0x5632e25dbe33]",
        "(EOpen::replay(MDSRank*)+0x4f) [0x5632e25e702f]",
        "(MDLog::_replay_thread()+0x753) [0x5632e2594623]",
        "ceph-mds(+0x1416d1) [0x5632e230a6d1]",
        "/lib64/libc.so.6(+0x9f802) [0x7fbdba9d4802]",
        "/lib64/libc.so.6(+0x3f450) [0x7fbdba974450]" 
    ],

and

{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f0d21ccddf0]",
        "ceph-mds(+0x4d03a6) [0x563b865033a6]",
        "(MDSTableClient::got_journaled_ack(unsigned long)+0x16b) [0x563b863b415b]",
        "(MDLog::_replay_thread()+0x753) [0x563b863ff003]",
        "ceph-mds(+0x141861) [0x563b86174861]",
        "/lib64/libc.so.6(+0x9f802) [0x7f0d21d18802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f0d21cb8450]" 
    ],

It does seem that it is the standby-replay (s-r) MDS that is crashing and not the active MDS. Milind, please take a look and verify.

Actions #3

Updated by Venky Shankar 7 months ago

  • Assignee set to Milind Changire
  • Priority changed from Normal to Urgent
  • Target version set to v19.0.0
  • Source changed from other to Q/A
  • Backport set to reef,quincy
  • Severity changed from 3 - minor to 1 - critical
  • Component(FS) MDS added
Actions #4

Updated by Milind Changire 7 months ago

  • Description updated (diff)
Actions #5

Updated by Venky Shankar 5 months ago

  • Related to Bug #54741: crash: MDSTableClient::got_journaled_ack(unsigned long) added
Actions #6

Updated by Venky Shankar 5 months ago

  • Status changed from New to Duplicate