Bug #47643

open

mds: Segmentation fault in thread 7fcff3078700 thread_name:md_log_replay

Added by Jan Fajerski over 3 years ago. Updated over 1 year ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In ceph-14.2.11.394+g9cbbc473c0 (a downstream build, but the MDS sources are the same as v14.2.11), we received a report of the following segfault:

#bt
#0  raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x0000561453080f20 in reraise_fatal (signum=11)
    at /usr/src/debug/ceph-14.2.11.394+g9cbbc473c0-3.47.2.x86_64/src/global/signal_handler.cc:81
#2  handle_fatal_signal (signum=11) at /usr/src/debug/ceph-14.2.11.394+g9cbbc473c0-3.47.2.x86_64/src/global/signal_handler.cc:326
#3  <signal handler called>
#4  0x0000561455327200 in ?? ()
#5  0x0000561453063bd1 in EMetaBlob::replay (this=this@entry=0x56145533a328, mds=mds@entry=0x561456127008, 
    logseg=logseg@entry=0x561456284000, slaveup=slaveup@entry=0x0)
    at /usr/src/debug/ceph-14.2.11.394+g9cbbc473c0-3.47.2.x86_64/src/mds/journal.cc:1412
#6  0x00005614530684ac in EUpdate::replay (this=0x56145533a300, mds=0x561456127008)
    at /usr/src/debug/ceph-14.2.11.394+g9cbbc473c0-3.47.2.x86_64/src/mds/journal.cc:2087
#7  0x00005614530064b2 in MDLog::_replay_thread (this=0x561455fd2dc0)
    at /usr/src/debug/ceph-14.2.11.394+g9cbbc473c0-3.47.2.x86_64/src/mds/MDLog.cc:1452
#8  0x0000561452d73ded in MDLog::ReplayThread::entry (this=<optimized out>)
    at /usr/src/debug/ceph-14.2.11.394+g9cbbc473c0-3.47.2.x86_64/src/mds/MDLog.h:94
#9  0x00007fecf7eb54f9 in start_thread (arg=0x7fece766e700) at pthread_create.c:465
#10 0x00007fecf70b8fbf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
#frame 5
#info locals
__PRETTY_FUNCTION__ = "void EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)" 
renamed_diri = 0x56145628bc00
olddir = <optimized out>
unlinked = {_M_t = {
    _M_impl = {<std::allocator<std::_Rb_tree_node<std::pair<CInode* const, CDir*> > >> = {<__gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<CInode* const, CDir*> > >> = {<No data fields>}, <No data fields>}, <std::_Rb_tree_key_compare<std::less<CInode*> >> = {
        _M_key_compare = {<std::binary_function<CInode*, CInode*, bool>> = {<No data fields>}, <No data fields>}}, <std::_Rb_tree_header> = {_M_header = {_M_color = std::_S_red, _M_parent = 0x561456135770, _M_left = 0x561456135770, _M_right = 0x561456135770}, 
        _M_node_count = 1}, <No data fields>}}}
linked = {_M_t = {
    _M_impl = {<std::allocator<std::_Rb_tree_node<CInode*> >> = {<__gnu_cxx::new_allocator<std::_Rb_tree_node<CInode*> >> = {<No data fields>}, <No data fields>}, <std::_Rb_tree_key_compare<std::less<CInode*> >> = {
        _M_key_compare = {<std::binary_function<CInode*, CInode*, bool>> = {<No data fields>}, <No data fields>}}, <std::_Rb_tree_header> = {_M_header = {_M_color = std::_S_red, _M_parent = 0x5614561358c0, _M_left = 0x5614561358c0, _M_right = 0x5614561358c0}, 
        _M_node_count = 1}, <No data fields>}}}
count = 4
#x 0x56145628bc00
0x56145628bc00:    0x5628d800
#x 0x5628d800
0x5628d800:    Cannot access memory at address 0x5628d800
#

Log section with debug 10:

   -50> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.ino(0x10000015dfa) mark_clean [inode 0x10000015dfa [2,head] #10000015dfa auth v86 s=1000602097 n(v0 rc2068-12-25 07:26:08.000000 b1000602097 1=1+0) (iversion lock) cr={16595069=0-2004877312@1} | dirty=1 0x560da6966e00]
   -49> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.den(0x10000016962 .BnF-34070-60-20200910143745-00065-large_2020_01_fogg238.bnf.fr.warc.gz.4pKIUU) mark_clean [dentry #0x1/dlweb/dta/24/warc_mab/001/BnF-34070-60-20200910143745-00065-large_2020_01_fogg238.fogg282.12111.transfer/.BnF-34070-60-20200910143745-00065-large_2020_01_fogg238.bnf.fr.warc.gz.4pKIUU [2,head] auth NULL (dversion lock) v=86 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x560da5a4b0c0]
   -48> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.dir(0x10000016962) mark_clean [dir 0x10000016962 /dlweb/dta/24/warc_mab/001/BnF-34070-60-20200910143745-00065-large_2020_01_fogg238.fogg282.12111.transfer/ [2,head] rep@-2.0 state=536870912 f(v0 m2020-09-14 16:50:35.665838 1=1+0) n(v1 rc2068-12-25 07:26:08.000000 b1000602097 1=1+0) hs=0+0,ss=0+0 | child=0 dirty=1 0x560da6962a00] version 87
   -47> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.ino(0x10000016962) mark_clean [inode 0x10000016962 [...2,head] #10000016962/ auth v18930 f(v0 m2020-09-14 16:50:35.665838 1=1+0) n(v1 rc2068-12-25 07:26:08.000000 b1000602097 2=1+1) (iversion lock) | dirfrag=0 dirty=1 0x560da6966700]
   -46> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.den(0x1000001652d BnF-34070-60-20200910143745-00065-large_2020_01_fogg238.fogg282.12111.transfer) mark_clean [dentry #0x1/dlweb/dta/24/warc_mab/001/BnF-34070-60-20200910143745-00065-large_2020_01_fogg238.fogg282.12111.transfer [2,head] auth NULL (dversion lock) v=18930 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x560da5a4aee0]
   -45> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache trim_non_auth_subtree(0x560da6914d00) Checking dentry 0x560da5a49fe0
   -44> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache trim_non_auth_subtree(0x560da6915200) [dir 0x10000015dfc /dlweb/dta/24/warc_mab/001/BnF-33950-60-20200909213034-00003-large_2020_01_fogg241.fogg283.15167.transfer/ [2,head] auth v=130 cv=0/0 state=1610612736 f(v0 m2020-09-14 16:54:02.442380 1=1+0) n(v1 rc2020-09-18 09:38:45.341598 b1000010126 1=1+0) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 0x560da6915200]
   -43> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache trim_non_auth_subtree(0x560da6915200) Checking dentry 0x560da5a4a1c0
   -42> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.ino(0x10000015dfd) mark_clean [inode 0x10000015dfd [2,head] #10000015dfd auth v129 s=1000010126 n(v0 rc2020-09-18 09:38:45.341598 b1000010126 1=1+0) (iversion lock) cr={16595069=0-2000683008@1} | dirty=1 0x560da695f500]
   -41> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.den(0x10000015dfc .BnF-33950-60-20200909213034-00003-large_2020_01_fogg241.bnf.fr.warc.gz.KP5paz) mark_clean [dentry #0x1/dlweb/dta/24/warc_mab/001/BnF-33950-60-20200909213034-00003-large_2020_01_fogg241.fogg283.15167.transfer/.BnF-33950-60-20200909213034-00003-large_2020_01_fogg241.bnf.fr.warc.gz.KP5paz [2,head] auth NULL (dversion lock) v=129 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x560da5a4a1c0]
   -40> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.dir(0x10000015dfc) mark_clean [dir 0x10000015dfc /dlweb/dta/24/warc_mab/001/BnF-33950-60-20200909213034-00003-large_2020_01_fogg241.fogg283.15167.transfer/ [2,head] rep@-2.0 state=536870912 f(v0 m2020-09-14 16:54:02.442380 1=1+0) n(v1 rc2020-09-18 09:38:45.341598 b1000010126 1=1+0) hs=0+0,ss=0+0 | child=0 dirty=1 0x560da6915200] version 130
   -39> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.ino(0x10000015dfc) mark_clean [inode 0x10000015dfc [...2,head] #10000015dfc/ auth v18926 f(v0 m2020-09-14 16:54:02.442380 1=1+0) n(v1 rc2020-09-18 09:38:45.341598 b1000010126 2=1+1) (iversion lock) | dirfrag=0 dirty=1 0x560da695ee00]
   -38> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.den(0x1000001652d BnF-33950-60-20200909213034-00003-large_2020_01_fogg241.fogg283.15167.transfer) mark_clean [dentry #0x1/dlweb/dta/24/warc_mab/001/BnF-33950-60-20200909213034-00003-large_2020_01_fogg241.fogg283.15167.transfer [2,head] auth NULL (dversion lock) v=18926 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x560da5a49fe0]
   -37> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache trim_non_auth_subtree(0x560da6914d00) Checking dentry 0x560da5a4b480
   -36> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache trim_non_auth_subtree(0x560da6915700) [dir 0x10000016961 /dlweb/dta/24/warc_mab/001/BnF-34120-60-20200910113049-00004-large_2020_01_fogg254/ [2,head] auth v=83 cv=0/0 state=1610612736 f(v0 m2020-09-18 09:40:14.258027 1=1+0) n(v3 rc2020-09-18 09:40:14.258027 b1008663617 1=1+0) hs=1+1,ss=0+0 dirty=2 | child=1 dirty=1 0x560da6915700]
   -35> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache trim_non_auth_subtree(0x560da6915700) Checking dentry 0x560da5a4a580
   -34> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.den(0x10000016961 .BnF-34120-60-20200910113049-00004-large_2020_01_fogg254.bnf.fr.warc.gz.NJOfyM) mark_clean [dentry #0x1/dlweb/dta/24/warc_mab/001/BnF-34120-60-20200910113049-00004-large_2020_01_fogg254/.BnF-34120-60-20200910113049-00004-large_2020_01_fogg254.bnf.fr.warc.gz.NJOfyM [2,head] auth NULL (dversion lock) v=81 ino=(nil) state=1610612800|bottomlru | inodepin=0 dirty=1 0x560da5a4a580]
   -33> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache trim_non_auth_subtree(0x560da6915700) Checking dentry 0x560da5a4b2a0
   -32> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.ino(0x10000015dfb) mark_clean [inode 0x10000015dfb [2,head] #10000015dfb auth v80 dirtyparent s=1008663617 n(v0 rc2020-09-18 09:40:14.258027 b1008663617 1=1+0)/n(v0 rc2020-09-18 09:40:14.256478 b1008663617 1=1+0) (iversion lock) cr={16595069=0-2017460224@1} | dirtyparent=1 dirty=1 0x560da6960300]
   -31> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.ino(0x10000015dfb) clear_dirty_parent
   -30> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.den(0x10000016961 BnF-34120-60-20200910113049-00004-large_2020_01_fogg254.bnf.fr.warc.gz) mark_clean [dentry #0x1/dlweb/dta/24/warc_mab/001/BnF-34120-60-20200910113049-00004-large_2020_01_fogg254/BnF-34120-60-20200910113049-00004-large_2020_01_fogg254.bnf.fr.warc.gz [2,head] auth NULL (dversion lock) v=80 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x560da5a4b2a0]
   -29> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.dir(0x10000016961) mark_clean [dir 0x10000016961 /dlweb/dta/24/warc_mab/001/BnF-34120-60-20200910113049-00004-large_2020_01_fogg254/ [2,head] rep@-2.0 state=536870912 f(v0 m2020-09-18 09:40:14.258027 1=1+0) n(v3 rc2020-09-18 09:40:14.258027 b1008663617 1=1+0) hs=0+0,ss=0+0 | child=0 dirty=1 0x560da6915700] version 83
   -28> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.ino(0x10000016961) mark_clean [inode 0x10000016961 [...2,head] #10000016961/ auth v18942 dirtyparent f(v0 m2020-09-18 09:40:14.258027 1=1+0) n(v3 rc2020-09-18 09:40:14.306187 b1008663617 2=1+1)/n(v3 rc2020-09-18 09:40:14.258866 b1008663617 2=1+1) (iversion lock) | dirfrag=0 dirtyparent=1 dirty=1 0x560da695fc00]
   -27> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.ino(0x10000016961) clear_dirty_parent
   -26> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.den(0x1000001652d BnF-34120-60-20200910113049-00004-large_2020_01_fogg254) mark_clean [dentry #0x1/dlweb/dta/24/warc_mab/001/BnF-34120-60-20200910113049-00004-large_2020_01_fogg254 [2,head] auth NULL (dversion lock) v=18942 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x560da5a4b480]
   -25> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache trim_non_auth_subtree(0x560da6914d00) Checking dentry 0x560da5a4ab20
   -24> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache trim_non_auth_subtree(0x560da6962500) [dir 0x10000015df6 /dlweb/dta/24/warc_mab/001/BnF-34000-60-20200910021815-00015-large_2020_01_fogg231.fogg280.22244.transfer/ [2,head] auth v=80 cv=0/0 state=1610612736 f(v0 m2020-09-14 16:50:35.634748 1=1+0) n(v1 rc2052-01-20 20:48:16.000000 b1000030065 1=1+0) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 0x560da6962500]
   -23> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache trim_non_auth_subtree(0x560da6962500) Checking dentry 0x560da5a4ad00
   -22> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.ino(0x10000015df8) mark_clean [inode 0x10000015df8 [2,head] #10000015df8 auth v79 s=1000030065 n(v0 rc2052-01-20 20:48:16.000000 b1000030065 1=1+0) (iversion lock) cr={16595069=0-2000683008@1} | dirty=1 0x560da6966000]
   -21> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.den(0x10000015df6 .BnF-34000-60-20200910021815-00015-large_2020_01_fogg231.bnf.fr.warc.gz.CV3ZRg) mark_clean [dentry #0x1/dlweb/dta/24/warc_mab/001/BnF-34000-60-20200910021815-00015-large_2020_01_fogg231.fogg280.22244.transfer/.BnF-34000-60-20200910021815-00015-large_2020_01_fogg231.bnf.fr.warc.gz.CV3ZRg [2,head] auth NULL (dversion lock) v=79 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x560da5a4ad00]
   -20> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.dir(0x10000015df6) mark_clean [dir 0x10000015df6 /dlweb/dta/24/warc_mab/001/BnF-34000-60-20200910021815-00015-large_2020_01_fogg231.fogg280.22244.transfer/ [2,head] rep@-2.0 state=536870912 f(v0 m2020-09-14 16:50:35.634748 1=1+0) n(v1 rc2052-01-20 20:48:16.000000 b1000030065 1=1+0) hs=0+0,ss=0+0 | child=0 dirty=1 0x560da6962500] version 80
   -19> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.ino(0x10000015df6) mark_clean [inode 0x10000015df6 [...2,head] #10000015df6/ auth v18929 f(v0 m2020-09-14 16:50:35.634748 1=1+0) n(v1 rc2052-01-20 20:48:16.000000 b1000030065 2=1+1) (iversion lock) | dirfrag=0 dirty=1 0x560da6961800]
   -18> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.den(0x1000001652d BnF-34000-60-20200910021815-00015-large_2020_01_fogg231.fogg280.22244.transfer) mark_clean [dentry #0x1/dlweb/dta/24/warc_mab/001/BnF-34000-60-20200910021815-00015-large_2020_01_fogg231.fogg280.22244.transfer [2,head] auth NULL (dversion lock) v=18929 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x560da5a4ab20]
   -17> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.dir(0x1000001652d) mark_clean [dir 0x1000001652d /dlweb/dta/24/warc_mab/001/ [2,head] rep@-2.0 state=536870912 f(v0 m2020-09-18 09:40:14.306187 139=0+139) n(v586 rc2105-03-17 09:21:20.000000 b139510542699 281=142+139) hs=0+0,ss=0+0 | child=0 dirty=1 0x560da6914d00] version 18945
   -16> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.ino(0x1000001652d) mark_clean [inode 0x1000001652d [...2,head] #1000001652
d/ auth v842121 f(v0 m2020-09-18 09:40:14.306187 139=0+139) n(v586 rc2105-03-17 09:21:20.000000 b139510542699 282=142+140) (iversion lock) | dirfrag=0 dirty=1 0x560da695e700]
   -15> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.den(0x1000000044b 001) mark_clean [dentry #0x1/dlweb/dta/24/warc_mab/001 [2,head] auth NULL (dversion lock) v=842121 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x560da5a49e00]
   -14> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.dir(0x1000000044b) mark_clean [dir 0x1000000044b /dlweb/dta/24/warc_mab/ [2,head] rep@-2.0 state=536870912 f(v0 m2020-09-10 04:50:02.244017 1=0+1) n(v21828 rc2105-11-01 11:05:52.000000 b139510542699 282=142+140) hs=0+0,ss=0+0 | child=0 dirty=1 0x560da6914800] version 842122
   -13> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.ino(0x1000000044b) mark_clean [inode 0x1000000044b [...2,head] #1000000044b/ auth v307393 f(v0 m2020-09-10 04:50:02.244017 1=0+1) n(v21828 rc2105-11-01 11:05:52.000000 b139510542699 283=142+141) (iversion lock) | dirfrag=0 dirty=1 0x560da695e000]
   -12> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.den(0x1000000000f warc_mab) mark_clean [dentry #0x1/dlweb/dta/24/warc_mab [2,head] auth NULL (dversion lock) v=307393 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x560da5a49c20]
   -11> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.dir(0x1000000000f) mark_clean [dir 0x1000000000f /dlweb/dta/24/ [2,head] rep@-2.0 state=536870912 f(v0 m2020-09-09 14:31:32.996262 4=0+4) n(v19259 rc2105-11-01 11:05:52.000000 b546844418307 792=639+153) hs=0+0,ss=0+0 | child=0 dirty=1 0x560da6914300] version 307394
   -10> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.ino(0x1000000000f) mark_clean [inode 0x1000000000f [...2,head] #1000000000f/ auth v304548 snaprealm=0x560da59fe080 f(v0 m2020-09-09 14:31:32.996262 4=0+4) n(v19259 rc2105-11-01 11:05:52.000000 b546844418307 793=639+154) (iversion lock) | dirfrag=0 dirty=1 0x560da66e5800]
    -9> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.den(0x1000000000e 24) mark_clean [dentry #0x1/dlweb/dta/24 [2,head] auth NULL (dversion lock) v=304548 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x560da5a49a40]
    -8> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.dir(0x1000000000e) mark_clean [dir 0x1000000000e /dlweb/dta/ [2,head] rep@-2.0 state=536870912 f(v0 m2020-06-30 15:26:11.768092 1=0+1) n(v20037 rc2105-11-01 11:05:52.000000 b546844418307 793=639+154) hs=0+0,ss=0+0 | child=0 dirty=1 0x560da6913e00] version 304549
    -7> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.ino(0x1000000000e) mark_clean [inode 0x1000000000e [...2,head] #1000000000e/ auth v293396 f(v0 m2020-06-30 15:26:11.768092 1=0+1) n(v20037 rc2105-11-01 11:05:52.000000 b546844418307 794=639+155) (iversion lock) | dirfrag=0 dirty=1 0x560da66e5100]
    -6> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.den(0x1000000000d dta) mark_clean [dentry #0x1/dlweb/dta [2,head] auth NULL (dversion lock) v=293396 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x560da5a49860]
    -5> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.dir(0x1000000000d) mark_clean [dir 0x1000000000d /dlweb/ [2,head] rep@-2.0 state=536870912 f(v0 m2020-06-30 15:25:54.672057 1=0+1) n(v14361 rc2105-11-01 11:05:52.000000 b546844418307 794=639+155) hs=0+0,ss=0+0 | child=0 dirty=1 0x560da6913900] version 293397
    -4> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.ino(0x1000000000d) mark_clean [inode 0x1000000000d [...2,head] #1000000000d/ auth v293401 f(v0 m2020-06-30 15:25:54.672057 1=0+1) n(v14361 rc2105-11-01 11:05:52.000000 b546844418307 795=639+156) (iversion lock) | dirfrag=0 dirty=1 0x560da66e4a00]
    -3> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache.den(0x1 dlweb) mark_clean [dentry #0x1/dlweb [2,head] auth NULL (dversion lock) v=293401 ino=(nil) state=1610612736 | inodepin=0 dirty=1 0x560da5a49680]
    -2> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache |__ 0    auth [dir 0x100 ~mds0/ [2,head] auth v=191301 cv=0/0 dir_auth=0 state=1073741824 f(v0 10=0+10) n(v93 rc2105-11-01 11:05:52.000000 10=0+10) hs=0+0,ss=0+0 | subtree=1 0x560da6912f00]
    -1> 2020-09-23 15:26:13.339 7f323739e700 10 mds.0.cache |__-2     rep [dir 0x1 / [2,head] rep@-2.0 dir_auth=-2 state=536870912 f(v0 m2020-09-15 09:37:15.745370 3=1+2) n(v16115 rc2105-11-01 11:05:52.000000 b547976890111 957=787+170) hs=0+0,ss=0+0 | child=0 subtree=1 dirty=1 0x560da6913400]
     0> 2020-09-23 15:26:13.339 7f323739e700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f323739e700 thread_name:md_log_replay

 ceph version 14.2.11-394-g9cbbc473c0 (9cbbc473c02686761b4d27bfd134215209f85d2f) nautilus (stable)
 1: (()+0x132d0) [0x7f3247bf02d0]
 2: [0x560da59eb2e0]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
  10/10 mds
  10/10 mds_balancer
  10/10 mds_locker
  10/10 mds_log
  10/10 mds_log_expire
  10/10 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.gondor-m02.log
--- end dump of recent events ---
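
The thread ID in the crash header differs between the subject (7fcff3078700) and this log (7f323739e700), while the thread name is md_log_replay in both. Below is a minimal Python sketch, assuming the two-line crash header format shown in the dump above, that extracts the signal, thread ID, and thread name so that similar reports can be compared by thread name. The function name and regular expressions are illustrative and not part of Ceph.

```python
import re

# Patterns for the two-line crash header as it appears in the dump above.
SIGNAL_RE = re.compile(r"\*\*\* Caught signal \((?P<signal>[^)]+)\) \*\*")
THREAD_RE = re.compile(r"in thread (?P<tid>[0-9a-f]+) thread_name:(?P<name>\S+)")


def parse_crash_header(text):
    """Return signal, thread ID and thread name, or None if no header is found."""
    sig = SIGNAL_RE.search(text)
    thr = THREAD_RE.search(text)
    if sig is None or thr is None:
        return None
    return {
        "signal": sig.group("signal"),
        "thread_id": thr.group("tid"),
        "thread_name": thr.group("name"),
    }


# The header lines copied from the log in this report.
sample = (
    "     0> 2020-09-23 15:26:13.339 7f323739e700 -1 *** Caught signal (Segmentation fault) **\n"
    " in thread 7f323739e700 thread_name:md_log_replay\n"
)
print(parse_crash_header(sample))
```

Matching on thread_name rather than the thread ID is the design choice here, since the ID is not stable across the instances quoted in this report.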

Actions #1

Updated by Patrick Donnelly over 3 years ago

  • Subject changed from Segmentation fault in thread 7fcff3078700 thread_name:md_log_replay to mds: Segmentation fault in thread 7fcff3078700 thread_name:md_log_replay
  • Status changed from New to Need More Info

#x 0x5628d800

I'm not sure this double-deref is indicating anything. Are you sure that's a pointer? Would you not want:

print *renamed_diri

or (if that doesn't work)

print (CInode)renamed_diri

?
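
For context on why the second x fails: gdb's x command reads 4-byte words by default, so x 0x56145628bc00 prints only the low 32 bits of whatever 8-byte value sits at the start of the object. Treating that truncated value as an address gives 0x5628d800, which is unmapped. The Python sketch below reproduces the truncation. The full 8-byte value is hypothetical (only its low word appears in the report), and a little-endian x86_64 layout is assumed.

```python
import struct


def first_word_le(value64):
    """Return the first 4-byte little-endian word of an 8-byte value."""
    raw = struct.pack("<Q", value64)          # 8 bytes as laid out in memory
    return struct.unpack("<I", raw[:4])[0]    # what a 4-byte read at that address sees


# Hypothetical full 8-byte value; only the low word 0x5628d800 is known from the report.
hypothetical_qword = 0x000056145628D800
print(hex(first_word_le(hypothetical_qword)))  # prints 0x5628d800
```

Reading the full 8 bytes (for example with the giant-word unit of x) or printing the object through its typed pointer avoids this truncation.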

What was the state of this cluster and workload? It could be this is a symptom of running out of memory.

Actions #2

Updated by Jan Fajerski over 3 years ago

  • Assignee set to Yehuda Sadeh
  • Severity deleted (3 - minor)

Patrick Donnelly wrote:

#x 0x5628d800

I'm not sure this double-deref is indicating anything. Are you sure that's a pointer? Would you not want:

Right, apologies for the noise.

The CInode output is pretty huge, do we need all of it here?

I had a look at what CInode::authority does with this CInode:

#print renamed_diri->inode_auth.first
$15 = -1
#print renamed_diri->parent
$16 = (CDentry *) 0x0
#print renamed_diri->projected_parent
$17 = {<std::__cxx11::_List_base<CDentry*, mempool::pool_allocator<(mempool::pool_index_t)18, CDentry*> >> = {
    _M_impl = {<mempool::pool_allocator<(mempool::pool_index_t)18, std::_List_node<CDentry*> >> = {pool = 0x7fed01149200, 
        type = 0x0}, _M_node = {<std::__detail::_List_node_base> = {_M_next = 0x56145628bf88, _M_prev = 0x56145628bf88}, 
        _M_storage = {_M_storage = "\000\000\000\000\000\000\000"}}}}, <No data fields>}

What was the state of this cluster and workload? It could be this is a symptom of running out of memory.

The cluster seems reasonably healthy:

  cluster:
    id:     09501e61-ac58-4631-894d-843071a04a6d
    health: HEALTH_WARN
            1 filesystem is degraded
            insufficient standby MDS daemons available
            BlueFS spillover detected on 1 OSD(s)
            10 daemons have recently crashed

The crashed daemons would be the MDSs, IIUC.
It doesn't seem to be OOM; at least the host logs don't mention anything. The MDS hosts also run MONs and MGRs.

No idea about the workload yet (a query is pending), but it looks like the MDS cluster had been shut down for a couple of days prior to this. I requested a timeline of events to confirm this.

Actions #3

Updated by Jan Fajerski over 3 years ago

  • Assignee deleted (Yehuda Sadeh)

Actions #4

Updated by Tomasz Kuzemko over 1 year ago

I've run into an issue very similar to this one on 17.2.0. I tried upgrading the MDS to 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) but still get the same results. It is fully reproducible on my test cluster. I can provide additional information if needed.

It is unclear to me what triggered this bug. A few days ago I was running some tests under tight memory conditions on this cluster, during which multiple OSDs OOMed. This seems to have resulted in some client lock issues, and I had to delete a few CephFS volumes. I didn't do anything with this FS in the meantime, and suddenly the MDS started to crash and I saw this segfault in the logs. I tried running ceph mds fail and recreating all MDS daemons, but this didn't help.

Below is a log snippet taken with debug_mds = 20:

    -8> 2022-08-22T22:09:31.537+0000 7f595898a700 10 mds.0.cache.dir(0x10000000000) mark_clean [dir 0x10000000000 /volumes/ [2,head] rep@-2.0 state=536870912 f(v0 m2022-08-19T09:34:21.639654+0000 5=2+3) n(v39 rc2022-08-19T09:34:21.639654+0000 b1246127868 149=135+14) hs=0+0,ss=0+0 | child=0 dirty=1 0x5621954bda80] version 368
    -7> 2022-08-22T22:09:31.537+0000 7f595898a700 20 mds.0.cache trim_non_auth_subtree(0x5621954bd600) removing inode 0x56219637c100 with dentry0x56219550db80
    -6> 2022-08-22T22:09:31.537+0000 7f595898a700 12 mds.0.cache.dir(0x1) unlink_inode [dentry #0x1/volumes [2,head] auth (dversion lock) v=0 ino=0x10000000000 state=1073741824 0x56219550db80] [inode 0x10000000000 [...6,head] /volumes/ auth v4922 f(v0 m2022-08-19T09:34:21.639654+0000 5=2+3) n(v39 rc2022-08-19T09:34:21.639654+0000 b1246127868 150=135+15) old_inodes=4 (iversion lock) 0x56219637c100]
    -5> 2022-08-22T22:09:31.537+0000 7f595898a700 14 mds.0.cache remove_inode [inode 0x10000000000 [...6,head] #10000000000/ auth v4922 f(v0 m2022-08-19T09:34:21.639654+0000 5=2+3) n(v39 rc2022-08-19T09:34:21.639654+0000 b1246127868 150=135+15) old_inodes=4 (iversion lock) 0x56219637c100]
    -4> 2022-08-22T22:09:31.537+0000 7f595898a700 12 mds.0.cache.dir(0x1) remove_dentry [dentry #0x1/volumes [2,head] auth NULL (dversion lock) v=0 ino=(nil) state=1073741824 0x56219550db80]
    -3> 2022-08-22T22:09:31.537+0000 7f595898a700 15 mds.0.cache show_subtrees
    -2> 2022-08-22T22:09:31.537+0000 7f595898a700 10 mds.0.cache |__ 0    auth [dir 0x100 ~mds0/ [2,head] auth v=455 cv=0/0 dir_auth=0 state=1073741824 f(v0 10=0+10) n(v16 rc2022-08-19T08:45:28.143849+0000 10=0+10) hs=0+0,ss=0+0 | subtree=1 0x5621954bd180]
    -1> 2022-08-22T22:09:31.537+0000 7f595898a700 10 mds.0.cache |__-2     rep [dir 0x1 / [2,head] rep@-2.0 dir_auth=-2 state=0 f(v0 m2022-08-17T22:20:24.333313+0000 2=0+2) n(v59 rc2022-08-19T09:34:21.639654+0000 b1418513641 285=264+21) hs=0+0,ss=0+0 | child=0 subtree=1 0x5621954bd600]
     0> 2022-08-22T22:09:31.544+0000 7f595898a700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f595898a700 thread_name:md_log_replay

 ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f59679abce0]
 2: /usr/lib64/ceph/libceph-common.so.2(+0xe03e00) [0x7f596954be00]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.