Bug #1022: every mds crash: Program terminated with signal 11, Segmentation fault.
Status: Closed
% Done: 0%
Description
A few seconds after startup, all my MDSes crash with the following message:
ceph version (commit:), process cmds, pid 6614
2011-04-21 14:53:04.928342 7fad87f50700 mds-1.0 ms_handle_connect on 192.168.20.9:6789/0
2011-04-21 14:53:09.040809 7fad87f50700 mds-1.0 handle_mds_map standby
2011-04-21 14:53:23.122717 7fad87f50700 mds-1.0 handle_mds_map standby
2011-04-21 14:53:38.194972 7fad87f50700 mds0.342 handle_mds_map i am now mds0.342
2011-04-21 14:53:38.194999 7fad87f50700 mds0.342 handle_mds_map state change up:standby --> up:replay
2011-04-21 14:53:38.195008 7fad87f50700 mds0.342 replay_start
2011-04-21 14:53:38.195026 7fad87f50700 mds0.342 recovery set is
2011-04-21 14:53:38.195035 7fad87f50700 mds0.342 need osdmap epoch 2013, have 2010
2011-04-21 14:53:38.195077 7fad87f50700 mds0.cache handle_mds_failure mds0 : recovery peers are
2011-04-21 14:53:38.330797 7fad87f50700 mds0.342 ms_handle_connect on 192.168.20.9:6801/3628
2011-04-21 14:53:38.331087 7fad87f50700 mds0.342 ms_handle_connect on 192.168.20.10:6801/3918
2011-04-21 14:53:38.331204 7fad87f50700 mds0.342 ms_handle_connect on 192.168.20.11:6801/3803
2011-04-21 14:53:38.373929 7fad87f50700 mds0.cache creating system inode with ino:100
2011-04-21 14:53:38.374091 7fad87f50700 mds0.cache creating system inode with ino:1
*** Caught signal (Segmentation fault) **
 in thread 0x7f93dabdb700
 ceph version (commit:)
 1: /usr/bin/cmds() [0x73b691]
 2: (()+0xfc60) [0x7f93df9a3c60]
 3: (ESession::replay(MDS*)+0x6c6) [0x4ecdd6]
 4: (MDLog::_replay_thread()+0x10a1) [0x6935e1]
 5: (MDLog::ReplayThread::entry()+0xd) [0x4d86cd]
 6: (()+0x6d8c) [0x7f93df99ad8c]
 7: (clone()+0x6d) [0x7f93de5e804d]
I just upgraded to ceph v0.26 from ceph 0.24.3.
All OSDs and all MONs are running fine.
What gdb has to say:
(gdb) bt
#0  0x00007fad8a60db3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000000000073b8a4 in handle_fatal_signal (signum=11) at common/signal.cc:78
#2  <signal handler called>
#3  ESession::replay (this=0x2a54900, mds=0x2a57a00) at mds/journal.cc:711
#4  0x00000000006935e1 in MDLog::_replay_thread (this=0x2a5a300) at mds/MDLog.cc:556
#5  0x00000000004d86cd in MDLog::ReplayThread::entry (this=<value optimized out>) at mds/MDLog.h:86
#6  0x00007fad8a604d8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007fad8925204d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#8  0x0000000000000000 in ?? ()
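Frame #3 points at ESession::replay in mds/journal.cc:711, the branch that replays a session-close event. As a rough sketch (reconstructed from the context lines of the patch posted further down, not the exact v0.26 source), the crash boils down to:

// Close-event path of ESession::replay (sketch, before the fix below).
Session *session = mds->sessionmap.get_session(client_inst.name);
// During replay the session may already be gone, so get_session() can return NULL;
// the next line then dereferences that NULL pointer -- the SIGSEGV in frame #3.
if (session->connection == NULL) {
  mds->sessionmap.remove_session(session);
}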
Updated by Sage Weil about 13 years ago
- Assignee set to Sage Weil
- Target version set to v0.27
This looks like damage from a bug in the session journaling. Can you dump a copy of your journal so we can take a closer look? Run cmds -h to see the syntax. You'll want to tar+bzip the dump before passing it along.
In the meantime, you should be able to recover your journal if you apply the following patch:
diff --git a/src/mds/journal.cc b/src/mds/journal.cc
index 99fafdc..681dd2b 100644
--- a/src/mds/journal.cc
+++ b/src/mds/journal.cc
@@ -708,6 +708,7 @@ void ESession::replay(MDS *mds)
     dout(10) << " opened session " << session->inst << dendl;
   } else {
     session = mds->sessionmap.get_session(client_inst.name);
+    if (session) {
     if (session->connection == NULL) {
       mds->sessionmap.remove_session(session);
       dout(10) << " removed session " << session->inst << dendl;
@@ -715,6 +716,7 @@ void ESession::replay(MDS *mds)
       session->clear();  // the client has reconnected; keep the Session, but reset
       dout(10) << " reset session " << session->inst << " (they reconnected)" << dendl;
     }
+    }
   }
 }
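With the added if (session) guard, a close event whose session is already absent from the SessionMap is simply skipped during replay instead of dereferencing a NULL pointer. A rough sketch of how the else branch reads once the patch is applied (indentation and surrounding lines approximated from the diff context, not the exact source):

} else {
  session = mds->sessionmap.get_session(client_inst.name);
  if (session) {                        // new guard: the session may already be gone during replay
    if (session->connection == NULL) {
      mds->sessionmap.remove_session(session);   // no live connection: drop the stale session
      dout(10) << " removed session " << session->inst << dendl;
    } else {
      session->clear();                 // the client has reconnected; keep the Session, but reset it
      dout(10) << " reset session " << session->inst << " (they reconnected)" << dendl;
    }
  }
}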
Updated by Sage Weil about 13 years ago
- Target version changed from v0.27 to v0.27.1
Updated by Sage Weil almost 13 years ago
- Status changed from New to Can't reproduce
I think the trail is cold on this one. Let's keep an eye out for this in case it comes up again.
Updated by John Spray over 7 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
- Target version deleted (v0.27.1)
Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.