Bug #1022

closed

Every MDS crashes: Program terminated with signal 11, Segmentation fault.

Added by ar Fred about 13 years ago. Updated over 7 years ago.

Status: Can't reproduce
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%

Description

A few seconds after startup, all my MDSes crash with the following message:

ceph version .commit: . process: cmds. pid: 6614
2011-04-21 14:53:04.928342 7fad87f50700 mds-1.0 ms_handle_connect on 192.168.20.9:6789/0
2011-04-21 14:53:09.040809 7fad87f50700 mds-1.0 handle_mds_map standby
2011-04-21 14:53:23.122717 7fad87f50700 mds-1.0 handle_mds_map standby
2011-04-21 14:53:38.194972 7fad87f50700 mds0.342 handle_mds_map i am now mds0.342
2011-04-21 14:53:38.194999 7fad87f50700 mds0.342 handle_mds_map state change up:standby --> up:replay
2011-04-21 14:53:38.195008 7fad87f50700 mds0.342 replay_start
2011-04-21 14:53:38.195026 7fad87f50700 mds0.342  recovery set is 
2011-04-21 14:53:38.195035 7fad87f50700 mds0.342  need osdmap epoch 2013, have 2010
2011-04-21 14:53:38.195077 7fad87f50700 mds0.cache handle_mds_failure mds0 : recovery peers are 
2011-04-21 14:53:38.330797 7fad87f50700 mds0.342 ms_handle_connect on 192.168.20.9:6801/3628
2011-04-21 14:53:38.331087 7fad87f50700 mds0.342 ms_handle_connect on 192.168.20.10:6801/3918
2011-04-21 14:53:38.331204 7fad87f50700 mds0.342 ms_handle_connect on 192.168.20.11:6801/3803
2011-04-21 14:53:38.373929 7fad87f50700 mds0.cache creating system inode with ino:100
2011-04-21 14:53:38.374091 7fad87f50700 mds0.cache creating system inode with ino:1
*** Caught signal (Segmentation fault) **
 in thread 0x7f93dabdb700
 ceph version  (commit:)
 1: /usr/bin/cmds() [0x73b691]
 2: (()+0xfc60) [0x7f93df9a3c60]
 3: (ESession::replay(MDS*)+0x6c6) [0x4ecdd6]
 4: (MDLog::_replay_thread()+0x10a1) [0x6935e1]
 5: (MDLog::ReplayThread::entry()+0xd) [0x4d86cd]
 6: (()+0x6d8c) [0x7f93df99ad8c]
 7: (clone()+0x6d) [0x7f93de5e804d]

I just upgraded from ceph 0.24.3 to ceph v0.26.
All OSDs and all MONs are running fine.

Here is what gdb has to say:

(gdb) bt
#0  0x00007fad8a60db3b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000000000073b8a4 in handle_fatal_signal (signum=11) at common/signal.cc:78
#2  <signal handler called>
#3  ESession::replay (this=0x2a54900, mds=0x2a57a00) at mds/journal.cc:711
#4  0x00000000006935e1 in MDLog::_replay_thread (this=0x2a5a300) at mds/MDLog.cc:556
#5  0x00000000004d86cd in MDLog::ReplayThread::entry (this=<value optimized out>) at mds/MDLog.h:86
#6  0x00007fad8a604d8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007fad8925204d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#8  0x0000000000000000 in ?? ()

Actions #1

Updated by Sage Weil about 13 years ago

  • Assignee set to Sage Weil
  • Target version set to v0.27

This looks like damage from a bug in the session journaling. Can you dump a copy of your journal so we can take a closer look? Run cmds -h to see the syntax. You'll want to tar+bzip it before passing it along.

In the meantime, you should be able to recover your journal if you apply this patch:

diff --git a/src/mds/journal.cc b/src/mds/journal.cc
index 99fafdc..681dd2b 100644
--- a/src/mds/journal.cc
+++ b/src/mds/journal.cc
@@ -708,6 +708,7 @@ void ESession::replay(MDS *mds)
       dout(10) << " opened session " << session->inst << dendl;
     } else {
       session = mds->sessionmap.get_session(client_inst.name);
+      if (session) {
       if (session->connection == NULL) {
        mds->sessionmap.remove_session(session);
        dout(10) << " removed session " << session->inst << dendl;
@@ -715,6 +716,7 @@ void ESession::replay(MDS *mds)
        session->clear();    // the client has reconnected; keep the Session, but reset
        dout(10) << " reset session " << session->inst << " (they reconnected)" << dendl;
       }
+      }
     }
   }
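
The gdb frame at mds/journal.cc:711 appears to line up with the session->connection check in the hunk above, which would fault if get_session() returned NULL for a client the session map no longer knows about. The following is a minimal standalone sketch of that guard, not the actual MDS code: Session, SessionMap, CloseResult, and replay_session_close are hypothetical stand-ins, and the connected flag stands in for the connection pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins for the MDS types; these are NOT the real Ceph classes.
struct Session {
    std::string name;
    bool connected = false;  // stands in for `connection != NULL`
    int cleared = 0;         // counts calls to clear()
    void clear() { ++cleared; }
};

class SessionMap {
public:
    Session* add_session(const std::string& name, bool connected) {
        auto s = std::make_unique<Session>();
        s->name = name;
        s->connected = connected;
        Session* raw = s.get();
        sessions_[name] = std::move(s);
        return raw;
    }

    // Returns nullptr when the name is unknown; this is the case the patch guards.
    Session* get_session(const std::string& name) {
        auto it = sessions_.find(name);
        return it == sessions_.end() ? nullptr : it->second.get();
    }

    void remove_session(Session* s) {
        assert(s != nullptr);
        const std::string key = s->name;  // copy: erase destroys the Session
        sessions_.erase(key);
    }

    std::size_t size() const { return sessions_.size(); }

private:
    std::map<std::string, std::unique_ptr<Session>> sessions_;
};

enum class CloseResult { Missing, Removed, Reset };

// Mirrors the else-branch in the patch: look the session up, and only touch
// it if the lookup succeeded.
CloseResult replay_session_close(SessionMap& map, const std::string& name) {
    Session* session = map.get_session(name);
    if (!session) {
        return CloseResult::Missing;  // without this check, the next line faults
    }
    if (!session->connected) {
        map.remove_session(session);
        return CloseResult::Removed;
    }
    session->clear();  // the client has reconnected; keep the Session, but reset
    return CloseResult::Reset;
}
```

In this sketch, replaying a close event for an unknown client returns CloseResult::Missing and leaves the map untouched, which corresponds to the patched replay skipping the entry instead of crashing.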

Actions #2

Updated by Sage Weil about 13 years ago

  • Target version changed from v0.27 to v0.27.1

Actions #3

Updated by Sage Weil almost 13 years ago

  • Status changed from New to Can't reproduce

I think the trail is cold on this one. Let's keep an eye out for this in case it comes up again.

Actions #4

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.27.1)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
