Bug #312 (closed)

MDS crash: LogSegment::try_to_expire(MDS*)

Added by Wido den Hollander almost 14 years ago. Updated over 7 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%

Description

This morning I upgraded my cluster to the latest unstable; afterwards I tried to mount the cluster, which failed.

While mounting I saw that both of my MDSes crashed, with almost the same backtrace:

mds0

Core was generated by `/usr/bin/cmds -i 0 -c /etc/ceph/ceph.conf'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000621374 in LogSegment::try_to_expire(MDS*) ()
(gdb) bt
#0  0x0000000000621374 in LogSegment::try_to_expire(MDS*) ()
#1  0x000000000061b06d in MDLog::try_expire(LogSegment*) ()
#2  0x000000000061bcc0 in MDLog::trim(int) ()
#3  0x000000000049553a in MDS::tick() ()
#4  0x000000000069bfb9 in SafeTimer::EventWrapper::finish(int) ()
#5  0x000000000069e3bc in Timer::timer_entry() ()
#6  0x0000000000474ebd in Timer::TimerThread::entry() ()
#7  0x0000000000487c2a in Thread::_entry_func(void*) ()
#8  0x00007ff8fee5f9ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ff8fe07f6cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb)

mds1

Core was generated by `/usr/bin/cmds -i 1 -c /etc/ceph/ceph.conf'.
Program terminated with signal 11, Segmentation fault.
#0  CDentry::get_dir (this=0x94e9b0, mds=0x1476330) at mds/events/../CDentry.h:200
200    mds/events/../CDentry.h: No such file or directory.
    in mds/events/../CDentry.h
(gdb) bt
#0  CDentry::get_dir (this=0x94e9b0, mds=0x1476330) at mds/events/../CDentry.h:200
#1  LogSegment::try_to_expire (this=0x94e9b0, mds=0x1476330) at mds/journal.cc:105
#2  0x000000000061b06d in MDLog::try_expire (this=0x1475580, ls=0x2689810) at mds/MDLog.cc:363
#3  0x000000000061bcc0 in MDLog::trim (this=0x1475580, m=<value optimized out>) at mds/MDLog.cc:355
#4  0x000000000049553a in MDS::tick (this=0x1476330) at mds/MDS.cc:513
#5  0x000000000069bfb9 in SafeTimer::EventWrapper::finish (this=0x7fadc44bd780, r=0) at common/Timer.cc:295
#6  0x000000000069e3bc in Timer::timer_entry (this=0x1476378) at common/Timer.cc:100
#7  0x0000000000474ebd in Timer::TimerThread::entry (this=<value optimized out>) at ./common/Timer.h:77
#8  0x0000000000487c2a in Thread::_entry_func (arg=0x94e9b0) at ./common/Thread.h:39
#9  0x00007fadcd0eb9ca in start_thread () from /lib/libpthread.so.0
#10 0x00007fadcc30a6fd in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()
(gdb)

For mds1 I raised the log level to 20 to see what the last entries were.

The core files, binaries and log files have been uploaded to logger.ceph.widodh.nl in the directory /srv/ceph/issues/cmds_crash_logsegment_try_to_expire.

The most relevant files are:
  • core.cmds.node13.9718 (last crash of mds0, debug at 20)
  • core.cmds.node14.11525 (last crash of mds1, debug at 20; corresponds to mds.1.log)
  • mds.1.log (with debug at 20)
  • mds.0.log (with debug at 20)

I've preserved the timestamps of the core files (the two listed above), so you can compare them with the log files.

Nothing unusual happened: last night I did a sync of kernel.org, which went fine, and this morning (a few hours later) I upgraded to the latest unstable.
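
For illustration only: the symbolicated mds1 frames (CDentry::get_dir() called from LogSegment::try_to_expire() at mds/journal.cc:105) look like a dereference of a dentry pointer that is null or no longer valid at the time the log segment is expired. The sketch below is a hypothetical, simplified reconstruction of that pattern, not the actual Ceph source; Dentry, Dir and the dirty_dentries list are invented names standing in for whatever the real try_to_expire() iterates over.

// Hypothetical sketch of the suspected failure pattern; NOT the real Ceph MDS code.
// Invented names: Dentry, Dir, LogSegment::dirty_dentries.
#include <cassert>
#include <list>

struct Dir {};

struct Dentry {
  Dir* dir = nullptr;
  // Stand-in for an inlined accessor like CDentry::get_dir() (CDentry.h:200);
  // calling it through a stale or null Dentry* will typically segfault.
  Dir* get_dir() const { return dir; }
};

struct LogSegment {
  std::list<Dentry*> dirty_dentries;  // assumed: dentries pinned by this segment

  void try_to_expire() {
    for (Dentry* dn : dirty_dentries) {
      // A freed or garbage dn here would fault much like frame #0
      // of the mds1 backtrace above.
      assert(dn != nullptr);          // defensive check, for the sketch only
      Dir* dir = dn->get_dir();
      (void)dir;                      // the real code would flush/expire via the dir
    }
  }
};

int main() {
  LogSegment ls;
  Dentry dn;
  ls.dirty_dentries.push_back(&dn);
  ls.try_to_expire();                 // fine as long as every pointer is valid
  return 0;
}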
