Bug #312

MDS crash: LogSegment::try_to_expire(MDS*)

Added by Wido den Hollander over 13 years ago. Updated over 7 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%

Description

This morning I upgraded my cluster to the latest unstable; afterwards I tried to mount the cluster, which failed.

While mounting I saw that my MDSes had crashed, both with almost the same backtrace:

mds0

Core was generated by `/usr/bin/cmds -i 0 -c /etc/ceph/ceph.conf'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000621374 in LogSegment::try_to_expire(MDS*) ()
(gdb) bt
#0  0x0000000000621374 in LogSegment::try_to_expire(MDS*) ()
#1  0x000000000061b06d in MDLog::try_expire(LogSegment*) ()
#2  0x000000000061bcc0 in MDLog::trim(int) ()
#3  0x000000000049553a in MDS::tick() ()
#4  0x000000000069bfb9 in SafeTimer::EventWrapper::finish(int) ()
#5  0x000000000069e3bc in Timer::timer_entry() ()
#6  0x0000000000474ebd in Timer::TimerThread::entry() ()
#7  0x0000000000487c2a in Thread::_entry_func(void*) ()
#8  0x00007ff8fee5f9ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ff8fe07f6cd in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb)

mds1

Core was generated by `/usr/bin/cmds -i 1 -c /etc/ceph/ceph.conf'.
Program terminated with signal 11, Segmentation fault.
#0  CDentry::get_dir (this=0x94e9b0, mds=0x1476330) at mds/events/../CDentry.h:200
200    mds/events/../CDentry.h: No such file or directory.
    in mds/events/../CDentry.h
(gdb) bt
#0  CDentry::get_dir (this=0x94e9b0, mds=0x1476330) at mds/events/../CDentry.h:200
#1  LogSegment::try_to_expire (this=0x94e9b0, mds=0x1476330) at mds/journal.cc:105
#2  0x000000000061b06d in MDLog::try_expire (this=0x1475580, ls=0x2689810) at mds/MDLog.cc:363
#3  0x000000000061bcc0 in MDLog::trim (this=0x1475580, m=<value optimized out>) at mds/MDLog.cc:355
#4  0x000000000049553a in MDS::tick (this=0x1476330) at mds/MDS.cc:513
#5  0x000000000069bfb9 in SafeTimer::EventWrapper::finish (this=0x7fadc44bd780, r=0) at common/Timer.cc:295
#6  0x000000000069e3bc in Timer::timer_entry (this=0x1476378) at common/Timer.cc:100
#7  0x0000000000474ebd in Timer::TimerThread::entry (this=<value optimized out>) at ./common/Timer.h:77
#8  0x0000000000487c2a in Thread::_entry_func (arg=0x94e9b0) at ./common/Thread.h:39
#9  0x00007fadcd0eb9ca in start_thread () from /lib/libpthread.so.0
#10 0x00007fadcc30a6fd in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()
(gdb)

For mds1 I upped the log level to 20 to see what the last entries were.

The corefiles, binaries and logfiles have been uploaded to logger.ceph.widodh.nl in the directory /srv/ceph/issues/cmds_crash_logsegment_try_to_expire

The most relevant files are:
  • core.cmds.node13.9718 (Last crash of mds0, debug at 20)
  • core.cmds.node14.11525 (Last crash with debug at 20 for mds.1.log)
  • mds.1.log (With debug at 20)
  • mds.0.log (With debug at 20)

I've preserved the timestamps of the corefiles (the two posted above), so you can compare them with the logfiles.

Nothing weird happened: last night I did a sync of kernel.org, which went fine, then this morning (a few hours later) I upgraded to the latest unstable.

History

#1 Updated by Greg Farnum over 13 years ago

  • Status changed from New to In Progress
  • Assignee set to Greg Farnum

#2 Updated by Greg Farnum over 13 years ago

  • Status changed from In Progress to Can't reproduce

Unfortunately we can't solve the root cause of this because we don't have detailed logging from when the inodes in question were created. :(

But commit:05a18e20b79285caaeb326a58d32a6a45d345998 will make it easier to notice when it happens and should prevent the actual replay crash, which was an unnecessary symptom due to lazy handling of the error.

Please re-open if you see any notices about "EMetaBlob.replay FIXME had dentry linked to wrong inode" in the central log (especially if you have a high level of debugging available for the inodes in question)! :)
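
For context, the approach described above amounts to roughly this: during journal replay, compare the inode number the journal says a dentry should link to with the inode currently linked in the cache; if they disagree, log a warning and relink instead of carrying the stale link forward (the stale link being what later blows up in LogSegment::try_to_expire). The snippet below is only a minimal, self-contained sketch of that kind of check; the types and names (CachedInode, CachedDentry, replay_link) are simplified stand-ins for illustration, not the real Ceph MDS classes and not the actual commit.

// Minimal sketch of a warn-and-repair replay check (stand-in types,
// not the real Ceph MDS code). Build with -std=c++14 or later.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

using inodeno_t = uint64_t;

struct CachedInode {
    inodeno_t ino = 0;
};

struct CachedDentry {
    std::string path;
    CachedInode* linked = nullptr;   // inode this dentry currently points at
};

// Hypothetical replay-side check: the journal says `dn` should link to
// `journaled_ino`; if the cached link disagrees, warn and repair instead of
// keeping the stale link (which would crash later during segment expiry).
void replay_link(CachedDentry& dn, inodeno_t journaled_ino,
                 std::unordered_map<inodeno_t, CachedInode>& inode_cache)
{
    CachedInode& target = inode_cache[journaled_ino];  // get-or-create
    target.ino = journaled_ino;

    if (dn.linked && dn.linked->ino != journaled_ino) {
        // The "EMetaBlob.replay FIXME had dentry linked to wrong inode" case.
        std::cerr << "WRN: replay: dentry " << dn.path
                  << " linked to wrong inode 0x" << std::hex << dn.linked->ino
                  << ", should be 0x" << journaled_ino << std::dec << "\n";
    }
    dn.linked = &target;  // relink to what the journal says
}

int main()
{
    std::unordered_map<inodeno_t, CachedInode> cache;
    cache[0x100000c6115].ino = 0x100000c6115;

    // Cached dentry still points at the old inode; the journal disagrees.
    CachedDentry dn{"refs/heads", &cache[0x100000c6115]};
    replay_link(dn, 0x1000017516e, cache);  // prints a warning, then relinks
    return 0;
}

The only point of the sketch is the shape of the handling: a mismatch is downgraded from a silent inconsistency to a logged warning plus a repair of the link.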

#3 Updated by Wido den Hollander over 13 years ago

I just saw this crash again and uploaded the log (mds.0.log.1.gz) and corefile (core.cmds.node13.18006) to logger.ceph.widodh.nl in /srv/ceph/issues/cmds_crash_logsegment_try_to_expire/

I preserved the timestamps of the files, so you can correlate them with the logs.

In my log it was showing:

10.08.02_21:20:06.194577 7faf79bf0710 log [WRN] : EMetaBlob.replay FIXME had dentry linked to wrong inode [dentry #1/static/kernel/scm/linux/kernel/git/hadi/batch-net26.git/refs/heads [4,head] auth (dn sync) (dversion lock) v=2 inode=0x7faf620b9e50 | dirty 0x7faf5fcdea40] [inode 100000c6115 [...2,head] /static/kernel/scm/linux/kernel/git/hadi/batch-net26.git/refs/heads/ auth v654 f(v0 m10.07.17_23:38:13.622505 1=1+0) n(v0 rc10.07.17_23:38:13.622505 b41 2=1+1) (iauth sync) (ilink sync) (idft sync) (isnap sync) (inest sync) (ifile sync) (ixattr sync) (iversion lock) 0x7faf620b9e50] should be 1000017516e
10.08.02_21:20:06.194712 7faf79bf0710 log [WRN] : EMetaBlob.replay FIXME had dentry linked to wrong inode [dentry #1/static/kernel/scm/linux/kernel/git/hadi/batch-net26.git/refs/remotes [4,head] auth (dn sync) (dversion lock) v=4 inode=0x7faf620ba4a0 | inodepin dirty 0x7faf5fcdec10] [inode 100000c6116 [...2,head] /static/kernel/scm/linux/kernel/git/hadi/batch-net26.git/refs/remotes/ auth v656 f(v0 m10.07.17_23:37:49.092259 1=0+1) n(v2 rc10.07.17_23:38:13.702463 b73 4=2+2) (iauth sync) (ilink sync) (idft sync) (isnap sync) (inest sync) (ifile sync) (ixattr sync) (iversion lock) | dirfrag 0x7faf620ba4a0] should be 1000017516f

It crashed sometime last night (21:20); at that moment I tried to mount the filesystem, but it kept failing with "Mount Error = 5 Input/Output Error".

I tried restarting the mons and MDSes a few times to see if that would fix it, but it didn't.

For now, I hope you have enough to go on and fix this issue.

#4 Updated by Wido den Hollander over 13 years ago

I just hit this bug again.

I've uploaded the new logfiles + corefiles to logger.ceph.widodh.nl:/srv/ceph/issues/cmds_crash_logsegment_try_to_expire

The timestamps were preserved when uploading, so you should be able to see exactly what happened. mds0 on node13 crashed first, then a few seconds later mds1 on node14 crashed.

This crash happened again after upgrading to the latest unstable (85814d45de05bcca390c39df020a7934e437c064).

#5 Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
