Bug #165 (closed): cmds crash

Added by ar Fred almost 14 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

One of my 3 mds daemons crashed quickly after startup of the whole cluster.
This is using the latest unstable (00c3dafd5afe6461fe3f0c09dcf4b5b585a740fc).

Core was generated by `/usr/bin/cmds i r1-10 -c /tmp/fetched.ceph.conf.1155'.
Program terminated with signal 6, Aborted.
#0 0x00007fbcb188da75 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
in ../nptl/sysdeps/unix/sysv/linux/raise.c
(gdb) bt
#0 0x00007fbcb188da75 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007fbcb18915c0 in *__GI_abort () at abort.c:92
#2 0x00007fbcb1886941 in *__GI___assert_fail (assertion=0x6d7d80 "lock->get_state() == 2 || lock->get_state() == 15 || lock->get_state() == 21", file=<value optimized out>, line=3468, function=0x6d85c0 "void Locker::handle_file_lock(ScatterLock*, MLock*)") at assert.c:81
#3 0x0000000000574a0b in Locker::handle_file_lock (this=0xcd0b50, lock=0xce7570, m=0x7fbc9c000a30) at mds/Locker.cc:3468
#4 0x000000000049ef0d in MDS::_dispatch (this=0xcd2da0, m=0x7fbc9c000a30) at mds/MDS.cc:1427
#5 0x000000000049f3dd in MDS::ms_dispatch (this=0xcd2da0, m=0x7fbc9c000a30) at mds/MDS.cc:1281
#6 0x0000000000480309 in Messenger::ms_deliver_dispatch (this=<value optimized out>) at msg/Messenger.h:97
#7 SimpleMessenger::dispatch_entry (this=<value optimized out>) at msg/SimpleMessenger.cc:332
#8 0x0000000000473bac in SimpleMessenger::DispatchThread::entry (this=0xcd6760) at msg/SimpleMessenger.h:494
#9 0x00000000004850ea in Thread::_entry_func (arg=0x4e8) at ./common/Thread.h:39
#10 0x00007fbcb27209ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#11 0x00007fbcb19406cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#12 0x0000000000000000 in ?? ()
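
For context, the abort comes from the assert() in Locker::handle_file_lock(ScatterLock*, MLock*) at mds/Locker.cc:3468 (frames #2 and #3 above). Below is a minimal, self-contained sketch of that kind of state check; ScatterLock here is a stand-in type, and the only details taken from the backtrace are the function shape and the numeric states 2, 15 and 21.

// Sketch only (not Ceph source): the handler asserts the ScatterLock is in one
// of a few expected states when a lock message arrives; any other state trips
// assert(), which raises SIGABRT ("signal 6, Aborted" in the core dump).
#include <cassert>

struct ScatterLock {                 // stand-in for Ceph's ScatterLock
  int state = 0;
  int get_state() const { return state; }
};

static void handle_file_lock(ScatterLock *lock) {
  // The values 2, 15 and 21 are taken from the assertion string above.
  assert(lock->get_state() == 2 ||
         lock->get_state() == 15 ||
         lock->get_state() == 21);
}

int main() {
  ScatterLock lock;
  lock.state = 15;                   // an expected state, so the check passes
  handle_file_lock(&lock);
  return 0;
}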


Files

mds1 (30.1 KB) - log of the crashing mds - ar Fred, 05/31/2010 03:12 AM
mds1.log_2 (188 KB) - reproduced with debug mds = 20 - ar Fred, 06/01/2010 01:32 PM
mds0.log_2 (107 KB) - ar Fred, 06/01/2010 01:54 PM
mds0.log.gz (29.6 KB) - ar Fred, 06/02/2010 01:31 AM
mds1.log.gz (182 KB) - ar Fred, 06/02/2010 01:31 AM
mds2.log.gz (20.5 KB) - This is the log of the crashed mds - ar Fred, 06/02/2010 01:31 AM
Actions #1

Updated by ar Fred almost 14 years ago

Actions #2

Updated by ar Fred almost 14 years ago

A bit later, I restarted the whole cluster; mds0 and mds2 crashed with the same stack trace, and mds1 was fine.

Actions #3

Updated by Sage Weil almost 14 years ago

ar Fred wrote:

A bit later, I restarted the whole cluster; mds0 and mds2 crashed with the same stack trace, and mds1 was fine.

This looks like a subtle lock state inconsistency from the rejoin (mds restart) process. Are those two nodes still down? If you can reproduce this (or a related) crash after adding 'debug mds = 20' to [mds], that would be awesome!
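
For reference (a minimal sketch, assuming the standard ceph.conf layout), the setting mentioned above goes in the [mds] section of the config file used by the mds daemons:

; fragment of ceph.conf -- increase mds logging verbosity
[mds]
        debug mds = 20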

Actions #4

Updated by ar Fred almost 14 years ago

Actions #5

Updated by ar Fred almost 14 years ago

Actions #6

Updated by Sage Weil almost 14 years ago

  • Status changed from New to 7

I pushed a fix to unstable that might fix the root cause of this, but it's hard to say. Can you leave 'debug mds = 20' in there, and restart all mds's, and if this comes back send the whole set of mds logs (full log, gzipped)? Or if the problem is fixed, even better.

Thanks!

Actions #7

Updated by ar Fred almost 14 years ago

Just got the same crash of mds2 using b441fbdc9fdca271ed3bd100fc3c98c800b509b1.

Please find the full logs of each mds attached; mds2.log is the log for the crashed mds.

Actions #8

Updated by Sage Weil almost 14 years ago

This looks an awful lot like it might be fixed by commit:15c6651ff57b88722b5c896f5698bf1d033e1f98, and possibly the previous instances of that crash you saw were caused by that (that's the case that came up for me during testing).

But in the second case (the attached full logs) it was something else, which is now fixed by commit:15c6651ff57b88722b5c896f5698bf1d033e1f98.

Please give the latest unstable a test and let me know!

Actions #9

Updated by ar Fred almost 14 years ago

Indeed, can't reproduce the crash with the latest unstable.

I did 3-4 restarts of all mds and it worked fine; that's the first time in a while I've seen 3 mds up:active.

Thanks!

Actions #10

Updated by Sage Weil almost 14 years ago

  • Status changed from 7 to Resolved
Actions #11

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
