Bug #165: cmds crash
Status: Closed
% Done: 0%
Description
One of my 3 mds's crashed shortly after startup of the whole cluster. This is using latest unstable (00c3dafd5afe6461fe3f0c09dcf4b5b585a740fc):
Core was generated by `/usr/bin/cmds -i r1-10 -c /tmp/fetched.ceph.conf.1155'.
Program terminated with signal 6, Aborted.
#0 0x00007fbcb188da75 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
in ../nptl/sysdeps/unix/sysv/linux/raise.c
(gdb) bt
#0 0x00007fbcb188da75 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007fbcb18915c0 in *__GI_abort () at abort.c:92
#2 0x00007fbcb1886941 in *__GI___assert_fail (assertion=0x6d7d80 "lock->get_state() == 2 || lock->get_state() == 15 || lock->get_state() == 21", file=<value optimized out>, line=3468, function=0x6d85c0 "void Locker::handle_file_lock(ScatterLock*, MLock*)") at assert.c:81
#3 0x0000000000574a0b in Locker::handle_file_lock (this=0xcd0b50, lock=0xce7570, m=0x7fbc9c000a30) at mds/Locker.cc:3468
#4 0x000000000049ef0d in MDS::_dispatch (this=0xcd2da0, m=0x7fbc9c000a30) at mds/MDS.cc:1427
#5 0x000000000049f3dd in MDS::ms_dispatch (this=0xcd2da0, m=0x7fbc9c000a30) at mds/MDS.cc:1281
#6 0x0000000000480309 in Messenger::ms_deliver_dispatch (this=<value optimized out>) at msg/Messenger.h:97
#7 SimpleMessenger::dispatch_entry (this=<value optimized out>) at msg/SimpleMessenger.cc:332
#8 0x0000000000473bac in SimpleMessenger::DispatchThread::entry (this=0xcd6760) at msg/SimpleMessenger.h:494
#9 0x00000000004850ea in Thread::_entry_func (arg=0x4e8) at ./common/Thread.h:39
#10 0x00007fbcb27209ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#11 0x00007fbcb19406cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#12 0x0000000000000000 in ?? ()
Files
Updated by ar Fred almost 14 years ago
A bit later, I restarted the whole cluster, mds0 and mds2 crashed with the same stack trace, mds1 was fine.
Updated by Sage Weil almost 14 years ago
ar Fred wrote:
A bit later, I restarted the whole cluster, mds0 and mds2 crashed with the same stack trace, mds1 was fine.
This looks like a subtle lock state inconsistency from the rejoin (mds restart) process. Are those two nodes still down? If you can reproduce this (or a related) crash after adding 'debug mds = 20' to [mds] that would be awesome!
Updated by Sage Weil almost 14 years ago
- Status changed from New to 7
I pushed a fix to unstable that might fix the root cause of this, but it's hard to say. Can you leave 'debug mds = 20' in there, and restart all mds's, and if this comes back send the whole set of mds logs (full log, gzipped)? Or if the problem is fixed, even better.
Thanks!
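For reference, the requested logging change would look roughly like this in the cluster's ceph.conf (a minimal sketch; only the `debug mds = 20` line comes from the comment above, the rest of the file is assumed):

```ini
[mds]
	; verbose MDS debug logging, so lock-state transitions
	; leading up to the crash are captured in the mds logs
	debug mds = 20
```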
Updated by ar Fred almost 14 years ago
- File mds0.log.gz mds0.log.gz added
- File mds1.log.gz mds1.log.gz added
- File mds2.log.gz mds2.log.gz added
Just got the same crash of mds2 using b441fbdc9fdca271ed3bd100fc3c98c800b509b1.
Please find the full logs of each mds attached; mds2.log is the log for the crashed mds.
Updated by Sage Weil almost 14 years ago
This looks an awful lot like it might be fixed by commit:15c6651ff57b88722b5c896f5698bf1d033e1f98. And possibly previous instances of that crash you saw were caused by that (that's the case that came up for me during testing).
But in the second case (the attached full logs) it was something else, which is now fixed by commit:15c6651ff57b88722b5c896f5698bf1d033e1f98.
Please give the latest unstable a test and let me know!
Updated by ar Fred almost 14 years ago
Indeed, I can't reproduce the crash with the latest unstable.
I did 3-4 restarts of all mds's and it worked fine; that's the first time in a while I've seen 3 mds up:active.
Thanks!
Updated by John Spray over 7 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.