Bug #165 (closed): cmds crash

Added by ar Fred almost 14 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

One of my 3 mds daemons crashed quickly after startup of the whole cluster.
This is using the latest unstable (00c3dafd5afe6461fe3f0c09dcf4b5b585a740fc).

Core was generated by `/usr/bin/cmds i r1-10 -c /tmp/fetched.ceph.conf.1155'.
Program terminated with signal 6, Aborted.
#0 0x00007fbcb188da75 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
in ../nptl/sysdeps/unix/sysv/linux/raise.c
(gdb) bt
#0 0x00007fbcb188da75 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007fbcb18915c0 in *__GI_abort () at abort.c:92
#2 0x00007fbcb1886941 in *__GI___assert_fail (assertion=0x6d7d80 "lock->get_state() == 2 || lock->get_state() == 15 || lock->get_state() == 21", file=<value optimized out>, line=3468, function=0x6d85c0 "void Locker::handle_file_lock(ScatterLock*, MLock*)") at assert.c:81
#3 0x0000000000574a0b in Locker::handle_file_lock (this=0xcd0b50, lock=0xce7570, m=0x7fbc9c000a30) at mds/Locker.cc:3468
#4 0x000000000049ef0d in MDS::_dispatch (this=0xcd2da0, m=0x7fbc9c000a30) at mds/MDS.cc:1427
#5 0x000000000049f3dd in MDS::ms_dispatch (this=0xcd2da0, m=0x7fbc9c000a30) at mds/MDS.cc:1281
#6 0x0000000000480309 in Messenger::ms_deliver_dispatch (this=<value optimized out>) at msg/Messenger.h:97
#7 SimpleMessenger::dispatch_entry (this=<value optimized out>) at msg/SimpleMessenger.cc:332
#8 0x0000000000473bac in SimpleMessenger::DispatchThread::entry (this=0xcd6760) at msg/SimpleMessenger.h:494
#9 0x00000000004850ea in Thread::_entry_func (arg=0x4e8) at ./common/Thread.h:39
#10 0x00007fbcb27209ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#11 0x00007fbcb19406cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#12 0x0000000000000000 in ?? ()
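
For context, the abort comes from the assert() in Locker::handle_file_lock(ScatterLock*, MLock*) at mds/Locker.cc:3468 (frames #2 and #3 above). Below is a minimal, self-contained sketch of that kind of state check; ScatterLock here is a stand-in type, and the only details taken from the backtrace are the function shape and the numeric states 2, 15 and 21.

// Sketch only (not Ceph source): the handler asserts the ScatterLock is in one
// of a few expected states when a lock message arrives; any other state trips
// assert(), which raises SIGABRT ("signal 6, Aborted" in the core dump).
#include <cassert>

struct ScatterLock {                 // stand-in for Ceph's ScatterLock
  int state = 0;
  int get_state() const { return state; }
};

static void handle_file_lock(ScatterLock *lock) {
  // The values 2, 15 and 21 are taken from the assertion string above.
  assert(lock->get_state() == 2 ||
         lock->get_state() == 15 ||
         lock->get_state() == 21);
}

int main() {
  ScatterLock lock;
  lock.state = 15;                   // an expected state, so the check passes
  handle_file_lock(&lock);
  return 0;
}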


Files

mds1 (30.1 KB) - log of the crashing mds - ar Fred, 05/31/2010 03:12 AM
mds1.log_2 (188 KB) - reproduced with debug mds = 20 - ar Fred, 06/01/2010 01:32 PM
mds0.log_2 (107 KB) - ar Fred, 06/01/2010 01:54 PM
mds0.log.gz (29.6 KB) - ar Fred, 06/02/2010 01:31 AM
mds1.log.gz (182 KB) - ar Fred, 06/02/2010 01:31 AM
mds2.log.gz (20.5 KB) - This is the log of the crashed mds - ar Fred, 06/02/2010 01:31 AM
Actions #1

Updated by ar Fred almost 14 years ago

Actions #2

Updated by ar Fred almost 14 years ago

A bit later, I restarted the whole cluster; mds0 and mds2 crashed with the same stack trace, and mds1 was fine.

Actions #3

Updated by Sage Weil almost 14 years ago

ar Fred wrote:

A bit later, I restarted the whole cluster; mds0 and mds2 crashed with the same stack trace, and mds1 was fine.

This looks like a subtle lock state inconsistency from the rejoin (mds restart) process. Are those two nodes still down? If you can reproduce this (or a related) crash after adding 'debug mds = 20' to [mds], that would be awesome!
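
For reference (a minimal sketch, assuming the standard ceph.conf layout), the setting mentioned above goes in the [mds] section of the config file used by the mds daemons:

; fragment of ceph.conf -- increase mds logging verbosity
[mds]
        debug mds = 20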

Actions #4

Updated by ar Fred almost 14 years ago

Actions #5

Updated by ar Fred almost 14 years ago

Actions #6

Updated by Sage Weil almost 14 years ago

  • Status changed from New to 7

I pushed a fix to unstable that might fix the root cause of this, but it's hard to say. Can you leave 'debug mds = 20' in there, and restart all mds's, and if this comes back send the whole set of mds logs (full log, gzipped)? Or if the problem is fixed, even better.

Thanks!

Actions #7

Updated by ar Fred almost 14 years ago

Just got the same crash of mds2 using b441fbdc9fdca271ed3bd100fc3c98c800b509b1.

Please find the full logs of each mds attached; mds2.log is the log for the crashed mds.

Actions #8

Updated by Sage Weil almost 14 years ago

This looks an awful lot like it might be fixed by commit:15c6651ff57b88722b5c896f5698bf1d033e1f98, and possibly the previous instances of that crash you saw were caused by that (that's the case that came up for me during testing).

But in the second case (the attached full logs) it was something else, which is now fixed by commit:15c6651ff57b88722b5c896f5698bf1d033e1f98.

Please give the latest unstable a test and let me know!

Actions #9

Updated by ar Fred almost 14 years ago

Indeed, can't reproduce the crash with the latest unstable.

I did 3-4 restarts of all mds and it worked fine; that's the first time in a while I've seen 3 mds up:active.

Thanks!

Actions #10

Updated by Sage Weil almost 14 years ago

  • Status changed from 7 to Resolved
Actions #11

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
