Bug #297

MDS crash on Objecter::handle_osd_op_reply

Added by Wido den Hollander almost 14 years ago. Updated over 7 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While doing an rsync of kernel.org again, both of my MDSes crashed.

root@client02:~# dmesg
[41334.955328] ceph: loaded (mon/mds/osd proto 15/32/24, osdmap 5/5 5/5)
[41334.961379] ceph: client7602 fsid 1eaec0e5-d50e-49bd-f489-b3a719cde54f
[41334.961685] ceph: mon1 [2001:16f8:10:2::c3c3:3f9b]:6789 session established
[42520.051737] ceph: mds0 caps stale
[42535.050866] ceph: mds0 caps stale
[42594.018573] ceph: mds0 [2001:16f8:10:2::c3c3:3f9b]:6800 socket closed
[42595.040191] ceph: mds0 [2001:16f8:10:2::c3c3:3f9b]:6800 connection failed
[42596.040189] ceph: mds0 [2001:16f8:10:2::c3c3:3f9b]:6800 connection failed
[42598.040188] ceph: mds0 [2001:16f8:10:2::c3c3:3f9b]:6800 connection failed
[42603.820893] ceph: mds0 reconnect start
[42608.007911] ceph: mds0 reconnect success
[42661.050012] ceph: mds0 caps stale
[42722.338941] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 socket closed
[42723.040206] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[42724.040178] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[42726.040208] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[42730.050204] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[42738.060194] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[42754.080203] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[42786.080187] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[42850.080204] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[42911.050022] ceph: mds0 hung
[42978.410277] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[43235.690319] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[43748.330270] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[44262.890259] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[44776.170282] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed

As you can see, one MDS went down first, and as soon as the client switched over to the second one, that one went down as well.

The backtrace of mds0:

Core was generated by `/usr/bin/cmds -i 0 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007f2c80c41a75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007f2c80c41a75 in raise () from /lib/libc.so.6
#1  0x00007f2c80c455c0 in abort () from /lib/libc.so.6
#2  0x00007f2c814f68e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007f2c814f4d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007f2c814f4d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007f2c814f4e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00000000005d0f84 in CInode::operator new (this=<value optimized out>, bl=<value optimized out>) at mds/CInode.h:66
#7  CDir::_fetched (this=<value optimized out>, bl=<value optimized out>) at mds/CDir.cc:1271
#8  0x000000000063b027 in Objecter::handle_osd_op_reply (this=0xfc5160, m=0x7f2bc9288d00) at osdc/Objecter.cc:550
#9  0x00000000004a1acd in MDS::_dispatch (this=0xfbb310, m=0x7f2bc9288d00) at mds/MDS.cc:1461
#10 0x00000000004a1c71 in MDS::ms_dispatch (this=0xfbb310, m=0x7f2bc9288d00) at mds/MDS.cc:1309
#11 0x000000000047cf59 in Messenger::ms_deliver_dispatch (this=0xfb9db0) at msg/Messenger.h:97
#12 SimpleMessenger::dispatch_entry (this=0xfb9db0) at msg/SimpleMessenger.cc:342
#13 0x0000000000474e4c in SimpleMessenger::DispatchThread::entry (this=0xfba238) at msg/SimpleMessenger.h:534
#14 0x00000000004878fa in Thread::_entry_func (arg=0x200f) at ./common/Thread.h:39
#15 0x00007f2c81ad49ca in start_thread () from /lib/libpthread.so.0
#16 0x00007f2c80cf46cd in clone () from /lib/libc.so.6
#17 0x0000000000000000 in ?? ()
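
Frames #2 through #6 suggest the crash is an uncaught C++ exception escaping the dispatch thread, most likely a std::bad_alloc thrown by CInode::operator new while CDir::_fetched was instantiating inodes for the fetched directory. Below is a minimal, self-contained sketch of that failure pattern (not Ceph code; the file name, function names, and allocation size are made up for illustration):

// sketch.cpp -- illustrative only, not Ceph code.
// Assumption from the backtrace: an allocation inside CDir::_fetched failed,
// operator new threw std::bad_alloc, and nothing on the dispatch thread
// catches it, so __cxa_throw finds no handler and the runtime goes through
// std::terminate() -> __verbose_terminate_handler() -> abort().
#include <cstddef>
#include <iostream>
#include <new>
#include <thread>

static void dispatch_entry() {
    // Stand-in for the MDS dispatch thread (frames #12-#14): request far more
    // memory than any allocator can provide, so ::operator new[] throws
    // std::bad_alloc. There is no try/catch anywhere on this call path.
    std::size_t huge = std::size_t(1) << 60;   // absurd size, allocation fails
    char *buf = new char[huge];                // throws std::bad_alloc
    std::cout << static_cast<void *>(buf) << std::endl;  // keep it observable
    delete[] buf;
}

int main() {
    std::thread t(dispatch_entry);   // the exception escapes the thread
    t.join();                        // function, so the process aborts (SIGABRT)
    return 0;
}

Built with g++ -pthread sketch.cpp, this aborts through the same std::terminate()/__verbose_terminate_handler()/abort() chain seen in frames #2 to #5. If that reading is right, it would point at the MDS running out of memory while loading large directories, which seems plausible on a second pass over the kernel.org tree, but that is only a guess from the trace.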

I've uploaded the cores, logs, and binary of both of my MDSes to logger.ceph.widodh.nl and placed them in /srv/ceph/issues/multiple_mds_crash_osd_reply.

I was running:

root@client02:~# cat /usr/local/sbin/sync-kernel-mirror.sh 
#!/bin/sh

rsync -avr --stats --progress rsync://rsync.eu.kernel.org/pub/* /mnt/ceph/static/kernel/
root@client02:~#

The first sync finished without a problem, but it always goes wrong when you sync again, once the comparison between the local and remote trees starts.

Running with debug mds = 20 isn't possible, since my 80 GB disk fills up with log files within a short time.
