Bug #297


MDS crash on Objecter::handle_osd_op_reply

Added by Wido den Hollander almost 14 years ago. Updated over 7 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%


Description

While doing an rsync of kernel.org again, both of my MDSes crashed.

root@client02:~# dmesg
[41334.955328] ceph: loaded (mon/mds/osd proto 15/32/24, osdmap 5/5 5/5)
[41334.961379] ceph: client7602 fsid 1eaec0e5-d50e-49bd-f489-b3a719cde54f
[41334.961685] ceph: mon1 [2001:16f8:10:2::c3c3:3f9b]:6789 session established
[42520.051737] ceph: mds0 caps stale
[42535.050866] ceph: mds0 caps stale
[42594.018573] ceph: mds0 [2001:16f8:10:2::c3c3:3f9b]:6800 socket closed
[42595.040191] ceph: mds0 [2001:16f8:10:2::c3c3:3f9b]:6800 connection failed
[42596.040189] ceph: mds0 [2001:16f8:10:2::c3c3:3f9b]:6800 connection failed
[42598.040188] ceph: mds0 [2001:16f8:10:2::c3c3:3f9b]:6800 connection failed
[42603.820893] ceph: mds0 reconnect start
[42608.007911] ceph: mds0 reconnect success
[42661.050012] ceph: mds0 caps stale
[42722.338941] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 socket closed
[42723.040206] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[42724.040178] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[42726.040208] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[42730.050204] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[42738.060194] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[42754.080203] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[42786.080187] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[42850.080204] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[42911.050022] ceph: mds0 hung
[42978.410277] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[43235.690319] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[43748.330270] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[44262.890259] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed
[44776.170282] ceph: mds0 [2001:16f8:10:2::c3c3:2e5c]:6800 connection failed

As you can see, first one MDS went down, and as soon as the client switched over to the second one, that one went down as well.

The bt of mds0:

Core was generated by `/usr/bin/cmds -i 0 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007f2c80c41a75 in raise () from /lib/libc.so.6
(gdb) bt
#0  0x00007f2c80c41a75 in raise () from /lib/libc.so.6
#1  0x00007f2c80c455c0 in abort () from /lib/libc.so.6
#2  0x00007f2c814f68e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3  0x00007f2c814f4d16 in ?? () from /usr/lib/libstdc++.so.6
#4  0x00007f2c814f4d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5  0x00007f2c814f4e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x00000000005d0f84 in CInode::operator new (this=<value optimized out>, bl=<value optimized out>) at mds/CInode.h:66
#7  CDir::_fetched (this=<value optimized out>, bl=<value optimized out>) at mds/CDir.cc:1271
#8  0x000000000063b027 in Objecter::handle_osd_op_reply (this=0xfc5160, m=0x7f2bc9288d00) at osdc/Objecter.cc:550
#9  0x00000000004a1acd in MDS::_dispatch (this=0xfbb310, m=0x7f2bc9288d00) at mds/MDS.cc:1461
#10 0x00000000004a1c71 in MDS::ms_dispatch (this=0xfbb310, m=0x7f2bc9288d00) at mds/MDS.cc:1309
#11 0x000000000047cf59 in Messenger::ms_deliver_dispatch (this=0xfb9db0) at msg/Messenger.h:97
#12 SimpleMessenger::dispatch_entry (this=0xfb9db0) at msg/SimpleMessenger.cc:342
#13 0x0000000000474e4c in SimpleMessenger::DispatchThread::entry (this=0xfba238) at msg/SimpleMessenger.h:534
#14 0x00000000004878fa in Thread::_entry_func (arg=0x200f) at ./common/Thread.h:39
#15 0x00007f2c81ad49ca in start_thread () from /lib/libpthread.so.0
#16 0x00007f2c80cf46cd in clone () from /lib/libc.so.6
#17 0x0000000000000000 in ?? ()

I've uploaded the cores, logs and binaries of both my MDSes to logger.ceph.widodh.nl and placed them in /srv/ceph/issues/multiple_mds_crash_osd_reply

I was running:

root@client02:~# cat /usr/local/sbin/sync-kernel-mirror.sh 
#!/bin/sh

rsync -avr --stats --progress rsync://rsync.eu.kernel.org/pub/* /mnt/ceph/static/kernel/
root@client02:~#

The first sync finished without a problem, but it always goes wrong on a subsequent sync, once rsync starts comparing the local and remote trees.

Running with debug mds = 20 isn't possible, since my 80 GB disk fills up with log files within a short time.
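
For reference, this is roughly how that debug level would be set (a minimal ceph.conf sketch; the log path is only an example and would need far more free space than I have available):

[mds]
        ; verbose MDS logging -- fills the disk quickly, as noted above
        debug mds = 20
        log file = /var/log/ceph/$name.log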

Actions #1

Updated by Wido den Hollander almost 14 years ago

Forgot to note my cluster state:

root@node13:/var/log/ceph# ceph -s
10.07.22_14:49:34.850111    pg v29381: 13808 pgs: 32 creating, 13776 active+clean; 651 GB data, 1310 GB used, 5073 GB / 6383 GB avail
10.07.22_14:49:34.889439   mds e283: 1/1/1 up {0=up:rejoin(laggy or crashed)}, 1 up:standby(laggy or crashed)
10.07.22_14:49:34.889479   osd e1110: 30 osds: 30 up, 30 in
10.07.22_14:49:34.889642   log 10.07.22_14:17:00.672338 mon0 [2001:16f8:10:2::c3c3:3f9b]:6789/0 1661 : [WRN] 
10.07.22_14:49:34.889740   mon e1: 2 mons at [2001:16f8:10:2::c3c3:3f9b]:6789/0 [2001:16f8:10:2::c3c3:2e5c]:6789/0
root@node13:/var/log/ceph#
Actions #2

Updated by Wido den Hollander almost 14 years ago

I've tried restarting the MDSes multiple times; every attempt results in the same crash on both of them.

The core files (with timestamps preserved) and updated logs are all uploaded to logger.ceph.widodh.nl in the previously stated directory.

Actions #3

Updated by Sage Weil almost 14 years ago

  • Status changed from New to Closed

This is just out of memory. Opened up #299 to improve logging.
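
For context on why out of memory shows up as SIGABRT here: frame #6 is CInode::operator new inside CDir::_fetched(), and when that allocation fails it throws std::bad_alloc; nothing on the dispatch path catches it, so the runtime calls std::terminate(), which aborts. A minimal standalone sketch of that failure shape (illustrative names only, not Ceph code):

#include <new>

// Stand-in for the allocation failing inside the dispatch path; in the real
// backtrace this is CInode::operator new called from CDir::_fetched().
void handle_reply() {
    throw std::bad_alloc();
}

int main() {
    // Nothing catches the exception, so the C++ runtime calls
    // std::terminate() -> abort(). That is the SIGABRT (signal 6) and the
    // __cxa_throw / std::terminate / abort frames seen in the gdb backtrace.
    handle_reply();
    return 0;
}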

Actions #4

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
