Bug #3773 (closed)

mds crashed at LogEvent::decode

Added by Tamilarasi muthamizhan over 11 years ago. Updated almost 8 years ago.

Status: Can't reproduce
Priority: High
Category: -
Target version: -
% Done: 0%
Source: Q/A
Component(FS): MDS

Description

ceph version: 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)

I had a cluster [burnupi06, burnupi07, burnupi08] running on ceph version 0.56 and upgraded them one by one to v0.56.1 while pumping IO from the client.

The mds crashed when it was upgraded and restarted.

0> 2013-01-09 11:35:36.314638 7fa65d038700 -1 mds/LogEvent.cc: In function 'static LogEvent* LogEvent::decode(ceph::bufferlist&)' thread 7fa65d038700 time 2013-01-09 11:35:34.715866
mds/LogEvent.cc: 95: FAILED assert(p.end())
ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
1: (LogEvent::decode(ceph::buffer::list&)+0x9ff) [0x6b308f]
2: (MDLog::_replay_thread()+0x2d8) [0x69e678]
3: (MDLog::ReplayThread::entry()+0xd) [0x4c7c5d]
4: (()+0x7e9a) [0x7fa664f2ae9a]
5: (clone()+0x6d) [0x7fa663dcbcbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
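
The assert fires when the decoder finishes an event but the buffer iterator has not reached the end of the serialized entry. A minimal, self-contained sketch of that kind of check, using hypothetical names (SimpleIter, decode_event_sketch) and a made-up [type][len][payload] layout rather than Ceph's actual bufferlist API:

#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a bufferlist iterator; not Ceph's API.
struct SimpleIter {
    const std::vector<uint8_t>& buf;
    size_t off = 0;
    bool end() const { return off == buf.size(); }   // analogous to p.end()
    uint32_t get_u32() {                              // little-endian u32 for the sketch
        uint32_t v;
        std::memcpy(&v, &buf[off], sizeof(v));
        off += sizeof(v);
        return v;
    }
    void skip(size_t n) { off += n; }
};

// Decode one event laid out as [u32 type][u32 len][len payload bytes].
// After the payload is consumed, the iterator must sit exactly at the end of
// the buffer; trailing bytes (e.g. 8 stray bytes) trip the assert, matching
// the FAILED assert(p.end()) above.
void decode_event_sketch(const std::vector<uint8_t>& buf) {
    SimpleIter p{buf};
    uint32_t type = p.get_u32();
    uint32_t len  = p.get_u32();
    (void)type;
    p.skip(len);       // pretend the payload was decoded
    assert(p.end());   // fails if the journal entry carries extra bytes
}

int main() {
    std::vector<uint8_t> good(8 + 16, 0);
    good[4] = 16;                  // len = 16: payload exactly fills the buffer
    decode_event_sketch(good);     // passes

    std::vector<uint8_t> bad(8 + 16 + 8, 0);
    bad[4] = 16;                   // same len, but 8 extra bytes on the end
    decode_event_sketch(bad);      // aborts, like the mds did on replay
    return 0;
}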

ubuntu@burnupi06:~$ sudo cat /etc/ceph/ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
debug ms = 1

[osd]
osd journal size = 1000
filestore xattr use omap = true
debug osd = 20

[osd.1]
host = burnupi06

[osd.2]
host = burnupi06

[osd.3]
host = burnupi07

[osd.4]
host = burnupi07

[osd.5]
host = burnupi08

[osd.6]
host = burnupi08

[mon.a]
host = burnupi06
mon addr = 10.214.133.8:6789

[mon.b]
host = burnupi07
mon addr = 10.214.134.38:6789

[mon.c]
host = burnupi08
mon addr = 10.214.134.36:6789

[mds.a]
host = burnupi08

[client.radosgw.gateway]
host = burnupi06
keyring = /etc/ceph/keyring.radosgw.gateway
rgw socket path = /tmp/radosgw.sock
log file = /var/log/ceph/radosgw.log

I am leaving the cluster state as it is.

Actions #1

Updated by Sage Weil over 11 years ago

  • Project changed from Ceph to CephFS
  • Category changed from 1 to 47
Actions #2

Updated by Greg Farnum over 11 years ago

  • Assignee set to Greg Farnum

I'll at least start this off.

Actions #3

Updated by Ian Colle over 11 years ago

  • Priority changed from Normal to High
Actions #4

Updated by Greg Farnum over 11 years ago

Okay, I gathered up a core file, a high-debug MDS log, and the log with the bad event (and the bad event itself) in the debug_3773 folder on burnupi08. The event has an extra 8 bytes on the end but we don't have any tools right now to analyze it in any pleasant fashion. I want to query Sage about this as it sounds semi-familiar but I'm not sure if we got the cause or not.

Tamil, what client were you using, what version, and what kind of IO were you doing?

Feel free to do whatever you like to the cluster at this point, but please do preserve the data I gathered somewhere. :)

GDB info for posterity in case digging through the core gets too hard:

(gdb) bt
#0  0x00007f1ba6c09b7b in raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1  0x0000000000808bae in reraise_fatal (signum=6) at global/signal_handler.cc:58
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3  <signal handler called>
#4  0x00007f1ba59e5425 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#5  0x00007f1ba59e8b8b in __GI_abort () at abort.c:91
#6  0x00007f1ba633769d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f1ba6335846 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f1ba6335873 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f1ba633596e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x0000000000773cff in ceph::__ceph_assert_fail (assertion=0x827413 "p.end()", file=<optimized out>, line=95, func=0x839180 "static LogEvent* LogEvent::decode(ceph::bufferlist&)")
    at common/assert.cc:77
#11 0x00000000006b308f in LogEvent::decode (bl=...) at mds/LogEvent.cc:95
#12 0x000000000069e678 in MDLog::_replay_thread (this=0x2b6e180) at mds/MDLog.cc:550
#13 0x00000000004c7c5d in MDLog::ReplayThread::entry (this=<optimized out>) at mds/MDLog.h:86
#14 0x00007f1ba6c01e9a in start_thread (arg=0x7f1b9ed0f700) at pthread_create.c:308
#15 0x00007f1ba5aa2cbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#16 0x0000000000000000 in ?? ()
(gdb) f 11
#11 0x00000000006b308f in LogEvent::decode (bl=...) at mds/LogEvent.cc:95
95      mds/LogEvent.cc: No such file or directory.
(gdb) print p
$1 = {bl = 0x7f1b9ed0ec20, ls = 0x7f1b9ed0ec20, off = 1893, p = {_M_node = 0x2b9fbe0}, p_off = 774}
Actions #5

Updated by Greg Farnum over 11 years ago

  • Assignee deleted (Greg Farnum)
Actions #6

Updated by Sage Weil over 11 years ago

  • Status changed from New to Need More Info
  • Assignee set to Tamilarasi muthamizhan

Tamil, I wonder if you can try to reproduce this with mds logging turned up from the start (debug mds = 20, debug ms = 1)? What was the workload?

We're at a loss as to where the bad journal entry came from :(
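
A minimal ceph.conf fragment with the requested logging, reusing the [mds.a] section already shown in the description (the exact section placement is only a suggestion):

[mds.a]
host = burnupi08
debug mds = 20
debug ms = 1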

Actions #7

Updated by Tamilarasi muthamizhan over 11 years ago

Sure, Sage. I was running bonnie from the client during the upgrade.
I had debug ms = 1 set; I will try to reproduce this with mds debug set.

Actions #8

Updated by Tamilarasi muthamizhan over 11 years ago

  • Status changed from Need More Info to Can't reproduce

I have been trying to reproduce this but have not hit it yet.
I will reopen the bug when needed.

Actions #9

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added