Bug #3773 (closed)

mds crashed at LogEvent::decode

Added by Tamilarasi muthamizhan over 11 years ago. Updated almost 8 years ago.

Status: Can't reproduce
Priority: High
Category: -
Target version: -
% Done: 0%
Source: Q/A
Component(FS): MDS

Description

ceph version: 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)

I had a cluster [burnupi06, burnupi07, burnupi08] running on ceph version 0.56 and upgraded them one by one to v0.56.1 while pumping IO from the client.

The mds crashed when it was upgraded and restarted.

0> 2013-01-09 11:35:36.314638 7fa65d038700 -1 mds/LogEvent.cc: In function 'static LogEvent* LogEvent::decode(ceph::bufferlist&)' thread 7fa65d038700 time 2013-01-09 11:35:34.715866
mds/LogEvent.cc: 95: FAILED assert(p.end())
ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
1: (LogEvent::decode(ceph::buffer::list&)+0x9ff) [0x6b308f]
2: (MDLog::_replay_thread()+0x2d8) [0x69e678]
3: (MDLog::ReplayThread::entry()+0xd) [0x4c7c5d]
4: (()+0x7e9a) [0x7fa664f2ae9a]
5: (clone()+0x6d) [0x7fa663dcbcbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
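
The assert fires when the decoder finishes an event but the buffer iterator has not reached the end of the serialized entry. A minimal, self-contained sketch of that kind of check, using hypothetical names (SimpleIter, decode_event_sketch) and a made-up [type][len][payload] layout rather than Ceph's actual bufferlist API:

#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a bufferlist iterator; not Ceph's API.
struct SimpleIter {
    const std::vector<uint8_t>& buf;
    size_t off = 0;
    bool end() const { return off == buf.size(); }   // analogous to p.end()
    uint32_t get_u32() {                              // little-endian u32 for the sketch
        uint32_t v;
        std::memcpy(&v, &buf[off], sizeof(v));
        off += sizeof(v);
        return v;
    }
    void skip(size_t n) { off += n; }
};

// Decode one event laid out as [u32 type][u32 len][len payload bytes].
// After the payload is consumed, the iterator must sit exactly at the end of
// the buffer; trailing bytes (e.g. 8 stray bytes) trip the assert, matching
// the FAILED assert(p.end()) above.
void decode_event_sketch(const std::vector<uint8_t>& buf) {
    SimpleIter p{buf};
    uint32_t type = p.get_u32();
    uint32_t len  = p.get_u32();
    (void)type;
    p.skip(len);       // pretend the payload was decoded
    assert(p.end());   // fails if the journal entry carries extra bytes
}

int main() {
    std::vector<uint8_t> good(8 + 16, 0);
    good[4] = 16;                  // len = 16: payload exactly fills the buffer
    decode_event_sketch(good);     // passes

    std::vector<uint8_t> bad(8 + 16 + 8, 0);
    bad[4] = 16;                   // same len, but 8 extra bytes on the end
    decode_event_sketch(bad);      // aborts, like the mds did on replay
    return 0;
}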

ubuntu@burnupi06:~$ sudo cat /etc/ceph/ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
debug ms = 1

[osd]
osd journal size = 1000
filestore xattr use omap = true
debug osd = 20

[osd.1]
host = burnupi06

[osd.2]
host = burnupi06

[osd.3]
host = burnupi07

[osd.4]
host = burnupi07

[osd.5]
host = burnupi08

[osd.6]
host = burnupi08

[mon.a]
host = burnupi06
mon addr = 10.214.133.8:6789

[mon.b]
host = burnupi07
mon addr = 10.214.134.38:6789

[mon.c]
host = burnupi08
mon addr = 10.214.134.36:6789

[mds.a]
host = burnupi08

[client.radosgw.gateway]
host = burnupi06
keyring = /etc/ceph/keyring.radosgw.gateway
rgw socket path = /tmp/radosgw.sock
log file = /var/log/ceph/radosgw.log

I am leaving the cluster state as it is.

Actions #1

Updated by Sage Weil over 11 years ago

  • Project changed from Ceph to CephFS
  • Category changed from 1 to 47
Actions #2

Updated by Greg Farnum over 11 years ago

  • Assignee set to Greg Farnum

I'll at least start this off.

Actions #3

Updated by Ian Colle over 11 years ago

  • Priority changed from Normal to High
Actions #4

Updated by Greg Farnum over 11 years ago

Okay, I gathered up a core file, a high-debug MDS log, and the log with the bad event (and the bad event itself) in the debug_3773 folder on burnupi08. The event has an extra 8 bytes on the end but we don't have any tools right now to analyze it in any pleasant fashion. I want to query Sage about this as it sounds semi-familiar but I'm not sure if we got the cause or not.

Tamil, what client were you using, what version, and what kind of IO were you doing?

Feel free to do whatever you like to the cluster at this point, but please do preserve the data I gathered somewhere. :)

GDB info for posterity in case digging through the core gets too hard:

(gdb) bt
#0  0x00007f1ba6c09b7b in raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1  0x0000000000808bae in reraise_fatal (signum=6) at global/signal_handler.cc:58
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3  <signal handler called>
#4  0x00007f1ba59e5425 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#5  0x00007f1ba59e8b8b in __GI_abort () at abort.c:91
#6  0x00007f1ba633769d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f1ba6335846 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f1ba6335873 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f1ba633596e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x0000000000773cff in ceph::__ceph_assert_fail (assertion=0x827413 "p.end()", file=<optimized out>, line=95, func=0x839180 "static LogEvent* LogEvent::decode(ceph::bufferlist&)")
    at common/assert.cc:77
#11 0x00000000006b308f in LogEvent::decode (bl=...) at mds/LogEvent.cc:95
#12 0x000000000069e678 in MDLog::_replay_thread (this=0x2b6e180) at mds/MDLog.cc:550
#13 0x00000000004c7c5d in MDLog::ReplayThread::entry (this=<optimized out>) at mds/MDLog.h:86
#14 0x00007f1ba6c01e9a in start_thread (arg=0x7f1b9ed0f700) at pthread_create.c:308
#15 0x00007f1ba5aa2cbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#16 0x0000000000000000 in ?? ()
(gdb) f 11
#11 0x00000000006b308f in LogEvent::decode (bl=...) at mds/LogEvent.cc:95
95      mds/LogEvent.cc: No such file or directory.
(gdb) print p
$1 = {bl = 0x7f1b9ed0ec20, ls = 0x7f1b9ed0ec20, off = 1893, p = {_M_node = 0x2b9fbe0}, p_off = 774}
Actions #5

Updated by Greg Farnum over 11 years ago

  • Assignee deleted (Greg Farnum)
Actions #6

Updated by Sage Weil over 11 years ago

  • Status changed from New to Need More Info
  • Assignee set to Tamilarasi muthamizhan

Tamil, I wonder if you can try to reproduce this with mds logging turned up from the start (debug mds = 20, debug ms = 1)? What was the workload?

We're at a loss as to where the bad journal entry came from :(
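
A minimal ceph.conf fragment with the requested logging, reusing the [mds.a] section already shown in the description (the exact section placement is only a suggestion):

[mds.a]
host = burnupi08
debug mds = 20
debug ms = 1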

Actions #7

Updated by Tamilarasi muthamizhan over 11 years ago

Sure, Sage. I was running bonnie from the client during the upgrade.
I had debug ms = 1 set; I will try to reproduce this with mds debug set.

Actions #8

Updated by Tamilarasi muthamizhan over 11 years ago

  • Status changed from Need More Info to Can't reproduce

I have been trying to reproduce this but have not hit it yet.
I will reopen the bug when needed.

Actions #9

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added