Bug #733

closed

cmds crash: mds/LogEvent.cc:88: FAILED assert(p.end())

Added by Ravi Pinjala about 13 years ago. Updated almost 8 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When I start cmds, I get this crash:

mds/LogEvent.cc: In function 'static LogEvent* LogEvent::decode(ceph::bufferlist&)':
mds/LogEvent.cc:88: FAILED assert(p.end())
ceph version 0.24.1.1 (commit:785bf0fcbfb69efa8dd97340c8ee0079bb5ad55e)
1: (LogEvent::decode(ceph::buffer::list&)+0x2fb) [0x82ccaab]
2: (MDLog::_replay_thread()+0x728) [0x82ad598]
3: (MDLog::ReplayThread::entry()+0x14) [0x80eddd4]
4: (Thread::_entry_func(void*)+0x11) [0x80c8781]
5: (()+0x5cc9) [0xb76cecc9]
6: (clone()+0x5e) [0xb70cc69e]

It's possible that the log is corrupt, since the last bug I had in cmds also resulted in crashes while the mds was starting up.

I'm using git revision 785bf0fcbfb69efa8dd97340c8ee0079bb5ad55e (latest in the testing branch).


Files

mds.alpha.log (7.77 KB) mds.alpha.log Eric Dold, 05/23/2012 10:55 AM
mds.alpha.anon.log.tar.xz (8.04 MB) mds.alpha.anon.log.tar.xz Eric Dold, 05/24/2012 05:43 AM
mds.alpha.anon.log.tar.xz (8.18 MB) mds.alpha.anon.log.tar.xz Eric Dold, 06/05/2012 10:17 AM
#1

Updated by Sage Weil about 13 years ago

Can you restart the mds with 'debug mds = 20' so we can see what events are getting replayed and which decode is failing?
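In ceph.conf terms, that would be something like the following fragment (the `[mds]` section name matches the config Eric posts in a later comment; this is a sketch, not the full file):

```ini
[mds]
debug mds = 20
```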

#2

Updated by Ravi Pinjala about 13 years ago

Odd, I can't repro this anymore. It was either fixed by some change between 785bf0fcbfb69efa8dd97340c8ee0079bb5ad55e and 0.24.2, or else it was crashing because some of my pgs were in an inconsistent state, which I learned how to fix the other day.

Leaving this open in case "pgs in an inconsistent state" is somehow enough to figure out what the problem is, but feel free to close this bug now.

#3

Updated by Sage Weil about 13 years ago

  • Status changed from New to Can't reproduce

Hmm not enough to go on I'm afraid. If you see this again please let us know!

#4

Updated by Eric Dold almost 12 years ago

I get the same with v0.47.1:

0> 2012-05-23 19:50:20.105956 7f7c87482700 -1 mds/LogEvent.cc: In function 'static LogEvent* LogEvent::decode(ceph::bufferlist&)' thread 7f7c87482700 time 2012-05-23 19:50:20.105525
mds/LogEvent.cc:95: FAILED assert(p.end())
ceph version 0.47.1 (commit:f5a9404445e2ed5ec2ee828aa53d73d4a002f7a5)
1: (LogEvent::decode(ceph::buffer::list&)+0x29d) [0x6b530d]
2: (MDLog::_replay_thread()+0x668) [0x6a1e68]
3: (MDLog::ReplayThread::entry()+0xd) [0x4d5c1d]
4: (()+0x8ec6) [0x7f7c8e7edec6]
5: (clone()+0x6d) [0x7f7c8d69d51d]
#5

Updated by Eric Dold almost 12 years ago

Here is a backtrace:

Core was generated by `/usr/bin/ceph-mds -i alpha --pid-file /var/run/ceph/mds.alpha.pid -c /etc/ceph/'.
Program terminated with signal 6, Aborted.
#0 0x00007fc20cbe2a9b in raise () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007fc20cbe2a9b in raise () from /lib64/libpthread.so.0
#1 0x00000000007e127c in reraise_fatal (signum=6) at global/signal_handler.cc:58
#2 handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3 <signal handler called>
#4 0x00007fc20b9d9a95 in raise () from /lib64/libc.so.6
#5 0x00007fc20b9daf0b in abort () from /lib64/libc.so.6
#6 0x00007fc20c306bed in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.3/libstdc++.so.6
#7 0x00007fc20c304da6 in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.3/libstdc++.so.6
#8 0x00007fc20c304dd3 in std::terminate() () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.3/libstdc++.so.6
#9 0x00007fc20c304ece in __cxa_throw () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.3/libstdc++.so.6
#10 0x000000000077827f in ceph::__ceph_assert_fail (assertion=0x7ff8b3 "p.end()", file=<optimized out>, line=95, func=0x811540 "static LogEvent* LogEvent::decode(ceph::bufferlist&)")
    at common/assert.cc:77
#11 0x00000000006b530d in LogEvent::decode (bl=...) at mds/LogEvent.cc:95
#12 0x00000000006a1e68 in MDLog::_replay_thread (this=0x268e300) at mds/MDLog.cc:547
#13 0x00000000004d5c1d in MDLog::ReplayThread::entry (this=<optimized out>) at mds/MDLog.h:86
#14 0x00007fc20cbdaec6 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fc20ba8a51d in clone () from /lib64/libc.so.6
#6

Updated by Greg Farnum almost 12 years ago

We'll need a detailed log (and possibly access to the data that's causing the crash) to diagnose this. Can you turn on:
debug ms = 1
debug mds = 20
and reproduce?

#7

Updated by Eric Dold almost 12 years ago

Here you go: a log with ms = 1 and mds = 20.
Dirs and files are replaced with 'o's.

#8

Updated by Greg Farnum almost 12 years ago

Aww, the actual debug line that's interesting here is generic_dout().
Can you do it again, this time adding "debug = 20" as well? That will specify what kind of event is being decoded (and failing), which will hopefully make it pretty easy to figure out if it's an easy code problem. If it's not, probably the data got corrupted somehow, which will take more effort to track down...

#9

Updated by Eric Dold almost 12 years ago

OK, here is a logfile with the following config:

[mds]
debug = 20
debug ms = 1
debug mds = 20
mds bal frag = true

just one mds is turned on.

#10

Updated by Greg Farnum about 11 years ago

  • Project changed from Ceph to CephFS
  • Category changed from 1 to 47

This is at least the same crash as #4061, although it'd be nice to get one of these with logging on the causing end instead of the replay end... :/

#11

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added