Bug #733

closed

cmds crash: mds/LogEvent.cc:88: FAILED assert(p.end())

Added by Ravi Pinjala about 13 years ago. Updated almost 8 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When I start cmds, I get this crash:

mds/LogEvent.cc: In function 'static LogEvent* LogEvent::decode(ceph::bufferlist&)':
mds/LogEvent.cc:88: FAILED assert(p.end())
ceph version 0.24.1.1 (commit:785bf0fcbfb69efa8dd97340c8ee0079bb5ad55e)
1: (LogEvent::decode(ceph::buffer::list&)+0x2fb) [0x82ccaab]
2: (MDLog::_replay_thread()+0x728) [0x82ad598]
3: (MDLog::ReplayThread::entry()+0x14) [0x80eddd4]
4: (Thread::_entry_func(void*)+0x11) [0x80c8781]
5: (()+0x5cc9) [0xb76cecc9]
6: (clone()+0x5e) [0xb70cc69e]

It's possible that the log is corrupt, since the last bug I had in cmds also resulted in crashes while the mds was starting up.

I'm using git revision 785bf0fcbfb69efa8dd97340c8ee0079bb5ad55e (latest in the testing branch).


Files

mds.alpha.log (7.77 KB) mds.alpha.log Eric Dold, 05/23/2012 10:55 AM
mds.alpha.anon.log.tar.xz (8.04 MB) mds.alpha.anon.log.tar.xz Eric Dold, 05/24/2012 05:43 AM
mds.alpha.anon.log.tar.xz (8.18 MB) mds.alpha.anon.log.tar.xz Eric Dold, 06/05/2012 10:17 AM
#1

Updated by Sage Weil about 13 years ago

Can you restart the mds with 'debug mds = 20' so we can see what events are getting replayed and which decode is failing?
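In ceph.conf terms, that would be something like the following fragment (the `[mds]` section name matches the config Eric posts in a later comment; this is a sketch, not the full file):

```ini
[mds]
debug mds = 20
```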

#2

Updated by Ravi Pinjala about 13 years ago

Odd, I can't repro this anymore. It was either fixed by some change between 785bf0fcbfb69efa8dd97340c8ee0079bb5ad55e and 0.24.2, or else it was crashing because some of my pgs were in an inconsistent state, which I learned how to fix the other day.

Leaving this open in case "pgs in an inconsistent state" is somehow enough to figure out what the problem is, but feel free to close this bug now.

#3

Updated by Sage Weil about 13 years ago

  • Status changed from New to Can't reproduce

Hmm not enough to go on I'm afraid. If you see this again please let us know!

#4

Updated by Eric Dold almost 12 years ago

I get the same with v0.47.1:

0> 2012-05-23 19:50:20.105956 7f7c87482700 -1 mds/LogEvent.cc: In function 'static LogEvent* LogEvent::decode(ceph::bufferlist&)' thread 7f7c87482700 time 2012-05-23 19:50:20.105525
mds/LogEvent.cc:95: FAILED assert(p.end())
ceph version 0.47.1 (commit:f5a9404445e2ed5ec2ee828aa53d73d4a002f7a5)
1: (LogEvent::decode(ceph::buffer::list&)+0x29d) [0x6b530d]
2: (MDLog::_replay_thread()+0x668) [0x6a1e68]
3: (MDLog::ReplayThread::entry()+0xd) [0x4d5c1d]
4: (()+0x8ec6) [0x7f7c8e7edec6]
5: (clone()+0x6d) [0x7f7c8d69d51d]
#5

Updated by Eric Dold almost 12 years ago

Here is a backtrace:

Core was generated by `/usr/bin/ceph-mds -i alpha --pid-file /var/run/ceph/mds.alpha.pid -c /etc/ceph/'.
Program terminated with signal 6, Aborted.
#0 0x00007fc20cbe2a9b in raise () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007fc20cbe2a9b in raise () from /lib64/libpthread.so.0
#1 0x00000000007e127c in reraise_fatal (signum=6) at global/signal_handler.cc:58
#2 handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3 <signal handler called>
#4 0x00007fc20b9d9a95 in raise () from /lib64/libc.so.6
#5 0x00007fc20b9daf0b in abort () from /lib64/libc.so.6
#6 0x00007fc20c306bed in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.3/libstdc++.so.6
#7 0x00007fc20c304da6 in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.3/libstdc++.so.6
#8 0x00007fc20c304dd3 in std::terminate() () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.3/libstdc++.so.6
#9 0x00007fc20c304ece in __cxa_throw () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.3/libstdc++.so.6
#10 0x000000000077827f in ceph::__ceph_assert_fail (assertion=0x7ff8b3 "p.end()", file=<optimized out>, line=95, func=0x811540 "static LogEvent* LogEvent::decode(ceph::bufferlist&)")
    at common/assert.cc:77
#11 0x00000000006b530d in LogEvent::decode (bl=...) at mds/LogEvent.cc:95
#12 0x00000000006a1e68 in MDLog::_replay_thread (this=0x268e300) at mds/MDLog.cc:547
#13 0x00000000004d5c1d in MDLog::ReplayThread::entry (this=<optimized out>) at mds/MDLog.h:86
#14 0x00007fc20cbdaec6 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fc20ba8a51d in clone () from /lib64/libc.so.6
#6

Updated by Greg Farnum almost 12 years ago

We'll need a detailed log (and possibly access to the data that's causing the crash) to diagnose this. Can you turn on:
debug ms = 1
debug mds = 20
and reproduce?

#7

Updated by Eric Dold almost 12 years ago

Here you go: a log with ms = 1 and mds = 20.
Dirs and files are replaced with 'o's.

#8

Updated by Greg Farnum almost 12 years ago

Aww, the actual debug line that's interesting here is generic_dout().
Can you do it again, this time adding "debug = 20" as well? That will specify what kind of event is being decoded (and failing), which will hopefully make it pretty easy to figure out if it's an easy code problem. If it's not, probably the data got corrupted somehow, which will take more effort to track down...

#9

Updated by Eric Dold almost 12 years ago

OK, here is a logfile with the following config:

[mds]
debug = 20
debug ms = 1
debug mds = 20
mds bal frag = true

just one mds is turned on.

#10

Updated by Greg Farnum about 11 years ago

  • Project changed from Ceph to CephFS
  • Category changed from 1 to 47

This is at least the same crash as #4061, although it'd be nice to get one of these with logging on the causing end instead of the replay end... :/

#11

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added