Bug #4569

closed

ceph-mds: segfault

Added by Noah Watkins about 11 years ago. Updated about 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I started receiving this segfault in ceph-mds with the latest master today.

Core was generated by `./ceph-mds -i a -c ceph.conf'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f4d50c92b7b in raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
42    ../nptl/sysdeps/unix/sysv/linux/pt-raise.c: No such file or directory.
(gdb) bt
#0  0x00007f4d50c92b7b in raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1  0x0000000000852736 in reraise_fatal (signum=11) at global/signal_handler.cc:58
#2  handle_fatal_signal (signum=11) at global/signal_handler.cc:104
#3  <signal handler called>
#4  Mutex::Lock (this=0x10, no_lockdep=false) at common/Mutex.cc:80
#5  0x00000000007a1ee1 in Locker (m=..., this=<synthetic pointer>) at ./common/Mutex.h:120
#6  get_pipe (this=0x0) at msg/Message.h:211
#7  SimpleMessenger::mark_down (this=0x2a58000, con=0x0) at msg/SimpleMessenger.cc:589
#8  0x000000000054646c in Server::_session_logged (this=0x2a381c0, session=0x2a36b40, state_seq=<optimized out>, open=<optimized out>, pv=363, inos=..., piv=229) at mds/Server.cc:316
#9  0x00000000005888d7 in C_MDS_session_finish::finish (this=<optimized out>, r=<optimized out>) at mds/Server.cc:157
#10 0x00000000004dfc7a in Context::complete (this=0x358b5e0, r=<optimized out>) at ./include/Context.h:41
#11 0x00000000005047b4 in finish_contexts (cct=0x2a38000, finished=..., result=0) at ./include/Context.h:78
#12 0x00000000006e8f62 in Journaler::_finish_flush (this=0x2a80380, r=<optimized out>, start=5923952, stamp=...) at osdc/Journaler.cc:430
#13 0x0000000000704ef8 in Objecter::handle_osd_op_reply (this=0x2a80000, m=0x3595680) at osdc/Objecter.cc:1484
#14 0x00000000004fd987 in MDS::handle_core_message (this=this@entry=0x2a58b00, m=0x3595680) at mds/MDS.cc:1713
#15 0x00000000004fda83 in MDS::_dispatch (this=this@entry=0x2a58b00, m=m@entry=0x3595680) at mds/MDS.cc:1837
#16 0x00000000004ff87b in MDS::ms_dispatch (this=0x2a58b00, m=0x3595680) at mds/MDS.cc:1648
#17 0x000000000082642b in ms_deliver_dispatch (m=0x3595680, this=0x2a58000) at msg/Messenger.h:553
#18 DispatchQueue::entry (this=0x2a580e8) at msg/DispatchQueue.cc:107
#19 0x00000000007a984d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/DispatchQueue.h:85
#20 0x00007f4d50c8ae9a in start_thread (arg=0x7f4d4bb50700) at pthread_create.c:308
#21 0x00007f4d4f5e1cbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#22 0x0000000000000000 in ?? ()
(gdb)

Files

mds.a.log2 (2.15 MB), mds log. Noah Watkins, 03/27/2013 11:45 AM
gdb.txt (76.3 KB), thread apply all bt. Noah Watkins, 03/27/2013 11:45 AM
Actions #1

Updated by Sam Lang about 11 years ago

It looks like the session is getting closed because it's stale, and then killed, but the session->connection field passed to SimpleMessenger::mark_down() is NULL. I'm not seeing how that can be set to NULL anywhere in the current mds code...

Actions #2

Updated by Ian Colle about 11 years ago

  • Assignee set to Greg Farnum
  • Priority changed from Normal to Urgent
Actions #3

Updated by Greg Farnum about 11 years ago

In the logs the session in question is one that failed to reconnect. Was there a different event that caused the MDS to need to reconnect earlier?

Actions #4

Updated by Greg Farnum about 11 years ago

Yep, the problem here is that the Session was created during replay and it never had a Connection associated with it (see in the log how the only action on that client is the MDS giving up on it during replay, and then closing it because it's stale). Sage had a patch I reviewed a couple weeks ago that switched this mark down from using the address to using the Connection*. The addr-based mark_down is explicitly a no-op if there's no Connection, but if we're passing in NULL to the Connection*-based mark_down it tries to deref and fails horribly.

Looking at it to try and figure out where responsibility for handling this should fall now.
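
For illustration, here is a minimal sketch of the failure mode described above. It is consistent with frames #4 to #7 of the backtrace, where get_pipe runs with this=0x0 and the Mutex being locked sits at this=0x10, i.e. a member of a null object. All names below (Connection, Session, ToyMessenger, mark_down_addr, mark_down_con) are hypothetical stand-ins and not the real Ceph classes. Putting the NULL guard inside the pointer-based call is only one possible placement, assumed here for the example; it is not meant to describe what the eventual fix does.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for a messenger connection (not the Ceph class).
struct Connection {
    std::mutex lock;
    bool down = false;
};

// Hypothetical stand-in for an MDS client session.
struct Session {
    std::string addr;
    Connection* connection = nullptr;  // stays null if created during replay
};

class ToyMessenger {
public:
    std::map<std::string, Connection*> by_addr;

    // Address-based variant: a lookup miss makes this a natural no-op.
    bool mark_down_addr(const std::string& addr) {
        auto it = by_addr.find(addr);
        if (it == by_addr.end()) return false;  // no connection, nothing to do
        return mark_down_con(it->second);
    }

    // Pointer-based variant. Without the guard, locking con->lock on a
    // null pointer is undefined behaviour and typically a segfault.
    bool mark_down_con(Connection* con) {
        if (con == nullptr) return false;  // assumed guard, for illustration
        std::lock_guard<std::mutex> l(con->lock);
        con->down = true;
        return true;
    }
};

// Exercises both variants with a session that never had a connection.
void demo() {
    ToyMessenger msgr;
    Session replayed;  // connection is nullptr
    assert(!msgr.mark_down_addr(replayed.addr));
    assert(!msgr.mark_down_con(replayed.connection));
}
```

The alternative is to leave the pointer-based call unguarded and have the caller check session.connection before calling it; which side should own that check is the open question above.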

Actions #5

Updated by Noah Watkins about 11 years ago

In case it matters at all, the segfault was happening when I was furiously sigterm'n my hung-on-unlink client.

Actions #6

Updated by Greg Farnum about 11 years ago

  • Status changed from New to Resolved

commit:4f8ba0e7756a1b0647867db0e9b5549b3e82f6b1 in master. This wasn't a bug in any released versions, so no backports.

Actions #7

Updated by Sam Lang about 11 years ago

It looks like this fix didn't make it into 0.60. See #4696.
