Bug #3497

mon: leader segfaults after restarting osds

Added by Joao Eduardo Luis over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: High
Category: Monitor
Target version: -
% Done: 0%
Source: Development

Description

  -135> 2012-11-15 08:01:09.179382 a31a700 -1 *** Caught signal (Segmentation fault) **
 in thread a31a700

 ceph version 0.54-589-gd9bfbc1 (d9bfbc11160bd7b1d659b62238dbd0e4fd0204be)
 1: ./ceph-mon() [0x53d10a]
 2: (()+0xfcb0) [0x4e41cb0]
 3: (SimpleMessenger::_send_message(Message*, Connection*, bool)+0x1d3) [0x5db373]
 4: (Monitor::send_reply(PaxosServiceMessage*, Message*)+0x475) [0x477625]
 5: (OSDMonitor::send_incremental(PaxosServiceMessage*, unsigned int)+0xc6) [0x4b1ad6]
 6: (OSDMonitor::send_latest(PaxosServiceMessage*, unsigned int)+0x79) [0x4bc729]
 7: (OSDMonitor::_booted(MOSDBoot*, bool)+0xd6) [0x4be076]
 8: (Context::complete(int)+0xa) [0x48ee8a]
 9: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x11d) [0x490cbd]
 10: (Paxos::handle_accept(MMonPaxos*)+0x83a) [0x4a4c2a]
 11: (Paxos::dispatch(PaxosServiceMessage*)+0x24b) [0x4a7d0b]
 12: (Monitor::_ms_dispatch(Message*)+0xfb0) [0x48df90]
 13: (Monitor::ms_dispatch(Message*)+0x32) [0x49dac2]
 14: (DispatchQueue::entry()+0x349) [0x642019]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x5dc7ed]
 16: (()+0x7e9a) [0x4e39e9a]
 17: (clone()+0x6d) [0x64494bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Unfortunately, most of the log (everything that didn't fit the terminal's buffer) is unavailable.


Files

ceph-mon.a.log (82.4 KB), Joao Eduardo Luis, 11/15/2012 08:14 AM
Actions #1

Updated by Joao Eduardo Luis over 11 years ago

  • Description updated (diff)
Actions #2

Updated by Joao Eduardo Luis over 11 years ago

I might have jumped the gun with the original description and assumed too much from what I was doing at the time. The segfault appears to be related to restarting the osds; it just happened that I had killed the slurping monitor around that time, but judging from the error message it had nothing to do with that.

Actions #3

Updated by Joao Eduardo Luis over 11 years ago

  • Subject changed from mon: leader segfaults when slurping peon is interrupted to mon: leader segfaults after restarting osds
Actions #4

Updated by Joao Eduardo Luis over 11 years ago

Different paxos machine, but it crashes in the same place, after finishing the contexts. Only happens on wip-mon-leaks-fix afaict, based on testing against next.

 ceph version 0.54-605-g6fce68a (6fce68ae1e5794f0a35813088e8a41729188a9d6)
 1: ./ceph-mon() [0x53d10a]
 2: (()+0xfcb0) [0x4e41cb0]
 3: (SimpleMessenger::_send_message(Message*, Connection*, bool)+0x1d3) [0x5db373]
 4: (Monitor::send_reply(PaxosServiceMessage*, Message*)+0x475) [0x4773c5]
 5: (MDSMonitor::preprocess_beacon(MMDSBeacon*)+0x9ff) [0x4dc2ff]
 6: (MDSMonitor::preprocess_query(PaxosServiceMessage*)+0x271) [0x4debf1]
 7: (PaxosService::dispatch(PaxosServiceMessage*)+0x155) [0x4a9f85]
 8: (Context::complete(int)+0xa) [0x48ecda]
 9: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x11d) [0x490b0d]
 10: (Paxos::handle_accept(MMonPaxos*)+0x864) [0x4a49e4]
 11: (Paxos::dispatch(PaxosServiceMessage*)+0x24b) [0x4a7a9b]
 12: (Monitor::_ms_dispatch(Message*)+0x1030) [0x48dda0]
 13: (Monitor::ms_dispatch(Message*)+0x32) [0x49d852]
 14: (DispatchQueue::entry()+0x349) [0x642029]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x5dc7ed]
 16: (()+0x7e9a) [0x4e39e9a]
 17: (clone()+0x6d) [0x64494bd]   
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #5

Updated by Joao Eduardo Luis over 11 years ago

  • Status changed from New to In Progress

After some testing, git bisect reports 19831b979a13f699b0e87125dfcfad3ea607d713 as the first bad commit.

Attempting a fix.

Actions #6

Updated by Joao Eduardo Luis over 11 years ago

  • Status changed from In Progress to Resolved

Removing said commit fixes the crash.

The patch was putting the Connection back as part of the session cleanup, so removing it leaves room for a connection lingering in memory, and potentially the session as well, which affects completion of #3476.

Marking this as Resolved.
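For illustration only, here is a minimal standalone sketch of the lifetime hazard described above. This is plain C++, not Ceph code; FakeConnection, the lambda queue, and all the names in it are made up. It shows the general shape of the crash in the backtraces: a deferred reply holds only a raw pointer to a refcounted connection, the session cleanup puts that connection back too early, and the callback then touches a dead object when finish_contexts() finally runs it.

 #include <functional>
 #include <iostream>
 #include <list>

 // Hypothetical stand-in for a refcounted Connection; not Ceph code.
 struct FakeConnection {
   int nref = 1;
   bool alive = true;
   void get() { ++nref; }
   void put() {
     if (--nref == 0)
       alive = false;  // the real object would be freed here
   }
   void send_message(const char* what) {
     // The real crash is SimpleMessenger::_send_message() dereferencing
     // a connection that has already been released.
     if (!alive) {
       std::cout << "BUG: reply '" << what << "' sent on a released connection\n";
       return;
     }
     std::cout << "sent: " << what << "\n";
   }
 };

 int main() {
   FakeConnection con;  // reference held by the (hypothetical) session

   // A reply deferred until the paxos round commits keeps only a raw
   // pointer to the connection (the finish_contexts() path in the traces).
   std::list<std::function<void()>> waiting_for_commit;
   waiting_for_commit.push_back([&con] { con.send_message("osdmap reply"); });

   // Session cleanup puts the connection back before the deferred
   // contexts get a chance to run.
   con.put();  // refcount drops to zero; object is effectively gone

   // Paxos::handle_accept() -> finish_contexts(): the deferred reply now
   // uses the dead connection, which is where the monitor segfaults.
   for (auto& fin : waiting_for_commit)
     fin();
   return 0;
 }

In the sketch, the premature con.put() plays the role of the session cleanup introduced by the offending commit; dropping that commit keeps the reference alive until the deferred reply actually goes out.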
