Bug #2033

osd: segfault in OSD::update_heartbeat_peers()

Added by Sage Weil over 9 years ago. Updated over 9 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: OSD
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Just hit this twice, on two different clusters, both under testrados workloads.

*** Caught signal (Segmentation fault) **
 in thread 7fa7e38eb700
2012-02-06 20:02:57.846338 7fa7e78f3700 osd.0 1608 pg[2.0p2( empty n=0 ec=1 les/c 1604/1605 1606/1606/1606) [4,0] r=1 lpr=1608 active] _activate_committed 1606, that was an old interval
 ceph version 0.41-184-g8ded264 (commit:8ded26472058d5205803f244c2f33cb6cb10de79)
 1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x8c8f7b]
 2: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0xa59a51]
 3: (()+0xfb40) [0x7fa7efd7db40]
 4: (std::_Rb_tree_increment(std::_Rb_tree_node_base*)+0x13) [0x7fa7eeba29c3]
 5: (std::_Rb_tree_iterator<std::pair<int const, pg_info_t> >::operator++()+0x1b) [0x87a6a3]
 6: (OSD::update_heartbeat_peers()+0x359) [0x839917]
 7: (OSD::handle_pg_log(OpRequest*)+0x654) [0x85940c]
 8: (OSD::dispatch_op(OpRequest*)+0xab) [0x847979]
 9: (OSD::_dispatch(Message*)+0x9f7) [0x848439]
 10: (OSD::ms_dispatch(Message*)+0x1a3) [0x846a95]
 11: (Messenger::ms_deliver_dispatch(Message*)+0x8b) [0x9b86c7]
 12: (SimpleMessenger::dispatch_entry()+0x7c2) [0x9a1e2c]
 13: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x7636a6]
 14: (Thread::_entry_func(void*)+0x23) [0x8bb58d]
 15: (()+0x7971) [0x7fa7efd75971]
 16: (clone()+0x6d) [0x7fa7ee40092d]

Related issues

Related to Ceph - Cleanup #2049: osd: improve heartbeat peer locking (Resolved)

History

#1 Updated by Sage Weil over 9 years ago

#0  0x00007f75c1dcdf2b in raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1  0x000000000063ffe2 in reraise_fatal (signum=11) at global/signal_handler.cc:59
#2  0x00000000006401ad in handle_fatal_signal (signum=11) at global/signal_handler.cc:109
#3  <signal handler called>
#4  0x00007f75c0bbc804 in std::_Rb_tree_increment(std::_Rb_tree_node_base*) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x0000000000543ab1 in operator++ (this=<synthetic pointer>) at /usr/include/c++/4.6/bits/stl_tree.h:188
#6  OSD::update_heartbeat_peers (this=0x1fef000) at osd/OSD.cc:1469
#7  0x000000000055886c in OSD::handle_pg_log (this=0x1fef000, op=0x1fdfcc0) at osd/OSD.cc:4479
#8  0x0000000000560efd in OSD::_dispatch (this=0x1fef000, m=0x20b3680) at osd/OSD.cc:2956
#9  0x000000000056159c in OSD::ms_dispatch (this=0x1fef000, m=0x20b3680) at osd/OSD.cc:2732
#10 0x00000000005f6be3 in ms_deliver_dispatch (m=0x20b3680, this=0x1fc9200) at msg/Messenger.h:103
#11 SimpleMessenger::dispatch_entry (this=0x1fc9200) at msg/SimpleMessenger.cc:364
#12 0x00000000004b681c in SimpleMessenger::DispatchThread::entry (this=<optimized out>) at msg/SimpleMessenger.h:530
#13 0x00007f75c1dc5efc in start_thread (arg=0x7f75b50e5700) at pthread_create.c:304
#14 0x00007f75c03f689d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#15 0x0000000000000000 in ?? ()

We are not holding pg->lock, but AFAICS we never touch peer_info outside of osd_lock, and we do hold that.
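
Both traces fault in std::_Rb_tree_increment, that is, while advancing a std::map iterator. One general way that can happen is another thread inserting into or erasing from the same map during the walk; whether that is what happened here is not established. As an illustration only, the sketch below shows a locking discipline that rules out that class of fault: every access goes through one mutex, and readers take a private snapshot of the keys instead of iterating the shared tree. PeerTable, PeerInfo, update, remove and snapshot_ids are hypothetical names, not Ceph code, and this is not the fix that was applied.

```cpp
#include <map>
#include <mutex>
#include <set>

// Hypothetical stand-in for per-peer state; this is NOT Ceph's pg_info_t.
struct PeerInfo {
    int last_epoch = 0;
};

// Hypothetical container: every access to peers_ goes through lock_.
class PeerTable {
public:
    // Insert or refresh a peer while holding the lock.
    void update(int osd, int epoch) {
        std::lock_guard<std::mutex> guard(lock_);
        peers_[osd].last_epoch = epoch;
    }

    // Erase a peer while holding the lock.
    void remove(int osd) {
        std::lock_guard<std::mutex> guard(lock_);
        peers_.erase(osd);
    }

    // Copy the keys while holding the lock, so the caller walks a private
    // snapshot and never advances an iterator into the shared tree.
    std::set<int> snapshot_ids() const {
        std::lock_guard<std::mutex> guard(lock_);
        std::set<int> ids;
        for (const auto& entry : peers_) {
            ids.insert(entry.first);
        }
        return ids;
    }

private:
    mutable std::mutex lock_;
    std::map<int, PeerInfo> peers_;
};
```

The snapshot costs a copy per call, but the caller can then do slow work (for example, opening connections to peers) without holding the lock and without caring whether the map changes in the meantime.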

#2 Updated by Sage Weil over 9 years ago

  • Status changed from New to Closed

I'm not totally sure how this happened, but the new heartbeat locking should avoid it.
