Bug #1979
osd: suicide timeout on recovery_tp... heap corruption?
Status:
Duplicate
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
/var/lib/teuthworker/archive/nightly_coverage_2012-01-24-a/8882
common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)', in thread '7fa2a1bf4700'
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
 ceph version 0.40-217-g1e42109 (commit:1e4210937fa3951881d39425a0469dcadaae61bf)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x3ad) [0x7cdf0d]
 2: (ceph::HeartbeatMap::is_healthy()+0x8f) [0x7cf66f]
 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x7cfa78]
 4: (CephContextServiceThread::entry()+0x77) [0x6b7f77]
 5: (Thread::_entry_func(void*)+0x12) [0x637042]
 6: (()+0x7971) [0x7fa2a3869971]
 7: (clone()+0x6d) [0x7fa2a1ef492d]
 ceph version 0.40-217-g1e42109 (commit:1e4210937fa3951881d39425a0469dcadaae61bf)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x3ad) [0x7cdf0d]
 2: (ceph::HeartbeatMap::is_healthy()+0x8f) [0x7cf66f]
 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x7cfa78]
 4: (CephContextServiceThread::entry()+0x77) [0x6b7f77]
 5: (Thread::_entry_func(void*)+0x12) [0x637042]
 6: (()+0x7971) [0x7fa2a3869971]
 7: (clone()+0x6d) [0x7fa2a1ef492d]
*** Caught signal (Aborted) **
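For context on the assert: the heartbeat map's suicide timeout fires when a worker thread (here one of the recovery_tp threads) stops touching its heartbeat handle for longer than its grace period, and the internal service thread then aborts the whole daemon. A minimal sketch of that idea, with illustrative names rather than the actual common/HeartbeatMap.cc code:

// Illustrative sketch of the suicide-timeout mechanism; names are invented.
#include <cassert>
#include <ctime>

struct heartbeat_handle {
  time_t last_touch;     // updated by the worker thread on each unit of work
  time_t suicide_grace;  // how long the thread may go without touching it
};

// Called periodically by the internal service thread
// (CephContextServiceThread in the backtrace above).
void check(const heartbeat_handle &h, const char *who, time_t now) {
  (void)who;  // would be logged in the real code
  if (h.suicide_grace && now > h.last_touch + h.suicide_grace) {
    // A recovery_tp thread that is stuck (here: blocked on a corrupted
    // mutex) never touches its handle, so the whole OSD aborts.
    assert(0 == "hit suicide timeout");
  }
}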
Related issues
History
#1 Updated by Sage Weil almost 12 years ago
- Priority changed from Normal to High
#2 Updated by Sage Weil almost 12 years ago
- Priority changed from High to Urgent
#3 Updated by Josh Durgin almost 12 years ago
It looks like there was a use-after-free or heap corruption; the recovery thread was stuck waiting on an invalid object context lock:
(gdb) thread 54
[Switching to thread 54 (Thread 28563)]
#0  0x00007fa2a38706b4 in __lll_lock_wait () from /lib/libpthread.so.0
(gdb) bt
#0  0x00007fa2a38706b4 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0x00007fa2a386b849 in _L_lock_953 () from /lib/libpthread.so.0
#2  0x00007fa2a386b66b in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0x00000000004b8df7 in Mutex::Lock (this=0x15133f8, no_lockdep=<value optimized out>) at ./common/Mutex.h:108
#4  0x000000000050458d in ondisk_read_lock (this=0x1b1d600, soid=..., v=...) at osd/ReplicatedPG.h:310
#5  ReplicatedPG::recover_object_replicas (this=0x1b1d600, soid=..., v=...) at osd/ReplicatedPG.cc:5387
#6  0x000000000050582d in ReplicatedPG::recover_replicas (this=0x1b1d600, max=<value optimized out>) at osd/ReplicatedPG.cc:5444
#7  0x0000000000506433 in ReplicatedPG::start_recovery_ops (this=0x1b1d600, max=1) at osd/ReplicatedPG.cc:5154
#8  0x00000000005aa082 in OSD::do_recovery (this=0xeee000, pg=0x1b1d600) at osd/OSD.cc:5008
#9  0x00000000005f7135 in OSD::RecoveryWQ::_process(PG*) ()
#10 0x00000000005cb4a2 in ThreadPool::WorkQueue<PG>::_void_process(void*) ()
#11 0x00000000006abc43 in ThreadPool::worker (this=0xeee540) at common/WorkQueue.cc:54
#12 0x00000000005d4a35 in ThreadPool::WorkThread::entry() ()
#13 0x0000000000637042 in Thread::_entry_func (arg=0x1513408) at common/Thread.cc:41
#14 0x00007fa2a3869971 in start_thread () from /lib/libpthread.so.0
#15 0x00007fa2a1ef492d in clone () from /lib/libc.so.6
#16 0x0000000000000000 in ?? ()
(gdb) frame 3
#3  0x00000000004b8df7 in Mutex::Lock (this=0x15133f8, no_lockdep=<value optimized out>) at ./common/Mutex.h:108
108     ./common/Mutex.h: No such file or directory.
        in ./common/Mutex.h
(gdb) p *this
$1 = {name = 0xcc0000000103 <Address 0xcc0000000103 out of bounds>, id = 94208, recursive = false, lockdep = 204, backtrace = false,
  _m = {__data = {__lock = 2, __count = 90624, __owner = 90624, __nusers = 0, __kind = 0, __spins = 0,
      __list = {__prev = 0xe447a84f1e7c3300, __next = 0x10000001535}},
    __size = "\002\000\000\000\000b\001\000\000b\001", '\000' <repeats 14 times>, "3|\036O\250G\344\065\025\000\000\000\001\000",
    __align = 389227116232706}, nlock = 0}
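The Mutex contents above are clearly garbage (the name pointer is out of bounds and the pthread fields are nonsense), which is what you would expect if the ObjectContext holding it had already been freed and its memory reused. A hypothetical sketch of that failure mode, where ObjectContext and the ondisk lock follow the names in the backtrace but the helpers are invented for illustration:

// Hypothetical illustration of the suspected failure mode: a refcounted
// object context is freed while a raw pointer to it is still held, and the
// recovery path then locks the (now garbage) mutex inside it.
#include <pthread.h>

struct ObjectContext {
  int ref;
  pthread_mutex_t ondisk_lock;
  ObjectContext() : ref(1) { pthread_mutex_init(&ondisk_lock, NULL); }
};

void put_object_context(ObjectContext *obc) {
  if (--obc->ref == 0)
    delete obc;                 // the memory may be reused immediately
}

void recover_object_replicas(ObjectContext *obc) {
  // If obc was freed by an extra put (or a plain bad delete of an old
  // pointer), this locks whatever bytes now live at that address -- the
  // kind of garbage Mutex gdb prints above, and the thread hangs here.
  pthread_mutex_lock(&obc->ondisk_lock);
  // ... read the object and push it to the replicas ...
  pthread_mutex_unlock(&obc->ondisk_lock);
}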
#4 Updated by Sage Weil almost 12 years ago
Ooh, I saw exactly that yesterday. IIRC there was a get_object_context for some object (refcount now one), and then later we waited on the lock for it and it was invalid. This makes me think it was a bad delete of an old pointer...
The workload was:
kernel:
  branch: master
interactive-on-error: true
roles:
- - mon.a
  - mds.0
  - osd.0
- - osd.1
- - osd.2
- - osd.3
- - osd.4
- - osd.5
- - client.0
tasks:
- ceph:
    path: /home/sage/src/ceph
    log-whitelist:
    - wrongly marked me down or wrong addr
    conf:
      osd:
        debug osd: 20
        debug filestore: 20
        debug ms: 1
- thrashosds:
    op_delay: 30
    clean_interval: 120
    max_dead: 0
    chance_down: .5
- rados:
    clients: [client.0]
    ops: 2000
    max_seconds: 600
    objects: 500
    op_weights:
      read: 45
      write: 45
      delete: 10
      snap_create: 50
      snap_remove: 50
      snap_rollback: 50
#5 Updated by Sage Weil almost 12 years ago
- Status changed from New to Need More Info
Pushed a patch to include the pointer in get/put object_context to help narrow this down.
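The patch itself is not quoted in this ticket; the sketch below only illustrates the kind of instrumentation that comment describes (logging the pointer on every get/put so a later crash address can be matched against the context's lifetime), with invented names and output:

// Illustrative only -- not the referenced patch.
#include <iostream>

struct ObjectContext { int ref; ObjectContext() : ref(0) {} };

ObjectContext *get_object_context(ObjectContext *obc) {
  ++obc->ref;
  std::cerr << "get_object_context " << obc << " ref now " << obc->ref << "\n";
  return obc;
}

void put_object_context(ObjectContext *obc) {
  std::cerr << "put_object_context " << obc << " ref now " << obc->ref - 1 << "\n";
  if (--obc->ref == 0)
    delete obc;  // the log line for the final put pins down where the free happened
}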
#6 Updated by Sage Weil almost 12 years ago
- Subject changed from osd: suicide timeout on recovery_tp to osd: suicide timeout on recovery_tp... heap corruption?
#7 Updated by Samuel Just almost 12 years ago
- Status changed from Need More Info to Duplicate
Same as #1978.