Bug #1979 (closed): osd: suicide timeout on recovery_tp... heap corruption?

Added by Sage Weil about 12 years ago. Updated about 12 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/var/lib/teuthworker/archive/nightly_coverage_2012-01-24-a/8882

common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)', in thread '7fa2a1bf4700'
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
 ceph version 0.40-217-g1e42109 (commit:1e4210937fa3951881d39425a0469dcadaae61bf)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x3ad) [0x7cdf0d]
 2: (ceph::HeartbeatMap::is_healthy()+0x8f) [0x7cf66f]
 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x7cfa78]
 4: (CephContextServiceThread::entry()+0x77) [0x6b7f77]
 5: (Thread::_entry_func(void*)+0x12) [0x637042]
 6: (()+0x7971) [0x7fa2a3869971]
 7: (clone()+0x6d) [0x7fa2a1ef492d]
 ceph version 0.40-217-g1e42109 (commit:1e4210937fa3951881d39425a0469dcadaae61bf)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x3ad) [0x7cdf0d]
 2: (ceph::HeartbeatMap::is_healthy()+0x8f) [0x7cf66f]
 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x7cfa78]
 4: (CephContextServiceThread::entry()+0x77) [0x6b7f77]
 5: (Thread::_entry_func(void*)+0x12) [0x637042]
 6: (()+0x7971) [0x7fa2a3869971]
 7: (clone()+0x6d) [0x7fa2a1ef492d]
*** Caught signal (Aborted) **
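For context, the "hit suicide timeout" assert comes from the OSD's internal heartbeat machinery: each worker thread periodically touches a heartbeat handle, and a service thread verifies that every handle has been touched within its grace periods; a thread that stays silent past the suicide grace takes the whole daemon down. A minimal sketch of that pattern (illustrative names and fields, not the actual HeartbeatMap implementation):

#include <cassert>
#include <ctime>

// Hypothetical, simplified heartbeat handle; the real
// ceph::heartbeat_handle_d carries more bookkeeping.
struct HeartbeatHandle {
  time_t last_touch;   // last time the worker reported progress
  int grace;           // "unhealthy" threshold, in seconds
  int suicide_grace;   // hard threshold, in seconds
};

// Worker threads call this whenever they make progress.
void touch(HeartbeatHandle *h) {
  h->last_touch = time(nullptr);
}

// A service thread calls this periodically for every registered handle.
// It reports unhealthy if the worker missed its grace period, and aborts
// the process if it missed the suicide grace, mirroring the assert above.
bool check(const HeartbeatHandle *h, time_t now) {
  bool healthy = true;
  if (h->grace && now > h->last_touch + h->grace)
    healthy = false;                       // slow thread: report unhealthy
  if (h->suicide_grace && now > h->last_touch + h->suicide_grace)
    assert(0 == "hit suicide timeout");    // stuck thread: kill the daemon
  return healthy;
}

In this case the recovery thread stopped touching its handle because it was blocked (see the backtrace in the comments below), so the check fired.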

Related issues: 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #1978: osd: FAILED assert(!object_contexts.size()) (Resolved, 01/24/2012)

Actions #1

Updated by Sage Weil about 12 years ago

  • Priority changed from Normal to High
Actions #2

Updated by Sage Weil about 12 years ago

  • Priority changed from High to Urgent
Actions #3

Updated by Josh Durgin about 12 years ago

It looks like there was a use-after-free or heap corruption; the recovery thread was stuck waiting on an invalid object context lock:

(gdb) thread 54
[Switching to thread 54 (Thread 28563)]#0  0x00007fa2a38706b4 in __lll_lock_wait () from /lib/libpthread.so.0
(gdb) bt
#0  0x00007fa2a38706b4 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0x00007fa2a386b849 in _L_lock_953 () from /lib/libpthread.so.0
#2  0x00007fa2a386b66b in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0x00000000004b8df7 in Mutex::Lock (this=0x15133f8, no_lockdep=<value optimized out>) at ./common/Mutex.h:108
#4  0x000000000050458d in ondisk_read_lock (this=0x1b1d600, soid=..., v=...) at osd/ReplicatedPG.h:310
#5  ReplicatedPG::recover_object_replicas (this=0x1b1d600, soid=..., v=...) at osd/ReplicatedPG.cc:5387
#6  0x000000000050582d in ReplicatedPG::recover_replicas (this=0x1b1d600, max=<value optimized out>) at osd/ReplicatedPG.cc:5444
#7  0x0000000000506433 in ReplicatedPG::start_recovery_ops (this=0x1b1d600, max=1) at osd/ReplicatedPG.cc:5154
#8  0x00000000005aa082 in OSD::do_recovery (this=0xeee000, pg=0x1b1d600) at osd/OSD.cc:5008
#9  0x00000000005f7135 in OSD::RecoveryWQ::_process(PG*) ()
#10 0x00000000005cb4a2 in ThreadPool::WorkQueue<PG>::_void_process(void*) ()
#11 0x00000000006abc43 in ThreadPool::worker (this=0xeee540) at common/WorkQueue.cc:54
#12 0x00000000005d4a35 in ThreadPool::WorkThread::entry() ()
#13 0x0000000000637042 in Thread::_entry_func (arg=0x1513408) at common/Thread.cc:41
#14 0x00007fa2a3869971 in start_thread () from /lib/libpthread.so.0
#15 0x00007fa2a1ef492d in clone () from /lib/libc.so.6
#16 0x0000000000000000 in ?? ()
(gdb) frame 3
#3  0x00000000004b8df7 in Mutex::Lock (this=0x15133f8, no_lockdep=<value optimized out>) at ./common/Mutex.h:108
108    ./common/Mutex.h: No such file or directory.
    in ./common/Mutex.h
(gdb) p *this
$1 = {name = 0xcc0000000103 <Address 0xcc0000000103 out of bounds>, id = 94208, recursive = false, lockdep = 204, backtrace = false, _m = {__data = {__lock = 2, __count = 90624, __owner = 90624, 
      __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0xe447a84f1e7c3300, __next = 0x10000001535}}, 
    __size = "\002\000\000\000\000b\001\000\000b\001", '\000' <repeats 14 times>, "3|\036O\250G\344\065\025\000\000\000\001\000", __align = 389227116232706}, nlock = 0}
Actions #4

Updated by Sage Weil about 12 years ago

Ooh, I saw exactly that yesterday. IIRC there was a get_object_context for some object (refcount now one), and then later we waited on the lock for it and it was invalid. This makes me think it was a bad delete of an old pointer.
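To make that suspected sequence concrete, here is a hypothetical sketch of the get/put lifecycle involved (illustrative names, not the actual ReplicatedPG code): if a raw pointer from an earlier get outlives the put that drops the last reference, the context and the mutex embedded in it are freed, and the recovery thread later locks freed memory, consistent with the garbage Mutex contents in the gdb dump above.

#include <cassert>

// Hypothetical stand-in for the refcounted object context; the real one
// also embeds the ondisk_read_lock mutex that shows up corrupted above.
struct ObjCtx {
  int ref;
};

void get_obc(ObjCtx *obc) { obc->ref++; }

void put_obc(ObjCtx *obc) {
  assert(obc->ref > 0);
  if (--obc->ref == 0)
    delete obc;                  // last put frees the context and its lock
}

void suspected_sequence() {
  ObjCtx *obc = new ObjCtx{1};   // refcount now one, as described above

  ObjCtx *stale = obc;           // raw pointer kept, e.g. on a recovery item
  put_obc(obc);                  // drops the last reference; obc is deleted

  // The recovery thread later dereferences 'stale' to take the on-disk
  // read lock: a use-after-free. Locking a freed/overwritten mutex often
  // blocks forever, which is what then trips the suicide timeout.
  (void)stale;
}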

The workload was:

kernel:
  branch: master
interactive-on-error: true

roles:
- - mon.a
  - mds.0
  - osd.0
- - osd.1
- - osd.2
- - osd.3
- - osd.4
- - osd.5
- - client.0
tasks:
- ceph:
    path: /home/sage/src/ceph
    log-whitelist:
    - wrongly marked me down or wrong addr
    conf:
      osd:
        debug osd: 20
        debug filestore: 20
        debug ms: 1
- thrashosds:
    op_delay: 30
    clean_interval: 120
    max_dead: 0
    chance_down: .5
- rados:
    clients: [client.0]
    ops: 2000
    max_seconds: 600
    objects: 500
    op_weights:
      read: 45
      write: 45
      delete: 10
      snap_create: 50
      snap_remove: 50
      snap_rollback: 50

Actions #5

Updated by Sage Weil about 12 years ago

  • Status changed from New to Need More Info

Pushed a patch to include the pointer in get/put object_context to help narrow this down.
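The idea behind that sort of instrumentation (sketched below with hypothetical names, not the actual patch) is to log the context's address together with the resulting refcount on every get/put, so that a put against an already-freed pointer, or get/put calls that don't pair up on the same address, stands out when correlating the OSD log with the crash:

#include <cstdio>

struct ObjCtx { int ref; };

// Hypothetical instrumented get/put: print the pointer and the new
// refcount so stale or mismatched pointers are visible in the debug log.
void get_obc_logged(ObjCtx *obc) {
  obc->ref++;
  fprintf(stderr, "get_object_context %p ref now %d\n", (void*)obc, obc->ref);
}

void put_obc_logged(ObjCtx *obc) {
  fprintf(stderr, "put_object_context %p ref now %d\n", (void*)obc, obc->ref - 1);
  if (--obc->ref == 0)
    delete obc;
}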

Actions #6

Updated by Sage Weil about 12 years ago

  • Subject changed from osd: suicide timeout on recovery_tp to osd: suicide timeout on recovery_tp... heap corruption?
Actions #7

Updated by Samuel Just about 12 years ago

  • Status changed from Need More Info to Duplicate

Same as #1978.
