Bug #1979 (closed): osd: suicide timeout on recovery_tp... heap corruption?

Added by Sage Weil about 12 years ago. Updated about 12 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/var/lib/teuthworker/archive/nightly_coverage_2012-01-24-a/8882

common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)', in thread '7fa2a1bf4700'
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
 ceph version 0.40-217-g1e42109 (commit:1e4210937fa3951881d39425a0469dcadaae61bf)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x3ad) [0x7cdf0d]
 2: (ceph::HeartbeatMap::is_healthy()+0x8f) [0x7cf66f]
 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x7cfa78]
 4: (CephContextServiceThread::entry()+0x77) [0x6b7f77]
 5: (Thread::_entry_func(void*)+0x12) [0x637042]
 6: (()+0x7971) [0x7fa2a3869971]
 7: (clone()+0x6d) [0x7fa2a1ef492d]
 ceph version 0.40-217-g1e42109 (commit:1e4210937fa3951881d39425a0469dcadaae61bf)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x3ad) [0x7cdf0d]
 2: (ceph::HeartbeatMap::is_healthy()+0x8f) [0x7cf66f]
 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x7cfa78]
 4: (CephContextServiceThread::entry()+0x77) [0x6b7f77]
 5: (Thread::_entry_func(void*)+0x12) [0x637042]
 6: (()+0x7971) [0x7fa2a3869971]
 7: (clone()+0x6d) [0x7fa2a1ef492d]
*** Caught signal (Aborted) **
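For context, the "hit suicide timeout" assert comes from the OSD's internal heartbeat machinery: each worker thread periodically touches a heartbeat handle, and a service thread verifies that every handle has been touched within its grace periods; a thread that stays silent past the suicide grace takes the whole daemon down. A minimal sketch of that pattern (illustrative names and fields, not the actual HeartbeatMap implementation):

#include <cassert>
#include <ctime>

// Hypothetical, simplified heartbeat handle; the real
// ceph::heartbeat_handle_d carries more bookkeeping.
struct HeartbeatHandle {
  time_t last_touch;   // last time the worker reported progress
  int grace;           // "unhealthy" threshold, in seconds
  int suicide_grace;   // hard threshold, in seconds
};

// Worker threads call this whenever they make progress.
void touch(HeartbeatHandle *h) {
  h->last_touch = time(nullptr);
}

// A service thread calls this periodically for every registered handle.
// It reports unhealthy if the worker missed its grace period, and aborts
// the process if it missed the suicide grace, mirroring the assert above.
bool check(const HeartbeatHandle *h, time_t now) {
  bool healthy = true;
  if (h->grace && now > h->last_touch + h->grace)
    healthy = false;                       // slow thread: report unhealthy
  if (h->suicide_grace && now > h->last_touch + h->suicide_grace)
    assert(0 == "hit suicide timeout");    // stuck thread: kill the daemon
  return healthy;
}

In this case the recovery thread stopped touching its handle because it was blocked (see the backtrace in the comments below), so the check fired.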

Related issues: 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #1978: osd: FAILED assert(!object_contexts.size()) (Resolved, 01/24/2012)

Actions #1

Updated by Sage Weil about 12 years ago

  • Priority changed from Normal to High
Actions #2

Updated by Sage Weil about 12 years ago

  • Priority changed from High to Urgent
Actions #3

Updated by Josh Durgin about 12 years ago

It looks like there was a use-after-free or heap corruption; the recovery thread was stuck waiting on an invalid object context lock:

(gdb) thread 54
[Switching to thread 54 (Thread 28563)]#0  0x00007fa2a38706b4 in __lll_lock_wait () from /lib/libpthread.so.0
(gdb) bt
#0  0x00007fa2a38706b4 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0x00007fa2a386b849 in _L_lock_953 () from /lib/libpthread.so.0
#2  0x00007fa2a386b66b in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0x00000000004b8df7 in Mutex::Lock (this=0x15133f8, no_lockdep=<value optimized out>) at ./common/Mutex.h:108
#4  0x000000000050458d in ondisk_read_lock (this=0x1b1d600, soid=..., v=...) at osd/ReplicatedPG.h:310
#5  ReplicatedPG::recover_object_replicas (this=0x1b1d600, soid=..., v=...) at osd/ReplicatedPG.cc:5387
#6  0x000000000050582d in ReplicatedPG::recover_replicas (this=0x1b1d600, max=<value optimized out>) at osd/ReplicatedPG.cc:5444
#7  0x0000000000506433 in ReplicatedPG::start_recovery_ops (this=0x1b1d600, max=1) at osd/ReplicatedPG.cc:5154
#8  0x00000000005aa082 in OSD::do_recovery (this=0xeee000, pg=0x1b1d600) at osd/OSD.cc:5008
#9  0x00000000005f7135 in OSD::RecoveryWQ::_process(PG*) ()
#10 0x00000000005cb4a2 in ThreadPool::WorkQueue<PG>::_void_process(void*) ()
#11 0x00000000006abc43 in ThreadPool::worker (this=0xeee540) at common/WorkQueue.cc:54
#12 0x00000000005d4a35 in ThreadPool::WorkThread::entry() ()
#13 0x0000000000637042 in Thread::_entry_func (arg=0x1513408) at common/Thread.cc:41
#14 0x00007fa2a3869971 in start_thread () from /lib/libpthread.so.0
#15 0x00007fa2a1ef492d in clone () from /lib/libc.so.6
#16 0x0000000000000000 in ?? ()
(gdb) frame 3
#3  0x00000000004b8df7 in Mutex::Lock (this=0x15133f8, no_lockdep=<value optimized out>) at ./common/Mutex.h:108
108    ./common/Mutex.h: No such file or directory.
    in ./common/Mutex.h
(gdb) p *this
$1 = {name = 0xcc0000000103 <Address 0xcc0000000103 out of bounds>, id = 94208, recursive = false, lockdep = 204, backtrace = false, _m = {__data = {__lock = 2, __count = 90624, __owner = 90624, 
      __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0xe447a84f1e7c3300, __next = 0x10000001535}}, 
    __size = "\002\000\000\000\000b\001\000\000b\001", '\000' <repeats 14 times>, "3|\036O\250G\344\065\025\000\000\000\001\000", __align = 389227116232706}, nlock = 0}
Actions #4

Updated by Sage Weil about 12 years ago

Ooh, I saw exactly that yesterday. IIRC there was a get_object_context for some object (refcount now one), and then later we waited on the lock for it and it was invalid. This makes me think it was a bad delete of an old pointer.
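To make that suspected sequence concrete, here is a hypothetical sketch of the get/put lifecycle involved (illustrative names, not the actual ReplicatedPG code): if a raw pointer from an earlier get outlives the put that drops the last reference, the context and the mutex embedded in it are freed, and the recovery thread later locks freed memory, consistent with the garbage Mutex contents in the gdb dump above.

#include <cassert>

// Hypothetical stand-in for the refcounted object context; the real one
// also embeds the ondisk_read_lock mutex that shows up corrupted above.
struct ObjCtx {
  int ref;
};

void get_obc(ObjCtx *obc) { obc->ref++; }

void put_obc(ObjCtx *obc) {
  assert(obc->ref > 0);
  if (--obc->ref == 0)
    delete obc;                  // last put frees the context and its lock
}

void suspected_sequence() {
  ObjCtx *obc = new ObjCtx{1};   // refcount now one, as described above

  ObjCtx *stale = obc;           // raw pointer kept, e.g. on a recovery item
  put_obc(obc);                  // drops the last reference; obc is deleted

  // The recovery thread later dereferences 'stale' to take the on-disk
  // read lock: a use-after-free. Locking a freed/overwritten mutex often
  // blocks forever, which is what then trips the suicide timeout.
  (void)stale;
}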

The workload was:

kernel:
  branch: master
interactive-on-error: true

roles:
- - mon.a
  - mds.0
  - osd.0
- - osd.1
- - osd.2
- - osd.3
- - osd.4
- - osd.5
- - client.0
tasks:
- ceph:
    path: /home/sage/src/ceph
    log-whitelist:
    - wrongly marked me down or wrong addr
    conf:
      osd:
        debug osd: 20
        debug filestore: 20
        debug ms: 1
- thrashosds:
    op_delay: 30
    clean_interval: 120
    max_dead: 0
    chance_down: .5
- rados:
    clients: [client.0]
    ops: 2000
    max_seconds: 600
    objects: 500
    op_weights:
      read: 45
      write: 45
      delete: 10
      snap_create: 50
      snap_remove: 50
      snap_rollback: 50

Actions #5

Updated by Sage Weil about 12 years ago

  • Status changed from New to Need More Info

Pushed a patch to include the pointer in get/put object_context to help narrow this down.
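The idea behind that sort of instrumentation (sketched below with hypothetical names, not the actual patch) is to log the context's address together with the resulting refcount on every get/put, so that a put against an already-freed pointer, or get/put calls that don't pair up on the same address, stands out when correlating the OSD log with the crash:

#include <cstdio>

struct ObjCtx { int ref; };

// Hypothetical instrumented get/put: print the pointer and the new
// refcount so stale or mismatched pointers are visible in the debug log.
void get_obc_logged(ObjCtx *obc) {
  obc->ref++;
  fprintf(stderr, "get_object_context %p ref now %d\n", (void*)obc, obc->ref);
}

void put_obc_logged(ObjCtx *obc) {
  fprintf(stderr, "put_object_context %p ref now %d\n", (void*)obc, obc->ref - 1);
  if (--obc->ref == 0)
    delete obc;
}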

Actions #6

Updated by Sage Weil about 12 years ago

  • Subject changed from osd: suicide timeout on recovery_tp to osd: suicide timeout on recovery_tp... heap corruption?
Actions #7

Updated by Samuel Just about 12 years ago

  • Status changed from Need More Info to Duplicate

Same as #1978.
