Project

General

Profile

Actions

Bug #3678

closed

osd: tcmalloc segfault in PG::CephPeeringEvt::CephPeeringEvt<PG::MNotifyRec>()

Added by Sage Weil over 11 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

   -35> 2012-12-24 11:54:50.445332 7fca638d1700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fca638d1700

 ceph version 0.55.1-357-ga3aa00d (a3aa00d149829a688d5c2a7be53d07a482d8fc2a)
 1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x1068a37]
 2: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x10cdf16]
 3: (()+0xfcb0) [0x7fca70a70cb0]
 4: (tcmalloc::CentralFreeList::FetchFromSpans()+0x27) [0x7fca6fb16df7]
 5: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x107) [0x7fca6fb17167]
 6: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long)+0x5d) [0x7fca6fb19cad]
 7: (tc_new()+0x486) [0x7fca6fb28c76]
 8: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7fca6f46ca89]
 9: (char* std::string::_S_construct<char*>(char*, char*, std::allocator<char> const&, std::forward_iterator_tag)+0xe6) [0x118953a]
 10: (char* std::string::_S_construct_aux<char*>(char*, char*, std::allocator<char> const&, std::__false_type)+0x2f) [0x1189451]
 11: (char* std::string::_S_construct<char*>(char*, char*, std::allocator<char> const&)+0x2f) [0x1189420]
 12: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<char*>(char*, char*, std::allocator<char> const&)+0x2f) [0x11893d9]
 13: (std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >::str() const+0x78) [0x7fca6f468828]
 14: (PG::CephPeeringEvt::CephPeeringEvt<PG::MNotifyRec>(unsigned int, unsigned int, PG::MNotifyRec const&)+0x11c) [0xef1d80]
 15: (PG::queue_notify(unsigned int, unsigned int, int, pg_notify_t&)+0x17a) [0xfb9906]
 16: (OSD::handle_pg_notify(std::tr1::shared_ptr<OpRequest>)+0x632) [0xec8472]
 17: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0xc5) [0xeb6233]
 18: (OSD::_dispatch(Message*)+0x81b) [0xeb6f11]
 19: (OSD::ms_dispatch(Message*)+0x1b1) [0xeb50cd]
 20: (Messenger::ms_deliver_dispatch(Message*)+0x9b) [0x1271a49]
 21: (DispatchQueue::entry()+0x549) [0x12711b1]
 22: (DispatchQueue::DispatchThread::entry()+0x1c) [0x1191d6e]
 23: (Thread::_entry_func(void*)+0x23) [0x1199aed]
 24: (()+0x7e9a) [0x7fca70a68e9a]
 25: (clone()+0x6d) [0x7fca6ebf34bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

on ubuntu@teuthology:/a/sage-peer3/25356.

full osd and ms log!

Actions #1

Updated by Sage Weil over 11 years ago

another one:

Thread 1 (Thread 0x7f970b09a700 (LWP 29228)):
#0  0x00007f971ed028ee in tc_new () from /usr/lib/libtcmalloc.so.0
#1  0x00007f971e646a89 in std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00000000007cf113 in _S_construct<char*> (__a=..., __end=<optimized out>, __beg=0x2d31b58 "ceph version 0.55.1-384-gd9673ca (d9673ca3248d729c8e4f3c47db7a9ebb07ee97fd)l-les=172 n=0 ec=1 les/c 172/153 171/171/152)\n") at /usr/include/c++/4.6/bits/basic_string.tcc:140
#3  _S_construct_aux<char*> (__a=..., __end=<optimized out>, __beg=0x2d31b58 "ceph version 0.55.1-384-gd9673ca (d9673ca3248d729c8e4f3c47db7a9ebb07ee97fd)l-les=172 n=0 ec=1 les/c 172/153 171/171/152)\n") at /usr/include/c++/4.6/bits/basic_string.h:1692
#4  _S_construct<char*> (__a=..., __end=<optimized out>, __beg=0x2d31b58 "ceph version 0.55.1-384-gd9673ca (d9673ca3248d729c8e4f3c47db7a9ebb07ee97fd)l-les=172 n=0 ec=1 les/c 172/153 171/171/152)\n") at /usr/include/c++/4.6/bits/basic_string.h:1713
#5  std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<char*> (this=0x7f970b0983d0, __beg=0x2d31b58 "ceph version 0.55.1-384-gd9673ca (d9673ca3248d729c8e4f3c47db7a9ebb07ee97fd)l-les=172 n=0 ec=1 les/c 172/153 171/171/152)\n", __end=<optimized out>, __a=...)
    at /usr/include/c++/4.6/bits/basic_string.tcc:231
#6  0x0000000000a2a7c5 in str (this=0x7f970b098268) at /usr/include/c++/4.6/sstream:129
#7  str (this=0x7f970b098260) at /usr/include/c++/4.6/sstream:458
#8  pretty_version_to_str () at common/version.cc:40
#9  0x000000000089d428 in ceph::BackTrace::print (this=0x7f970b0984c0, out=...) at common/BackTrace.cc:19
#10 0x0000000000899d53 in handle_fatal_signal (signum=11) at global/signal_handler.cc:91
#11 <signal handler called>
#12 0x00007f971ed028ee in tc_new () from /usr/lib/libtcmalloc.so.0
#13 0x00007f971e646a89 in std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007f971e6477fb in std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#15 0x00007f971e6478e0 in std::string::reserve(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#16 0x00007f971e647a3f in std::string::append(unsigned long, char) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#17 0x0000000000955144 in resize (__n=<optimized out>, this=0x3bda4d0) at /usr/include/c++/4.6/bits/basic_string.h:749
#18 PrebufferedStreambuf::overflow (this=<optimized out>, c=32) at common/PrebufferedStreambuf.cc:18
#19 0x00007f971e645085 in std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#20 0x00007f971e63c055 in std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#21 0x00007f971e63c2ef in std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#22 0x0000000000a5b4cd in Pipe::_pipe_prefix (this=0x2fa66c0, _dout=<optimized out>) at msg/Pipe.cc:43
#23 0x0000000000a693b2 in Pipe::fault (this=0x2fa66c0, onread=<optimized out>) at msg/Pipe.cc:1150
#24 0x0000000000a7b28d in Pipe::reader (this=0x2fa66c0) at msg/Pipe.cc:1244
#25 0x0000000000a7e785 in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47
#26 0x00000000009665e2 in Thread::_entry_func (arg=<optimized out>) at common/Thread.cc:41
#27 0x00007f971fc42e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#28 0x00007f971ddcd4bd in klogctl () from /lib/x86_64-linux-gnu/libc.so.6
#29 0x0000000000000000 in ?? ()
(gdb) 
(gdb) 
(gdb) t 51
[Switching to thread 51 (Thread 0x7f970a38d700 (LWP 28092))]
#0  0x00007f971ddc1b03 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f971ddc1b03 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f971ecf3e77 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.0
#2  0x00007f971fc46104 in _L_unlock_644 () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f971fc46063 in pthread_mutex_unlock () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00000000026d1d80 in ?? ()
#5  0x00000000026d1e90 in ?? ()
#6  0x0000000002a3b8c0 in ?? ()
#7  0x0000000002d51000 in ?? ()
#8  0x0000000000a7ac02 in Pipe::reader (this=0x20) at msg/Pipe.cc:1241
#9  0x0000000000a7e785 in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47
#10 0x00000000009665e2 in Thread::_entry_func (arg=<optimized out>) at common/Thread.cc:41
#11 0x00007f971fc42e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#12 0x00007f971ddcd4bd in klogctl () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x0000000000000000 in ?? ()

ubuntu@teuthology:/a/teuthology-2012-12-25_03:00:02-regression-master-testing-gcov/27524
job was
kernel:
  kdb: true
  sha1: ec18aeecd4de479601363849d489668d8f12410c
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        rbd cache: false
      global:
        ms inject socket failures: 5000
    coverage: true
    fs: ext4
    log-whitelist:
    - slow request
    sha1: d9673ca3248d729c8e4f3c47db7a9ebb07ee97fd
  s3tests:
    branch: master
  workunit:
    sha1: d9673ca3248d729c8e4f3c47db7a9ebb07ee97fd
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rbd_fsx:
    clients:
    - client.0
    ops: 2000

Actions #2

Updated by Sage Weil over 11 years ago

  • Priority changed from High to Urgent

another run:

(gdb) info th
  Id   Target Id         Frame 
  52   Thread 0x7f70b612e700 (LWP 23111) "ceph-osd" 0x00007f70b73699e3 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib/libtcmalloc.so.0
  51   Thread 0x7f70b592d700 (LWP 23113) "ceph-osd" 0x00007f70b73699e3 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib/libtcmalloc.so.0
  50   Thread 0x7f70b512c700 (LWP 23114) "ceph-osd" 0x00007f70b642eb03 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
  49   Thread 0x7f70b0122700 (LWP 23153) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  48   Thread 0x7f70b0923700 (LWP 23154) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  47   Thread 0x7f70b1124700 (LWP 23155) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  46   Thread 0x7f70b3929700 (LWP 23156) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  45   Thread 0x7f70b492b700 (LWP 23157) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  44   Thread 0x7f70b412a700 (LWP 23158) "ceph-osd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
  43   Thread 0x7f70b3128700 (LWP 23159) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  42   Thread 0x7f70b2927700 (LWP 23205) "ceph-osd" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
  41   Thread 0x7f70b2126700 (LWP 23206) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  40   Thread 0x7f70b1925700 (LWP 23207) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  39   Thread 0x7f70af921700 (LWP 23208) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  38   Thread 0x7f70af120700 (LWP 23209) "ceph-osd" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
  37   Thread 0x7f70ae91f700 (LWP 23210) "ceph-osd" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
  36   Thread 0x7f70ae11e700 (LWP 23211) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  35   Thread 0x7f70ad91d700 (LWP 23212) "ceph-osd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
  34   Thread 0x7f70ad11c700 (LWP 23213) "ceph-osd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
  33   Thread 0x7f70ac91b700 (LWP 23214) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  32   Thread 0x7f70ac11a700 (LWP 23215) "ceph-osd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
  31   Thread 0x7f70ab919700 (LWP 23216) "ceph-osd" 0x00007f70b642eb03 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
  30   Thread 0x7f70ab118700 (LWP 23217) "ceph-osd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
  29   Thread 0x7f70aa917700 (LWP 23218) "ceph-osd" 0x00007f70b642eb03 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
  28   Thread 0x7f70aa116700 (LWP 23219) "ceph-osd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
  27   Thread 0x7f70a9915700 (LWP 23220) "ceph-osd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
  26   Thread 0x7f70a9114700 (LWP 23221) "ceph-osd" 0x00007f70b642eb03 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
  25   Thread 0x7f70a8913700 (LWP 23222) "ceph-osd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
  24   Thread 0x7f70a8112700 (LWP 23223) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  23   Thread 0x7f70a7911700 (LWP 23224) "ceph-osd" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
  22   Thread 0x7f70a7110700 (LWP 23225) "ceph-osd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
  21   Thread 0x7f70a690f700 (LWP 23226) "ceph-osd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
  20   Thread 0x7f70a610e700 (LWP 23227) "ceph-osd" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
  19   Thread 0x7f70a590d700 (LWP 23228) "ceph-osd" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
  18   Thread 0x7f70a510c700 (LWP 23229) "ceph-osd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
  17   Thread 0x7f70a490b700 (LWP 23230) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  16   Thread 0x7f70a410a700 (LWP 23231) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  15   Thread 0x7f70a3808700 (LWP 23268) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  14   Thread 0x7f70a3404700 (LWP 23284) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  13   Thread 0x7f70a3909700 (LWP 23346) "ceph-osd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
  12   Thread 0x7f70b88bb700 (LWP 23347) "ceph-osd" 0x00007f70b73699e3 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib/libtcmalloc.so.0
  11   Thread 0x7f70a2bfc700 (LWP 23392) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  10   Thread 0x7f70a29fa700 (LWP 23539) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  9    Thread 0x7f70a3202700 (LWP 24062) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  8    Thread 0x7f70a3606700 (LWP 24513) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  7    Thread 0x7f70a3303700 (LWP 24679) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  6    Thread 0x7f70a22f3700 (LWP 24707) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  5    Thread 0x7f70a1bec700 (LWP 24768) "ceph-osd" 0x00007f70b73699e3 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib/libtcmalloc.so.0
  4    Thread 0x7f70a1aeb700 (LWP 24769) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  3    Thread 0x7f70a24f5700 (LWP 24794) "ceph-osd" 0x00007f70b73699e3 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib/libtcmalloc.so.0
  2    Thread 0x7f70a1eef700 (LWP 24797) "ceph-osd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
* 1    Thread 0x7f70b88d5780 (LWP 23104) "ceph-osd" 0x00007f70b82b1148 in pthread_join (threadid=140121899902720, thread_return=0x0) at pthread_join.c:89
(gdb) f 3
#3  0x000000000095c120 in SimpleMessenger::wait (this=0x24c9000) at msg/SimpleMessenger.cc:490
490     msg/SimpleMessenger.cc: No such file or directory.
(gdb) t 3
[Switching to thread 3 (Thread 0x7f70a24f5700 (LWP 24794))]
#0  0x00007f70b73699e3 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib/libtcmalloc.so.0
(gdb) bt
#0  0x00007f70b73699e3 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib/libtcmalloc.so.0
#1  0x00007f70b73698d2 in SpinLock::SlowLock() () from /usr/lib/libtcmalloc.so.0
#2  0x00007f70b735e0b4 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /usr/lib/libtcmalloc.so.0
#3  0x00007f70b7360cad in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) () from /usr/lib/libtcmalloc.so.0
#4  0x00007f70b736fc76 in tc_new () from /usr/lib/libtcmalloc.so.0
#5  0x00007f70b6cb3a89 in std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f70b6cb47fb in std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f70b6cb48e0 in std::string::reserve(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f70b6cb4a3f in std::string::append(unsigned long, char) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x0000000000955295 in resize (__n=160, this=0x29cbcf0) at /usr/include/c++/4.6/bits/basic_string.h:749
#10 PrebufferedStreambuf::overflow (this=0x29cbca0, c=116) at common/PrebufferedStreambuf.cc:20
#11 0x00007f70b6cb2085 in std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x00007f70b6ca9055 in std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x000000000089d8a4 in operator<< <std::char_traits<char> > (__s=0x2a084a0 "/tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x899cfa]", __out=...) at /usr/include/c++/4.6/ostream:515
#14 ceph::BackTrace::print (this=0x7f70a24f3640, out=...) at common/BackTrace.cc:60
#15 0x0000000000899e56 in handle_fatal_signal (signum=11) at global/signal_handler.cc:97
#16 <signal handler called>
#17 0x00007f70b735ddf7 in tcmalloc::CentralFreeList::FetchFromSpans() () from /usr/lib/libtcmalloc.so.0
#18 0x00007f70b735e167 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /usr/lib/libtcmalloc.so.0
#19 0x00007f70b7360cad in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) () from /usr/lib/libtcmalloc.so.0
#20 0x00007f70b736fc76 in tc_new () from /usr/lib/libtcmalloc.so.0
#21 0x00007f70b6cb3a89 in std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#22 0x00007f70b6cb47fb in std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#23 0x00007f70b6cb48e0 in std::string::reserve(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#24 0x00007f70b6cb4a3f in std::string::append(unsigned long, char) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#25 0x0000000000955295 in resize (__n=160, this=0x26be670) at /usr/include/c++/4.6/bits/basic_string.h:749
#26 PrebufferedStreambuf::overflow (this=0x26be620, c=111) at common/PrebufferedStreambuf.cc:20
#27 0x00007f70b6cb2085 in std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#28 0x00007f70b6ca9055 in std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#29 0x00007f70b6ca92ef in std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#30 0x00000000006b9445 in MOSDPGLog::print (this=0x4c6d000, out=...) at ./messages/MOSDPGLog.h:57
#31 0x0000000000a7c5c2 in operator<< (m=..., out=...) at msg/Message.h:480
#32 Pipe::reader (this=0x47e0000) at msg/Pipe.cc:1296
#33 0x0000000000a7e785 in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47
#34 0x00000000009665e2 in Thread::_entry_func (arg=<optimized out>) at common/Thread.cc:41
#35 0x00007f70b82afe9a in start_thread (arg=0x7f70a24f5700) at pthread_create.c:308
#36 0x00007f70b643a4bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#37 0x0000000000000000 in ?? ()
(gdb) t 3
[Switching to thread 3 (Thread 0x7f70a24f5700 (LWP 24794))]
#0  0x00007f70b73699e3 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib/libtcmalloc.so.0
(gdb) t 5
[Switching to thread 5 (Thread 0x7f70a1bec700 (LWP 24768))]
#0  0x00007f70b73699e3 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib/libtcmalloc.so.0
(gdb) bt
#0  0x00007f70b73699e3 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib/libtcmalloc.so.0
#1  0x00007f70b73698d2 in SpinLock::SlowLock() () from /usr/lib/libtcmalloc.so.0
#2  0x00007f70b735e0b4 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /usr/lib/libtcmalloc.so.0
#3  0x00007f70b7360cad in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) () from /usr/lib/libtcmalloc.so.0
#4  0x00007f70b7370b26 in tc_newarray () from /usr/lib/libtcmalloc.so.0
#5  0x0000000000982457 in raw_char (l=177, this=<optimized out>) at common/buffer.cc:196
#6  ceph::buffer::create (len=177) at common/buffer.cc:231
#7  0x0000000000a6210b in Pipe::read_message (this=0x25f3b40, pm=0x7f70a1bebda0) at msg/Pipe.cc:1580
#8  0x0000000000a7b4aa in Pipe::reader (this=0x25f3b40) at msg/Pipe.cc:1272
#9  0x0000000000a7e785 in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47
#10 0x00000000009665e2 in Thread::_entry_func (arg=<optimized out>) at common/Thread.cc:41
#11 0x00007f70b82afe9a in start_thread (arg=0x7f70a1bec700) at pthread_create.c:308
#12 0x00007f70b643a4bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#13 0x0000000000000000 in ?? ()
(gdb) t 51
[Switching to thread 51 (Thread 0x7f70b592d700 (LWP 23113))]
#0  0x00007f70b73699e3 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib/libtcmalloc.so.0
(gdb) bt
#0  0x00007f70b73699e3 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib/libtcmalloc.so.0
#1  0x00007f70b73698d2 in SpinLock::SlowLock() () from /usr/lib/libtcmalloc.so.0
#2  0x00007f70b735e0b4 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /usr/lib/libtcmalloc.so.0
#3  0x00007f70b7360cad in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) () from /usr/lib/libtcmalloc.so.0
#4  0x00007f70b736fc76 in tc_new () from /usr/lib/libtcmalloc.so.0
#5  0x00007f70b6cb3a89 in std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f70b6cb47fb in std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f70b6cb48e0 in std::string::reserve(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f70b6cb4a3f in std::string::append(unsigned long, char) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x0000000000955295 in resize (__n=160, this=0x297c670) at /usr/include/c++/4.6/bits/basic_string.h:749
#10 PrebufferedStreambuf::overflow (this=0x297c620, c=52) at common/PrebufferedStreambuf.cc:20
#11 0x00007f70b6cb2085 in std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x00007f70b6ca9055 in std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x00007f70b6ca92ef in std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x0000000000981e45 in ceph::__ceph_assert_fail (assertion=0xb69f78 "0 == \"hit suicide timeout\"", file=<optimized out>, line=78, func=0xb6a100 "bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)") at common/assert.cc:69
#15 0x00000000008b9e15 in ceph::HeartbeatMap::_check (this=<optimized out>, h=<optimized out>, who=0xb69fed "is_healthy", now=<optimized out>) at common/HeartbeatMap.cc:78
#16 0x00000000008baf3f in ceph::HeartbeatMap::is_healthy (this=0x24ae0e0) at common/HeartbeatMap.cc:118
#17 0x00000000008bb34b in ceph::HeartbeatMap::check_touch_file (this=0x24ae0e0) at common/HeartbeatMap.cc:129
#18 0x0000000000994b9d in CephContextServiceThread::entry (this=0x24ac1c0) at common/ceph_context.cc:68
#19 0x00000000009665e2 in Thread::_entry_func (arg=<optimized out>) at common/Thread.cc:41
#20 0x00007f70b82afe9a in start_thread (arg=0x7f70b592d700) at pthread_create.c:308
#21 0x00007f70b643a4bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#22 0x0000000000000000 in ?? ()
(gdb) t 52
[Switching to thread 52 (Thread 0x7f70b612e700 (LWP 23111))]
#0  0x00007f70b73699e3 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib/libtcmalloc.so.0
(gdb) bt
#0  0x00007f70b73699e3 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib/libtcmalloc.so.0
#1  0x00007f70b73698d2 in SpinLock::SlowLock() () from /usr/lib/libtcmalloc.so.0
#2  0x00007f70b735dd27 in tcmalloc::CentralFreeList::InsertRange(void*, void*, int) () from /usr/lib/libtcmalloc.so.0
#3  0x00007f70b7360e77 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.0
#4  0x00007f70b7360ec3 in tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned long) () from /usr/lib/libtcmalloc.so.0
#5  0x00007f70b73703ab in tc_delete () from /usr/lib/libtcmalloc.so.0
#6  0x00000000008c4397 in _M_dispose (__a=..., this=<optimized out>) at /usr/include/c++/4.6/bits/basic_string.h:244
#7  ~basic_string (this=0x29cfe90, __in_chrg=<optimized out>) at /usr/include/c++/4.6/bits/basic_string.h:534
#8  ~PrebufferedStreambuf (this=0x29cfe40, __in_chrg=<optimized out>) at ./common/PrebufferedStreambuf.h:17
#9  ~Entry (this=0x29cfdd0, __in_chrg=<optimized out>) at log/Entry.h:17
#10 ceph::log::Log::flush (this=0x24b3000) at log/Log.cc:178
#11 0x00000000008c4488 in ceph::log::Log::entry (this=0x24b3000) at log/Log.cc:313
#12 0x00000000009665e2 in Thread::_entry_func (arg=<optimized out>) at common/Thread.cc:41
#13 0x00007f70b82afe9a in start_thread (arg=0x7f70b612e700) at pthread_create.c:308
#14 0x00007f70b643a4bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#15 0x0000000000000000 in ?? ()

job
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2012-12-25_03:00:02-regression-master-testing-gcov/27526$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: ec18aeecd4de479601363849d489668d8f12410c
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        rbd cache: true
      global:
        ms inject socket failures: 5000
    coverage: true
    fs: xfs
    log-whitelist:
    - slow request
    sha1: d9673ca3248d729c8e4f3c47db7a9ebb07ee97fd
  s3tests:
    branch: master
  workunit:
    sha1: d9673ca3248d729c8e4f3c47db7a9ebb07ee97fd
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rbd_fsx:
    clients:
    - client.0
    ops: 2000

Actions #3

Updated by Sage Weil over 11 years ago

this reproduced once out of ~60 runs on the fsx task.

ubuntu@teuthology:/a/sage-tc1/28508

2012-12-27 17:04:39.112558 7f9fdd525700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f9fdd525700

 ceph version 0.55.1-360-g6356739 (635673928a6b4dae6d4712cacad81cbac6412dc3)
 1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x89b36a]
 2: (()+0xfcb0) [0x7f9fe96c2cb0]
 3: (tc_new()+0xfe) [0x7f9fe877a8ee]
 4: (std::vector<osd_xinfo_t, std::allocator<osd_xinfo_t> >::_M_fill_insert(__gnu_cxx::__normal_iterator<osd_xinfo_t*, std::vector<osd_xinfo_t, std::allocator<osd_xinfo_t> > >, unsigned long, osd_xinfo_t const&)+0x370) [0x94f6c0]
 5: (OSDMap::decode(ceph::buffer::list::iterator&)+0x13fe) [0x944fbe]
 6: (OSDMap::decode(ceph::buffer::list&)+0x4e) [0x94546e]
 7: (OSD::handle_osd_map(MOSDMap*)+0x1e0a) [0x6b319a]
 8: (OSD::_dispatch(Message*)+0x3d3) [0x6b67e3]
 9: (OSD::ms_dispatch(Message*)+0x288) [0x6b73a8]
 10: (Messenger::ms_deliver_dispatch(Message*)+0x86) [0xa52596]
 11: (DispatchQueue::entry()+0x270) [0xa50420]
 12: (DispatchQueue::DispatchThread::entry()+0x15) [0x964515]
 13: (Thread::_entry_func(void*)+0x12) [0x967b12]
 14: (()+0x7e9a) [0x7f9fe96bae9a]
 15: (clone()+0x6d) [0x7f9fe78454bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

will see if i can trigger it on some older code.

Actions #4

Updated by Josh Durgin over 11 years ago

Possibly the same bug in teuthology:/a/joshd-3631-12-28-12_08.55/30739

 ceph version 0.55.1-369-g942c714 (942c71454b11d3bfb09ee4abe65f1e40936d3bf6)
 1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x1069a83]
 2: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x10cef66]
 3: (()+0xfcb0) [0x7f64ed407cb0]
 4: (tcmalloc::CentralFreeList::FetchFromSpans()+0x27) [0x7f64ec4addf7]
 5: (tcmalloc::CentralFreeList::FetchFromSpansSafe()+0x9) [0x7f64ec4ae039]
 6: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x77) [0x7f64ec4ae0d7]
 7: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long)+0x5d) [0x7f64ec4b0cad]
 8: (tc_new()+0x486) [0x7f64ec4bfc76]
 9: (__gnu_cxx::new_allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > >::allocate(unsigned lo
ng, void const*)+0x43) [0xf1d5d1]
 10: (std::_Vector_base<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_
notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > >::_M_allocate(unsigned long)+0x2f) [0xf09223]
 11: (std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify
_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > >::_M_fill_insert(__gnu_cxx::__normal_iterator<std::pair<pg_notify_t, 
std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >*, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, s
td::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocato
r<std::pair<unsigned int const, pg_interval_t> > > > > > >, unsigned long, std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const,
 pg_interval_t> > > > const&)+0x355) [0xf0ef09]
 12: (std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify
_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > >::insert(__gnu_cxx::__normal_iterator<std::pair<pg_notify_t, std::map
<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >*, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less
<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::p
air<unsigned int const, pg_interval_t> > > > > > >, unsigned long, std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_inte
rval_t> > > > const&)+0x30) [0xefc2e0]
 13: (std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify
_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > >::resize(unsigned long, std::pair<pg_notify_t, std::map<unsigned int,
 pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >)+0x67) [0xeec685]
 14: (MOSDPGInfo::decode_payload()+0x91) [0xee3539]
 15: (decode_message(CephContext*, ceph_msg_header&, ceph_msg_footer&, ceph::buffer::list&, ceph::buffer::list&, ceph::buffer::list&)+0x1d9f) [0x10f1943]
 16: (Pipe::read_message(Message**)+0x1c7c) [0x128eaf6]
 17: (Pipe::reader()+0xa24) [0x128aa52]
 18: (Pipe::Reader::entry()+0x1c) [0x1291b22]
 19: (Thread::_entry_func(void*)+0x23) [0x119b399]
 20: (()+0x7e9a) [0x7f64ed3ffe9a]
 21: (clone()+0x6d) [0x7f64eb58a4bd]

Job was:

nuke-on-error: true
overrides:
  ceph:
    branch: next
    conf:
      client:
        debug ms: 1
      global:
        ms inject socket failures: 5000
      osd:
        debug ms: 1
    fs: xfs
    log-whitelist:
    - slow request
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 50
    op_weights:
      delete: 50
      read: 100
      snap_create: 50
      snap_remove: 50
      snap_rollback: 50
      write: 100
    ops: 4000

I'm adding debug osd = 20 to future runs.

Actions #5

Updated by Sage Weil over 11 years ago

another one with full osd logs:


ubuntu@teuthology:/a/sage-a1/33277$ zcat remote/*/log/osd.1* | tail -n 50
2013-01-02 00:39:56.943383 7f3d7c665700 10 osd.1 pg_epoch: 762 pg[0.f( v 762'339 (0'0,762'339] local-les=761 n=18 ec=1 les/c 761/731 761/761/707) [1,4] r=0 lpr=761 pi=708-760/3 mlcod 662'245 active+recovering snaptrimq=[145~1,16d~1,186~1,1d2~3]] put_object_context 0x2942600 955f9c7f/plana6421865-38/head//0 1 -> 0
2013-01-02 00:39:56.943405 7f3d7c665700 15 osd.1 pg_epoch: 762 pg[0.f( v 762'339 (0'0,762'339] local-les=761 n=18 ec=1 les/c 761/731 761/761/707) [1,4] r=0 lpr=761 pi=708-760/3 mlcod 662'245 active+recovering snaptrimq=[145~1,16d~1,186~1,1d2~3]]  requeue_ops 
2013-01-02 00:39:56.943426 7f3d7c665700 10 osd.1 pg_epoch: 762 pg[0.f( v 762'339 (0'0,762'339] local-les=761 n=18 ec=1 les/c 761/731 761/761/707) [1,4] r=0 lpr=761 pi=708-760/3 mlcod 662'245 active+recovering snaptrimq=[145~1,16d~1,186~1,1d2~3]] put_snapset_context plana6421865-38 1 -> 0
2013-01-02 00:39:56.943467 7f3d7c665700 10 osd.1 pg_epoch: 762 pg[0.f( v 762'339 (0'0,762'339] local-les=761 n=18 ec=1 les/c 761/731 761/761/707) [1,4] r=0 lpr=761 pi=708-760/3 mlcod 662'245 active+recovering snaptrimq=[145~1,16d~1,186~1,1d2~3]] eval_repop repgather(0x604f120 applied 762'339 rep_tid=1978 wfack= wfdisk= op=osd_op(client.4118.0:2702 plana6421865-38 [write 799931~481579] 0.955f9c7f snapc 1d6=[1d6,1d5,1d1,1d0,1cf,1cb,1c9,1c7,1c6,1c2,1bd,1bb,1b9,1b8,1b7,1ae,1ac,1a9,1a5,1a3,1a2,19f,192,185,17c,178,15f,129]) v4) wants=ad
2013-01-02 00:39:56.943516 7f3d7c665700 15 osd.1 pg_epoch: 762 pg[0.f( v 762'339 (0'0,762'339] local-les=761 n=18 ec=1 les/c 761/731 761/761/707) [1,4] r=0 lpr=761 pi=708-760/3 mlcod 662'245 active+recovering snaptrimq=[145~1,16d~1,186~1,1d2~3]] log_op_stats osd_op(client.4118.0:2702 plana6421865-38 [write 799931~481579] 0.955f9c7f snapc 1d6=[1d6,1d5,1d1,1d0,1cf,1cb,1c9,1c7,1c6,1c2,1bd,1bb,1b9,1b8,1b7,1ae,1ac,1a9,1a5,1a3,1a2,19f,192,185,17c,178,15f,129]) v4 inb 481703 outb 0 rlat 0.023600 lat 0.049485
2013-01-02 00:39:56.943564 7f3d7c665700 15 osd.1 pg_epoch: 762 pg[0.f( v 762'339 (0'0,762'339] local-les=761 n=18 ec=1 les/c 761/731 761/761/707) [1,4] r=0 lpr=761 pi=708-760/3 mlcod 662'245 active+recovering snaptrimq=[145~1,16d~1,186~1,1d2~3]] update_stats 707'116
2013-01-02 00:39:56.943590 7f3d7c665700 10 osd.1 pg_epoch: 762 pg[0.f( v 762'339 (0'0,762'339] local-les=761 n=18 ec=1 les/c 761/731 761/761/707) [1,4] r=0 lpr=761 pi=708-760/3 mlcod 662'245 active+recovering snaptrimq=[145~1,16d~1,186~1,1d2~3]]  removing repgather(0x604f120 applied 762'339 rep_tid=1978 wfack= wfdisk= op=osd_op(client.4118.0:2702 plana6421865-38 [write 799931~481579] 0.955f9c7f snapc 1d6=[1d6,1d5,1d1,1d0,1cf,1cb,1c9,1c7,1c6,1c2,1bd,1bb,1b9,1b8,1b7,1ae,1ac,1a9,1a5,1a3,1a2,19f,192,185,17c,178,15f,129]) v4)
2013-01-02 00:39:56.943631 7f3d7c665700 20 osd.1 pg_epoch: 762 pg[0.f( v 762'339 (0'0,762'339] local-les=761 n=18 ec=1 les/c 761/731 761/761/707) [1,4] r=0 lpr=761 pi=708-760/3 mlcod 662'245 active+recovering snaptrimq=[145~1,16d~1,186~1,1d2~3]]    q front is repgather(0x604f120 applied 762'339 rep_tid=1978 wfack= wfdisk= op=osd_op(client.4118.0:2702 plana6421865-38 [write 799931~481579] 0.955f9c7f snapc 1d6=[1d6,1d5,1d1,1d0,1cf,1cb,1c9,1c7,1c6,1c2,1bd,1bb,1b9,1b8,1b7,1ae,1ac,1a9,1a5,1a3,1a2,19f,192,185,17c,178,15f,129]) v4)
2013-01-02 00:39:56.967406 7f3d7c665700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f3d7c665700

 ceph version 0.56-103-gf1196c7 (f1196c7e93af83405ea5082030a7899e256ded7a)
 1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x106865b]
 2: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x10cdb46]
 3: (()+0xfcb0) [0x7f3d86fffcb0]
 4: (tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)+0xeb) [0x7f3d860a8e3b]
 5: (tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned long)+0x23) [0x7f3d860a8ec3]
 6: (operator delete(void*)+0x28b) [0x7f3d860b83ab]
 7: (__gnu_cxx::new_allocator<pg_log_entry_t>::deallocate(pg_log_entry_t*, unsigned long)+0x20) [0xe72b4a]
 8: (std::_Vector_base<pg_log_entry_t, std::allocator<pg_log_entry_t> >::_M_deallocate(pg_log_entry_t*, unsigned long)+0x32) [0xe5d8c2]
 9: (std::_Vector_base<pg_log_entry_t, std::allocator<pg_log_entry_t> >::~_Vector_base()+0x56) [0xe4ae22]
 10: (std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >::~vector()+0x42) [0xe3b724]
 11: (ReplicatedPG::OpContext::~OpContext()+0x8f) [0xe32953]
 12: (ReplicatedPG::RepGather::put()+0xbe) [0xe32e1a]
 13: (ReplicatedPG::op_applied(ReplicatedPG::RepGather*)+0x964) [0xdfe740]
 14: (C_OSD_OpApplied::finish(int)+0x2a) [0xe36de2]
 15: (Finisher::finisher_thread_entry()+0x342) [0x10dd946]
 16: (Finisher::FinisherThread::entry()+0x1c) [0xeda216]
 17: (Thread::_entry_func(void*)+0x23) [0x11983b5]
 18: (()+0x7e9a) [0x7f3d86ff7e9a]
 19: (clone()+0x6d) [0x7f3d851824bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #6

Updated by Sage Weil over 11 years ago

with glibc malloc and debug enabled:

ubuntu@teuthology:/a/sage-a2/33535$ zless remote/*/log/osd.4*

ubuntu@teuthology:/a/sage-a2/33564$ zless remote/*/log/osd.1.*

2013-01-02 14:15:31.545161 7f82d84f5700  1 -- 10.214.132.34:6801/6641 >> :/0 pipe(0x7f8280002670 sd=45 :6801 pgs=0 cs=0 l=0).accept sd=45 10.214.132.35:48440/0
2013-01-02 14:15:31.545356 7f82cc7f0700 10 osd.1 1288 OSD::ms_get_authorizer type=osd
2013-01-02 14:15:31.545587 7f82d84f5700 10 osd.1 1288  new session 0x7f81e0006040 con=0x7f8280001070 addr=10.214.132.35:6807/4044
2013-01-02 14:15:31.545779 7f82d84f5700 10 osd.1 1288  session 0x7f81e0006040 osd.3 has caps osdcap[grant(*)] 'allow *'
2013-01-02 14:15:31.545803 7f82d84f5700  0 -- 10.214.132.34:6801/6641 >> 10.214.132.35:6807/4044 pipe(0x7f8280002670 sd=45 :6801 pgs=0 cs=0 l=0).accept connect_seq 76 vs existing 76 state connecting
2013-01-02 14:15:31.546054 7f82d37fe700  1 -- 10.214.132.34:6801/6641 <== osd.3 10.214.132.35:6807/4044 3371 ==== pg_info(1 pgs e1288:1.8) v3 ==== 516+0+0 (3272433763 0 0) 0x7f81e00048d0 con 0x7f82b800ce60
2013-01-02 14:15:31.546174 7f82d37fe700  7 osd.1 1288 handle_pg_info pg_info(1 pgs e1288:1.8) v3 from osd.3
2013-01-02 14:15:31.546212 7f82d37fe700 10 osd.1 pg_epoch: 1288 pg[1.8( empty local-les=1288 n=0 ec=1 les/c 1270/1270 1287/1287/1287) [1,3] r=0 lpr=1287 pi=1266-1286/2 mlcod 0'0 active] info 1.8( empty local-les=1288 n=0 ec=1 les/c 1288/1270 1287/1287/1287) from osd.3
2013-01-02 14:15:31.546418 7f82d84f5700  0 -- 10.214.132.34:6801/6641 >> 10.214.132.35:6807/4044 pipe(0x7f8280002670 sd=45 :6801 pgs=316 cs=77 l=0).reader got old message 3371 <= 3371 0x7f81e0004f70 pg_info(1 pgs e1288:1.8) v3, discarding
2013-01-02 14:15:31.548300 7f82d84f5700  0 -- 10.214.132.34:6801/6641 >> 10.214.132.35:6807/4044 pipe(0x7f8280002670 sd=45 :6801 pgs=316 cs=77 l=0).fault, initiating reconnect
...
2013-01-02 14:15:31.570776 7f82cf7f6700 10 osd.1 pg_epoch: 1288 pg[0.2( v 1268'248 lc 1237'243 (0'0,1268'248] local-les=1273 n=6 ec=1 les/c 1273/1234 1273/1273/1269) [1,3] r=0 lpr=1273 pi=1229-1272/6 lcod 1234'241 mlcod 0'0 active+recovering m=2 snaptrimq=[2d3~1,2d8~1,31c~1,320~1,32d~1,332~1,33f~3,343~1,346~4,34c~1,34e~1]] state<Started/Primary/Active>: Active: got notify from 5, already have info from that osd, ignoring2013-01-02 14:15:31.570764 7f82d84f5700 -1 
*** Caught signal (Aborted) **
 in thread 7f82d84f5700

 ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
 1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x106924b]
 2: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x10ce736]
 3: (()+0xfcb0) [0x7f82f1c05cb0]
 4: (gsignal()+0x35) [0x7f82eff39445]
 5: (abort()+0x17b) [0x7f82eff3cbab]
 6: (()+0x73e2e) [0x7f82eff76e2e]
 7: (()+0x7e626) [0x7f82eff81626]
 8: (__gnu_cxx::new_allocator<std::_List_node<ceph::buffer::ptr> >::deallocate(std::_List_node<ceph::buffer::ptr>*, unsigned long)+0x20) [0xdd2630]
 9: (std::_List_base<ceph::buffer::ptr, std::allocator<ceph::buffer::ptr> >::_M_put_node(std::_List_node<ceph::buffer::ptr>*)+0x28) [0xdd1ab4]
 10: (std::_List_base<ceph::buffer::ptr, std::allocator<ceph::buffer::ptr> >::_M_clear()+0x81) [0xdd1193]
 11: (std::_List_base<ceph::buffer::ptr, std::allocator<ceph::buffer::ptr> >::~_List_base()+0x19) [0xdd0a2d]
 12: (std::list<ceph::buffer::ptr, std::allocator<ceph::buffer::ptr> >::~list()+0x18) [0xdcfaa4]
 13: (ceph::buffer::list::~list()+0x29) [0xdcfb53]
 14: (Pipe::write_message(Message*)+0x10d8) [0x12905d8]
 15: (Pipe::writer()+0xc7e) [0x128bc28]
 16: (Pipe::Writer::entry()+0x1c) [0x1291372]
 17: (Thread::_entry_func(void*)+0x23) [0x119ab89]
 18: (()+0x7e9a) [0x7f82f1bfde9a]
 19: (clone()+0x6d) [0x7f82efff54bd]

identical fault pattern in accept() and then backtrace for both of them. the intervening activity is also surprisingly similar, but not identical.

Actions #7

Updated by Sage Weil over 11 years ago

  • Assignee set to Sage Weil
Actions #8

Updated by Sage Weil over 11 years ago

got msgr logs in ubuntu@teuthology:/a/sage-a3/34724, but the crash looked different from the earlier ones (whose logs have sadly been deleted). have not yet inspected the core.

Actions #9

Updated by Sage Weil over 11 years ago

  • Status changed from 12 to Fix Under Review

YAY, wip-3678 is consistently passing now.

Actions #10

Updated by Sage Weil over 11 years ago

  • Assignee changed from Sage Weil to Greg Farnum
Actions #12

Updated by Sage Weil over 11 years ago

  • Status changed from Fix Under Review to Resolved
Actions

Also available in: Atom PDF