Project

General

Profile

Actions

Bug #13749

closed

possible osd messenger deadlock "hit suicide timeout" in rados-infernalis-distro-basic-multi

Added by Yuri Weinstein over 8 years ago. Updated over 7 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ceph.com/teuthology-2015-11-04_21:00:09-rados-infernalis-distro-basic-multi/
Job: 1138922
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-11-04_21:00:09-rados-infernalis-distro-basic-multi/1138922/teuthology.log

2015-11-09T01:03:30.989 INFO:tasks.ceph.osd.0.plana44.stderr:2015-11-09 01:03:30.980798 7fe932cdb700 -1 osd.0 54 heartbeat_check: no reply from osd.5 since back 2015-11-09 01:01:23.162857 front 2015-11-09 01:01:23.162857 (cutoff 2015-11-09 01:03:10.980796)
2015-11-09T01:03:32.782 INFO:teuthology.orchestra.run.plana44:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph osd dump --format=json'
2015-11-09T01:03:33.890 INFO:tasks.ceph.osd.0.plana44.stderr:2015-11-09 01:03:33.881198 7fe932cdb700 -1 osd.0 54 heartbeat_check: no reply from osd.5 since back 2015-11-09 01:01:23.162857 front 2015-11-09 01:01:23.162857 (cutoff 2015-11-09 01:03:13.881196)
2015-11-09T01:03:34.990 INFO:tasks.ceph.osd.0.plana44.stderr:2015-11-09 01:03:34.981612 7fe932cdb700 -1 osd.0 54 heartbeat_check: no reply from osd.5 since back 2015-11-09 01:01:23.162857 front 2015-11-09 01:01:23.162857 (cutoff 2015-11-09 01:03:14.981611)
2015-11-09T01:03:35.230 INFO:tasks.ceph.osd.0.plana44.stderr:common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe9507ea700 time 2015-11-09 01:03:35.218348
2015-11-09T01:03:35.230 INFO:tasks.ceph.osd.0.plana44.stderr:common/HeartbeatMap.cc: 81: FAILED assert(0 == "hit suicide timeout")
2015-11-09T01:03:35.243 INFO:tasks.ceph.osd.0.plana44.stderr: ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
2015-11-09T01:03:35.244 INFO:tasks.ceph.osd.0.plana44.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fe955f8f99b]
2015-11-09T01:03:35.244 INFO:tasks.ceph.osd.0.plana44.stderr: 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2a9) [0x7fe955ed5a79]
2015-11-09T01:03:35.244 INFO:tasks.ceph.osd.0.plana44.stderr: 3: (ceph::HeartbeatMap::is_healthy()+0xb6) [0x7fe955ed6236]
2015-11-09T01:03:35.244 INFO:tasks.ceph.osd.0.plana44.stderr: 4: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x7fe955ed69f7]
2015-11-09T01:03:35.244 INFO:tasks.ceph.osd.0.plana44.stderr: 5: (CephContextServiceThread::entry()+0x14b) [0x7fe955fab0cb]
2015-11-09T01:03:35.244 INFO:tasks.ceph.osd.0.plana44.stderr: 6: (()+0x8182) [0x7fe9545af182]
2015-11-09T01:03:35.244 INFO:tasks.ceph.osd.0.plana44.stderr: 7: (clone()+0x6d) [0x7fe9528f738d]
2015-11-09T01:03:35.245 INFO:tasks.ceph.osd.0.plana44.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2015-11-09T01:03:35.245 INFO:tasks.ceph.osd.0.plana44.stderr:2015-11-09 01:03:35.234727 7fe9507ea700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe9507ea700 time 2015-11-09 01:03:35.218348
2015-11-09T01:03:35.245 INFO:tasks.ceph.osd.0.plana44.stderr:common/HeartbeatMap.cc: 81: FAILED assert(0 == "hit suicide timeout")
2015-11-09T01:03:35.245 INFO:tasks.ceph.osd.0.plana44.stderr:
2015-11-09T01:03:35.245 INFO:tasks.ceph.osd.0.plana44.stderr: ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
2015-11-09T01:03:35.245 INFO:tasks.ceph.osd.0.plana44.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fe955f8f99b]
2015-11-09T01:03:35.246 INFO:tasks.ceph.osd.0.plana44.stderr: 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2a9) [0x7fe955ed5a79]
2015-11-09T01:03:35.246 INFO:tasks.ceph.osd.0.plana44.stderr: 3: (ceph::HeartbeatMap::is_healthy()+0xb6) [0x7fe955ed6236]
2015-11-09T01:03:35.246 INFO:tasks.ceph.osd.0.plana44.stderr: 4: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x7fe955ed69f7]
2015-11-09T01:03:35.246 INFO:tasks.ceph.osd.0.plana44.stderr: 5: (CephContextServiceThread::entry()+0x14b) [0x7fe955fab0cb]
2015-11-09T01:03:35.246 INFO:tasks.ceph.osd.0.plana44.stderr: 6: (()+0x8182) [0x7fe9545af182]
2015-11-09T01:03:35.246 INFO:tasks.ceph.osd.0.plana44.stderr: 7: (clone()+0x6d) [0x7fe9528f738d]
2015-11-09T01:03:35.247 INFO:tasks.ceph.osd.0.plana44.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Files

backtraces.txt (87.6 KB) backtraces.txt Samuel Just, 11/10/2015 05:51 PM
Actions #1

Updated by Yuri Weinstein over 8 years ago

  • Subject changed from "hit suicide timeout" in ados-infernalis-distro-basic-multi to "hit suicide timeout" in rados-infernalis-distro-basic-multi
Actions #2

Updated by Samuel Just over 8 years ago

Looks deadlocky:

Thread 104 (Thread 0x7fe91ce79700 (LWP 7571)):
#0 lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fe9545b1657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007fe9545b1480 in __GI
_pthread_mutex_lock (mutex=0x7fe95bc21160) at ../nptl/pthread_mutex_lock.c:79
#3 0x00007fe955f42a08 in Mutex::Lock (this=this@entry=0x7fe95bc21150, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:102
#4 0x00007fe955f672d3 in Locker (m=..., this=<synthetic pointer>) at ./common/Mutex.h:115
#5 is_connected (this=0x7fe95bc1d000) at msg/simple/Pipe.h:144
#6 SimpleMessenger::is_connected (this=<optimized out>, con=<optimized out>) at msg/simple/SimpleMessenger.cc:278
#7 0x00007fe9559d5fd6 in OSD::op_is_discardable (op=op@entry=0x7fe95b39ec00) at osd/OSD.cc:8286
#8 0x00007fe955a0966a in OSD::handle_op (this=this@entry=0x7fe95a960000, op=..., osdmap=...) at osd/OSD.cc:8092
#9 0x00007fe955a0ae7e in OSD::dispatch_op_fast (this=this@entry=0x7fe95a960000, op=..., osdmap=...) at osd/OSD.cc:5840
#10 0x00007fe955a0b138 in OSD::dispatch_session_waiting (this=this@entry=0x7fe95a960000, session=session@entry=0x7fe95ab03400, osdmap=...) at osd/OSD.cc:5486
#11 0x00007fe955a0b483 in OSD::ms_fast_dispatch (this=0x7fe95a960000, m=<optimized out>) at osd/OSD.cc:5596
#12 0x00007fe956044466 in ms_fast_dispatch (m=0x7fe95b39ec00, this=0x7fe95a862000) at ./msg/Messenger.h:554
#13 DispatchQueue::fast_dispatch (this=0x7fe95a8621c8, m=m@entry=0x7fe95b39ec00) at msg/simple/DispatchQueue.cc:71
#14 0x00007fe956060ae2 in Pipe::DelayedDelivery::entry (this=0x7fe95aacdde0) at msg/simple/Pipe.cc:221
#15 0x00007fe9545af182 in start_thread (arg=0x7fe91ce79700) at pthread_create.c:312
#16 0x00007fe9528f738d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 79 (Thread 0x7fe931cd9700 (LWP 5293)):
#0 lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fe9545b1657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007fe9545b1480 in __GI
_pthread_mutex_lock (mutex=0x7fe95ab7ecb8) at ../nptl/pthread_mutex_lock.c:79
#3 0x00007fe955f42a08 in Mutex::Lock (this=this@entry=0x7fe95ab7eca8, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:102
#4 0x00007fe955b0dd00 in PG::lock (this=0x7fe95ab7e800, no_lockdep=no_lockdep@entry=false) at osd/PG.cc:248
#5 0x00007fe955c1e4b8 in C_Copyfrom::finish (this=0x7fe95b6fed90, r=0) at osd/ReplicatedPG.cc:6527
#6 0x00007fe955a281a9 in Context::complete (this=0x7fe95b6fed90, r=<optimized out>) at ./include/Context.h:64
#7 0x00007fe955ec6328 in Finisher::finisher_thread_entry (this=0x7fe95a962128) at common/Finisher.cc:65
#8 0x00007fe9545af182 in start_thread (arg=0x7fe931cd9700) at pthread_create.c:312
#9 0x00007fe9528f738d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 56 (Thread 0x7fe935ce1700 (LWP 5278)):
#0 lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fe9545b1657 in L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007fe9545b1480 in __GI
_pthread_mutex_lock (mutex=0x7fe95bc21160) at ../nptl/pthread_mutex_lock.c:79
#3 0x00007fe955f42a08 in Mutex::Lock (this=this@entry=0x7fe95bc21150, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:102
#4 0x00007fe955f672d3 in Locker (m=..., this=<synthetic pointer>) at ./common/Mutex.h:115
#5 is_connected (this=0x7fe95bc1d000) at msg/simple/Pipe.h:144
#6 SimpleMessenger::is_connected (this=<optimized out>, con=<optimized out>) at msg/simple/SimpleMessenger.cc:278
#7 0x00007fe9559d5fd6 in OSD::op_is_discardable (op=op@entry=0x7fe95af3eec0) at osd/OSD.cc:8286
#8 0x00007fe955b1f489 in PG::can_discard_op (this=0x7fe95ab7e800, op=...) at osd/PG.cc:5051
#9 0x00007fe955b1fcb5 in PG::can_discard_request (this=this@entry=0x7fe95ab7e800, op=...) at osd/PG.cc:5153
#10 0x00007fe955b8b3a4 in ReplicatedPG::do_request (this=0x7fe95ab7e800, op=..., handle=...) at osd/ReplicatedPG.cc:1359
#11 0x00007fe9559fe43d in OSD::dequeue_op (this=0x7fe95a960000, pg=..., op=..., handle=...) at osd/OSD.cc:8478
#12 0x00007fe9559fe65d in PGQueueable::RunVis::operator() (this=this@entry=0x7fe935ce06e0, op=...) at osd/OSD.cc:158
#13 0x00007fe955a23614 in internal_visit<std::shared_ptr<OpRequest> > (operand=..., this=<synthetic pointer>) at /usr/include/boost/variant/variant.hpp:1017
#14 visitation_impl_invoke_impl<boost::detail::variant::invoke_visitor<PGQueueable::RunVis>, void*, std::shared_ptr<OpRequest> > (storage=0x7fe935ce08d0, visitor=<synthetic pointer>)
at /usr/include/boost/variant/detail/visitation_impl.hpp:130
#15 visitation_impl_invoke<boost::detail::variant::invoke_visitor<PGQueueable::RunVis>, void*, std::shared_ptr<OpRequest>, boost::variant<std::shared_ptr<OpRequest>, PGSnapTrim, PGScrub>::has_fallback_type
> (internal_which=<optimized out>, t=0x0, storage=0x7fe935ce08d0, visitor=<synthetic pointer>) at /usr/include/boost/variant/detail/visitation_impl.hpp:173
#16 visitation_impl<mpl_::int_<0>, boost::detail::variant::visitation_impl_step<boost::mpl::l_iter<boost::mpl::l_item<mpl_::long_<3l>, std::shared_ptr<OpRequest>, boost::mpl::l_item<mpl_::long_<2l>, PGSnapTrim, boost::mpl::l_item<mpl_::long_<1l>, PGScrub, boost::mpl::l_end> > > >, boost::mpl::l_iter<boost::mpl::l_end> >, boost::detail::variant::invoke_visitor<PGQueueable::RunVis>, void*, boost::variant<std::shared_ptr<OpRequest>, PGSnapTrim, PGScrub>::has_fallback_type_> (no_backup_flag=..., storage=0x7fe935ce08d0, visitor=<synthetic pointer>, logical_which=<optimized out>,
internal_which=<optimized out>) at /usr/include/boost/variant/detail/visitation_impl.hpp:256
#17 internal_apply_visitor_impl<boost::detail::variant::invoke_visitor<PGQueueable::RunVis>, void*> (storage=0x7fe935ce08d0, visitor=<synthetic pointer>, logical_which=<optimized out>,
internal_which=<optimized out>) at /usr/include/boost/variant/variant.hpp:2326
#18 internal_apply_visitor<boost::detail::variant::invoke_visitor<PGQueueable::RunVis> > (visitor=<synthetic pointer>, this=0x7fe935ce08c8) at /usr/include/boost/variant/variant.hpp:2337
#19 apply_visitor<PGQueueable::RunVis> (visitor=..., this=0x7fe935ce08c8) at /usr/include/boost/variant/variant.hpp:2360
#20 apply_visitor<PGQueueable::RunVis, boost::variant<std::shared_ptr<OpRequest>, PGSnapTrim, PGScrub> > (visitable=..., visitor=...) at /usr/include/boost/variant/detail/apply_visitor_unary.hpp:60
#21 run (handle=..., pg=..., osd=<optimized out>, this=0x7fe935ce08c8) at osd/OSD.h:379
#22 OSD::ShardedOpWQ::_process (this=0x7fe95a961140, thread_index=<optimized out>, hb=<optimized out>) at osd/OSD.cc:8366
#23 0x00007fe955f803bf in ShardedThreadPool::shardedthreadpool_worker (this=0x7fe95a960758, thread_index=<optimized out>) at common/WorkQueue.cc:340
#24 0x00007fe955f822c0 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at common/WorkQueue.h:592
#25 0x00007fe9545af182 in start_thread (arg=0x7fe935ce1700) at pthread_create.c:312
#26 0x00007fe9528f738d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 47 (Thread 0x7fe9434fc700 (LWP 5207)):
#0 lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fe9545b1657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007fe9545b1480 in __GI
_pthread_mutex_lock (mutex=0x7fe95a960028) at ../nptl/pthread_mutex_lock.c:79
#3 0x00007fe955f42a08 in Mutex::Lock (this=this@entry=0x7fe95a960018, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:102
#4 0x00007fe955a22a92 in OSD::ms_dispatch (this=0x7fe95a960000, m=0x7fe95aff2800) at osd/OSD.cc:5456
#5 0x00007fe9560458ea in ms_deliver_dispatch (m=0x7fe95aff2800, this=0x7fe95a862000) at ./msg/Messenger.h:583
#6 DispatchQueue::entry (this=0x7fe95a8621c8) at msg/simple/DispatchQueue.cc:185
#7 0x00007fe955f701ed in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/simple/DispatchQueue.h:103
#8 0x00007fe9545af182 in start_thread (arg=0x7fe9434fc700) at pthread_create.c:312
#9 0x00007fe9528f738d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

#2 0x00007fe9545b1480 in GI_pthread_mutex_lock (mutex=0x7fe95ab7ecb8) at ../nptl/pthread_mutex_lock.c:79
#3 0x00007fe955f42a08 in Mutex::Lock (this=this@entry=0x7fe95ab7eca8, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:102
#4 0x00007fe955b0dd00 in PG::lock (this=this@entry=0x7fe95ab7e800, no_lockdep=no_lockdep@entry=false) at osd/PG.cc:248
#5 0x00007fe955a14a1b in OSD::consume_map (this=this@entry=0x7fe95a960000) at osd/OSD.cc:6754
#6 0x00007fe955a18c30 in OSD::handle_osd_map (this=this@entry=0x7fe95a960000, m=m@entry=0x7fe95b6978c0) at osd/OSD.cc:6531
#7 0x00007fe955a22621 in OSD::_dispatch (this=this@entry=0x7fe95a960000, m=m@entry=0x7fe95b6978c0) at osd/OSD.cc:5910
#8 0x00007fe955a22c6f in OSD::ms_dispatch (this=0x7fe95a960000, m=0x7fe95b6978c0) at osd/OSD.cc:5470
#9 0x00007fe9560458ea in ms_deliver_dispatch (m=0x7fe95b6978c0, this=0x7fe95a862e00) at ./msg/Messenger.h:583
#10 DispatchQueue::entry (this=0x7fe95a862fc8) at msg/simple/DispatchQueue.cc:185
#11 0x00007fe955f701ed in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/simple/DispatchQueue.h:103
#12 0x00007fe9545af182 in start_thread (arg=0x7fe941cf9700) at pthread_create.c:312
#13 0x00007fe9528f738d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:1

Thread 19 (Thread 0x7fe9384e6700 (LWP 5269)):
#0 lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fe9545b1657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007fe9545b1480 in __GI
_pthread_mutex_lock (mutex=0x7fe95ab7ecb8) at ../nptl/pthread_mutex_lock.c:79
#3 0x00007fe955f42a08 in Mutex::Lock (this=this@entry=0x7fe95ab7eca8, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:102
#4 0x00007fe955b0dd00 in PG::lock (this=this@entry=0x7fe95ab7e800, no_lockdep=no_lockdep@entry=false) at osd/PG.cc:248
#5 0x00007fe955b0de7e in PG::lock_suspend_timeout (this=0x7fe95ab7e800, handle=...) at osd/PG.cc:242
#6 0x00007fe955a230b7 in OSD::ShardedOpWQ::_process (this=0x7fe95a961140, thread_index=<optimized out>, hb=<optimized out>) at osd/OSD.cc:8328
#7 0x00007fe955f803bf in ShardedThreadPool::shardedthreadpool_worker (this=0x7fe95a960758, thread_index=<optimized out>) at common/WorkQueue.cc:340
#8 0x00007fe955f822c0 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at common/WorkQueue.h:592
#9 0x00007fe9545af182 in start_thread (arg=0x7fe9384e6700) at pthread_create.c:312
#10 0x00007fe9528f738d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Actions #3

Updated by Samuel Just over 8 years ago

Here are all of the backtraces

Actions #4

Updated by Samuel Just over 8 years ago

  • Subject changed from "hit suicide timeout" in rados-infernalis-distro-basic-multi to possible osd messenger deadlock "hit suicide timeout" in rados-infernalis-distro-basic-multi
Actions #5

Updated by Samuel Just over 7 years ago

  • Status changed from New to Can't reproduce
Actions

Also available in: Atom PDF