Project

General

Profile

Actions

Bug #15412

closed

msg/simple: deadlock in delayed delivery

Added by Sage Weil about 8 years ago. Updated over 4 years ago.

Status:
Won't Fix
Priority:
High
Assignee:
-
Category:
SimpleMessenger
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f9dc807f50f in Wait (mutex=..., this=0x7f9dd2bc7e18) at ./common/Cond.h:56
#2  Pipe::DelayedDelivery::stop_fast_dispatching (this=0x7f9dd2bc7d40) at msg/simple/Pipe.cc:249
#3  0x00007f9dc808064c in Pipe::stop_and_wait (this=this@entry=0x7f9dd37a0000) at msg/simple/Pipe.cc:1484
#4  0x00007f9dc808e578 in Pipe::accept (this=this@entry=0x7f9dd3dfc000) at msg/simple/Pipe.cc:688
#5  0x00007f9dc8091ddf in Pipe::reader (this=0x7f9dd3dfc000) at msg/simple/Pipe.cc:1499
#6  0x00007f9dc809988d in Pipe::Reader::entry (this=<optimized out>) at msg/simple/Pipe.h:52
#7  0x00007f9dc6260182 in start_thread (arg=0x7f9da0555700) at pthread_create.c:312
#8  0x00007f9dc438e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

 waiting for fast dispatch to finish...
 holds pipe_lock
 holds delay_lock

Thread 62 (Thread 0x7f9d9ef4d700 (LWP 29271)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f9dc6262657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f9dc6262480 in __GI___pthread_mutex_lock (mutex=0x7f9dd37a1180) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f9dc7f4d7f8 in Mutex::Lock (this=this@entry=0x7f9dd37a1170, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:110
#4  0x00007f9dc7f73ea6 in Locker (m=..., this=<synthetic pointer>) at ./common/Mutex.h:115
#5  is_connected (this=0x7f9dd37a0000) at msg/simple/Pipe.h:143
#6  SimpleMessenger::is_connected (this=<optimized out>, con=<optimized out>) at msg/simple/SimpleMessenger.cc:282
#7  0x00007f9dc791f236 in OSD::op_is_discardable (op=op@entry=0x7f9dd5180c00) at osd/OSD.cc:8568
#8  0x00007f9dc7952aba in OSD::handle_op (this=this@entry=0x7f9dd2860000, op=..., osdmap=...) at osd/OSD.cc:8418
#9  0x00007f9dc7953cde in OSD::dispatch_op_fast (this=this@entry=0x7f9dd2860000, op=..., osdmap=...) at osd/OSD.cc:6113
#10 0x00007f9dc7953fd8 in OSD::dispatch_session_waiting (this=this@entry=0x7f9dd2860000, session=session@entry=0x7f9dd37e8000, osdmap=...) at osd/OSD.cc:5751
#11 0x00007f9dc7954324 in OSD::ms_fast_dispatch (this=0x7f9dd2860000, m=<optimized out>) at osd/OSD.cc:5861
#12 0x00007f9dc8056d86 in ms_fast_dispatch (m=0x7f9dd5180c00, this=0x7f9dd26c2000) at ./msg/Messenger.h:555
#13 DispatchQueue::fast_dispatch (this=0x7f9dd26c21d0, m=m@entry=0x7f9dd5180c00) at msg/simple/DispatchQueue.cc:71
#14 0x00007f9dc8087d22 in Pipe::DelayedDelivery::entry (this=0x7f9dd2bc7d40) at msg/simple/Pipe.cc:226
#15 0x00007f9dc6260182 in start_thread (arg=0x7f9d9ef4d700) at pthread_create.c:312
#16 0x00007f9dc438e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

  waiting on pipe_lock
  holding session_dispatch_lock

/a/sage-2016-04-06_19:36:52-rados-wip-sage-testing-distro-basic-smithi/112908

Actions #1

Updated by Sage Weil about 8 years ago

Welcome to the Ceph Etherpad!

(gdb) thr app all bt

Thread 63 (Thread 0x7f9dab572700 (LWP 29010)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f9dc6262657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f9dc6262480 in __GI___pthread_mutex_lock (mutex=0x7f9dd29e8530) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f9dc7f4d7f8 in Mutex::Lock (this=this@entry=0x7f9dd29e8520, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:110
#4  0x00007f9dc79d78c0 in PG::lock (this=this@entry=0x7f9dd29e8000, no_lockdep=no_lockdep@entry=false) at osd/PG.cc:262
#5  0x00007f9dc79d7a4e in PG::lock_suspend_timeout (this=0x7f9dd29e8000, handle=...) at osd/PG.cc:256
#6  0x00007f9dc794a6be in OSD::ShardedOpWQ::_process (this=0x7f9dd28611e8, thread_index=<optimized out>, hb=<optimized out>) at osd/OSD.cc:8610
#7  0x00007f9dc7f8e187 in ShardedThreadPool::shardedthreadpool_worker (this=0x7f9dd2860768, thread_index=<optimized out>) at common/WorkQueue.cc:340
#8  0x00007f9dc7f900b0 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at common/WorkQueue.h:679
#9  0x00007f9dc6260182 in start_thread (arg=0x7f9dab572700) at pthread_create.c:312
#10 0x00007f9dc438e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

waiting on pg lock

Thread 62 (Thread 0x7f9d9ef4d700 (LWP 29271)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f9dc6262657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f9dc6262480 in __GI___pthread_mutex_lock (mutex=0x7f9dd37a1180) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f9dc7f4d7f8 in Mutex::Lock (this=this@entry=0x7f9dd37a1170, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:110
#4  0x00007f9dc7f73ea6 in Locker (m=..., this=<synthetic pointer>) at ./common/Mutex.h:115
#5  is_connected (this=0x7f9dd37a0000) at msg/simple/Pipe.h:143
#6  SimpleMessenger::is_connected (this=<optimized out>, con=<optimized out>) at msg/simple/SimpleMessenger.cc:282
#7  0x00007f9dc791f236 in OSD::op_is_discardable (op=op@entry=0x7f9dd5180c00) at osd/OSD.cc:8568
#8  0x00007f9dc7952aba in OSD::handle_op (this=this@entry=0x7f9dd2860000, op=..., osdmap=...) at osd/OSD.cc:8418
#9  0x00007f9dc7953cde in OSD::dispatch_op_fast (this=this@entry=0x7f9dd2860000, op=..., osdmap=...) at osd/OSD.cc:6113
#10 0x00007f9dc7953fd8 in OSD::dispatch_session_waiting (this=this@entry=0x7f9dd2860000, session=session@entry=0x7f9dd37e8000, osdmap=...) at osd/OSD.cc:5751
#11 0x00007f9dc7954324 in OSD::ms_fast_dispatch (this=0x7f9dd2860000, m=<optimized out>) at osd/OSD.cc:5861
#12 0x00007f9dc8056d86 in ms_fast_dispatch (m=0x7f9dd5180c00, this=0x7f9dd26c2000) at ./msg/Messenger.h:555
#13 DispatchQueue::fast_dispatch (this=0x7f9dd26c21d0, m=m@entry=0x7f9dd5180c00) at msg/simple/DispatchQueue.cc:71
#14 0x00007f9dc8087d22 in Pipe::DelayedDelivery::entry (this=0x7f9dd2bc7d40) at msg/simple/Pipe.cc:226
#15 0x00007f9dc6260182 in start_thread (arg=0x7f9d9ef4d700) at pthread_create.c:312
#16 0x00007f9dc438e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

waiting on pipe_lock
holding session_dispatch_lock

Thread 45 (Thread 0x7f9da8d6d700 (LWP 29015)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f9dc6262657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f9dc6262480 in __GI___pthread_mutex_lock (mutex=0x7f9dd37a1180) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f9dc7f4d7f8 in Mutex::Lock (this=0x7f9dd37a1170, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:110
#4  0x00007f9dc7f7756c in SimpleMessenger::submit_message (this=this@entry=0x7f9dd26c2000, m=m@entry=0x7f9dd365d340, con=con@entry=0x7f9dd2ac2480, dest_addr=..., dest_type=4, already_locked=already_locked@entry=false) at msg/simple/SimpleMessenger.cc:458
#5  0x00007f9dc7f78527 in SimpleMessenger::_send_message (this=0x7f9dd26c2000, m=0x7f9dd365d340, con=0x7f9dd2ac2480) at msg/simple/SimpleMessenger.cc:142
#6  0x00007f9dc79219d4 in OSDService::reply_op_error (this=<optimized out>, op=..., err=err@entry=0, v=..., uv=408) at osd/OSD.cc:1376
#7  0x00007f9dc7a8435d in ReplicatedPG::eval_repop (this=this@entry=0x7f9dd29e8000, repop=repop@entry=0x7f9dd2d785a0) at osd/ReplicatedPG.cc:8343
#8  0x00007f9dc7a849a1 in ReplicatedPG::repop_all_committed (this=0x7f9dd29e8000, repop=0x7f9dd2d785a0) at osd/ReplicatedPG.cc:8274
#9  0x00007f9dc7972c09 in Context::complete (this=0x7f9dd3273fa0, r=<optimized out>) at ./include/Context.h:64
#10 0x00007f9dc7b16ba6 in ReplicatedBackend::sub_op_modify_reply (this=this@entry=0x7f9dd29a6580, op=...) at osd/ReplicatedBackend.cc:727
#11 0x00007f9dc7b27b83 in ReplicatedBackend::handle_message (this=0x7f9dd29a6580, op=...) at osd/ReplicatedBackend.cc:216
#12 0x00007f9dc7a850cd in ReplicatedPG::do_request (this=0x7f9dd29e8000, op=..., handle=...) at osd/ReplicatedPG.cc:1474
#13 0x00007f9dc7946015 in OSD::dequeue_op (this=0x7f9dd2860000, pg=..., op=..., handle=...) at osd/OSD.cc:8767
#14 0x00007f9dc794623d in PGQueueable::RunVis::operator() (this=this@entry=0x7f9da8d6b820, op=...) at osd/OSD.cc:163
#15 0x00007f9dc794ac49 in internal_visit<std::shared_ptr<OpRequest> > (operand=..., this=<synthetic pointer>) at /usr/include/boost/variant/variant.hpp:1017
#16 visitation_impl_invoke_impl<boost::detail::variant::invoke_visitor<PGQueueable::RunVis>, void*, std::shared_ptr<OpRequest> > (storage=0x7f9da8d6ba10, visitor=<synthetic pointer>) at /usr/include/boost/variant/detail/visitation_impl.hpp:130
#17 visitation_impl_invoke<boost::detail::variant::invoke_visitor<PGQueueable::RunVis>, void*, std::shared_ptr<OpRequest>, boost::variant<std::shared_ptr<OpRequest>, PGSnapTrim, PGScrub>::has_fallback_type_> (internal_which=<optimized out>, t=0x0, storage=0x7f9da8d6ba10, visitor=<synthetic pointer>)
    at /usr/include/boost/variant/detail/visitation_impl.hpp:173
#18 visitation_impl<mpl_::int_<0>, boost::detail::variant::visitation_impl_step<boost::mpl::l_iter<boost::mpl::l_item<mpl_::long_<3l>, std::shared_ptr<OpRequest>, boost::mpl::l_item<mpl_::long_<2l>, PGSnapTrim, boost::mpl::l_item<mpl_::long_<1l>, PGScrub, boost::mpl::l_end> > > >, boost::mpl::l_iter<boost::mpl::l_end> >, boost::detail::variant::invoke_visitor<PGQueueable::RunVis>, void*, boost::variant<std::shared_ptr<OpRequest>, PGSnapTrim, PGScrub>::has_fallback_type_> (no_backup_flag=..., storage=0x7f9da8d6ba10, visitor=<synthetic pointer>, logical_which=<optimized out>, internal_which=<optimized out>)
    at /usr/include/boost/variant/detail/visitation_impl.hpp:256
#19 internal_apply_visitor_impl<boost::detail::variant::invoke_visitor<PGQueueable::RunVis>, void*> (storage=0x7f9da8d6ba10, visitor=<synthetic pointer>, logical_which=<optimized out>, internal_which=<optimized out>) at /usr/include/boost/variant/variant.hpp:2326
#20 internal_apply_visitor<boost::detail::variant::invoke_visitor<PGQueueable::RunVis> > (visitor=<synthetic pointer>, this=0x7f9da8d6ba08) at /usr/include/boost/variant/variant.hpp:2337
#21 apply_visitor<PGQueueable::RunVis> (visitor=..., this=0x7f9da8d6ba08) at /usr/include/boost/variant/variant.hpp:2360
#22 apply_visitor<PGQueueable::RunVis, boost::variant<std::shared_ptr<OpRequest>, PGSnapTrim, PGScrub> > (visitable=..., visitor=...) at /usr/include/boost/variant/detail/apply_visitor_unary.hpp:60
#23 run (handle=..., pg=..., osd=<optimized out>, this=0x7f9da8d6ba08) at osd/OSD.h:392
#24 OSD::ShardedOpWQ::_process (this=0x7f9dd28611e8, thread_index=<optimized out>, hb=<optimized out>) at osd/OSD.cc:8648
#25 0x00007f9dc7f8e187 in ShardedThreadPool::shardedthreadpool_worker (this=0x7f9dd2860768, thread_index=<optimized out>) at common/WorkQueue.cc:340
#26 0x00007f9dc7f900b0 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at common/WorkQueue.h:679
#27 0x00007f9dc6260182 in start_thread (arg=0x7f9da8d6d700) at pthread_create.c:312
#28 0x00007f9dc438e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

waiting on pipe_lock
holding pg lock

(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f9dc807f50f in Wait (mutex=..., this=0x7f9dd2bc7e18) at ./common/Cond.h:56
#2  Pipe::DelayedDelivery::stop_fast_dispatching (this=0x7f9dd2bc7d40) at msg/simple/Pipe.cc:249
#3  0x00007f9dc808064c in Pipe::stop_and_wait (this=this@entry=0x7f9dd37a0000) at msg/simple/Pipe.cc:1484
#4  0x00007f9dc808e578 in Pipe::accept (this=this@entry=0x7f9dd3dfc000) at msg/simple/Pipe.cc:688
#5  0x00007f9dc8091ddf in Pipe::reader (this=0x7f9dd3dfc000) at msg/simple/Pipe.cc:1499
#6  0x00007f9dc809988d in Pipe::Reader::entry (this=<optimized out>) at msg/simple/Pipe.h:52
#7  0x00007f9dc6260182 in start_thread (arg=0x7f9da0555700) at pthread_create.c:312
#8  0x00007f9dc438e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

waiting for dispatch ...
holds pipe_lock
holds delay_lock

Thread 24 (Thread 0x7f9db7f0c700 (LWP 28988)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f9dc6262657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f9dc6262480 in __GI___pthread_mutex_lock (mutex=0x7f9dd29e8530) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f9dc7f4d7f8 in Mutex::Lock (this=this@entry=0x7f9dd29e8520, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:110
#4  0x00007f9dc79d78c0 in PG::lock (this=this@entry=0x7f9dd29e8000, no_lockdep=no_lockdep@entry=false) at osd/PG.cc:262
#5  0x00007f9dc795f4ce in OSD::consume_map (this=this@entry=0x7f9dd2860000) at osd/OSD.cc:7128
#6  0x00007f9dc796297e in OSD::_committed_osd_maps (this=0x7f9dd2860000, first=<optimized out>, last=20, m=0x7f9dd3be8400) at osd/OSD.cc:6939
#7  0x00007f9dc7972c09 in Context::complete (this=0x7f9dd3275de0, r=<optimized out>) at ./include/Context.h:64
#8  0x00007f9dc7ed0146 in Finisher::finisher_thread_entry (this=0x7f9dd26b2160) at common/Finisher.cc:68
#9  0x00007f9dc6260182 in start_thread (arg=0x7f9db7f0c700) at pthread_create.c:312
#10 0x00007f9dc438e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

waiting on pg lock
holding osd_lock

Thread 23 (Thread 0x7f9dbc715700 (LWP 28977)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f9dc6262657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f9dc6262480 in __GI___pthread_mutex_lock (mutex=0x7f9dd26c24c0) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f9dc7f4d7f8 in Mutex::Lock (this=this@entry=0x7f9dd26c24b0, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:110
#4  0x00007f9dc7f74381 in SimpleMessenger::mark_down (this=0x7f9dd26c2000, con=0x7f9dd266a100) at msg/simple/SimpleMessenger.cc:655
#5  0x00007f9dc7f3e2ae in MonClient::_reopen_session (this=this@entry=0x7ffcc939dc90, rank=rank@entry=-1, name=...) at mon/MonClient.cc:617
#6  0x00007f9dc793fddb in _reopen_session (this=0x7ffcc939dc90) at ./mon/MonClient.h:191
#7  reopen_session (cb=0x0, this=0x7ffcc939dc90) at ./mon/MonClient.h:346
#8  OSD::tick_without_osd_lock (this=0x7f9dd2860000) at osd/OSD.cc:4278
#9  0x00007f9dc7972c09 in Context::complete (this=0x7f9dd25d4e00, r=<optimized out>) at ./include/Context.h:64
#10 0x00007f9dc7f8721c in SafeTimer::timer_thread (this=0x7f9dd2860190) at common/Timer.cc:105
#11 0x00007f9dc7f88bbd in SafeTimerThread::entry (this=<optimized out>) at common/Timer.cc:38
#12 0x00007f9dc6260182 in start_thread (arg=0x7f9dbc715700) at pthread_create.c:312
#13 0x00007f9dc438e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

waiting on msgr lock
holds monc

Thread 22 (Thread 0x7f9db1d7f700 (LWP 28997)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f9dc6262657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f9dc6262480 in __GI___pthread_mutex_lock (mutex=0x7f9dd26c24c0) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f9dc7f4d7f8 in Mutex::Lock (this=this@entry=0x7f9dd26c24b0, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:110
#4  0x00007f9dc7f76911 in SimpleMessenger::add_accept_pipe (this=0x7f9dd26c2000, sd=sd@entry=150) at msg/simple/SimpleMessenger.cc:342
#5  0x00007f9dc803a04b in Accepter::entry (this=0x7f9dd26c2190) at msg/simple/Accepter.cc:246
#6  0x00007f9dc6260182 in start_thread (arg=0x7f9db1d7f700) at pthread_create.c:312
---Type <return> to continue, or q <return> to quit---
#7  0x00007f9dc438e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

waiting on msgr lock

Thread 12 (Thread 0x7f9db2d81700 (LWP 28995)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f9dc6262657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f9dc6262480 in __GI___pthread_mutex_lock (mutex=0x7f9dd37e80f0) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f9dc7f4d7f8 in Mutex::Lock (this=this@entry=0x7f9dd37e80e0, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:110
#4  0x00007f9dc7926b9e in Locker (m=..., this=<synthetic pointer>) at ./common/Mutex.h:115
#5  session_handle_reset (session=0x7f9dd37e8000, this=0x7f9dd2860000) at osd/OSD.h:1435
#6  OSD::ms_handle_reset (this=0x7f9dd2860000, con=0x7f9dd2ac2480) at osd/OSD.cc:4667
#7  0x00007f9dc805861e in ms_deliver_handle_reset (con=0x7f9dd2ac2480, this=0x7f9dd26c2000) at ./msg/Messenger.h:657
#8  DispatchQueue::entry (this=0x7f9dd26c21d0) at msg/simple/DispatchQueue.cc:164
#9  0x00007f9dc7f7a52d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/simple/DispatchQueue.h:103
#10 0x00007f9dc6260182 in start_thread (arg=0x7f9db2d81700) at pthread_create.c:312
---Type <return> to continue, or q <return> to quit---
#11 0x00007f9dc438e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

waiting on session dispatch lock
holding ?

Thread 7 (Thread 0x7f9dbe719700 (LWP 28973)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f9dc6262657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f9dc6262480 in __GI___pthread_mutex_lock (mutex=0x7f9dd2860028) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f9dc7f4d7f8 in Mutex::Lock (this=this@entry=0x7f9dd2860018, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:110
#4  0x00007f9dc796b4b2 in OSD::ms_dispatch (this=0x7f9dd2860000, m=0x7f9dd3602880) at osd/OSD.cc:5721
#5  0x00007f9dc80bae27 in Messenger::ms_deliver_dispatch (this=0x7f9dd26ca000, m=0x7f9dd3602880) at ./msg/Messenger.h:584
#6  0x00007f9dc80bb581 in C_handle_dispatch::do_request (this=0x7f9dd3273f80, id=<optimized out>) at msg/async/AsyncConnection.cc:109
#7  0x00007f9dc805ce1d in EventCenter::process_events (this=this@entry=0x7f9dd26c6cc8, timeout_microseconds=timeout_microseconds@entry=30000000) at msg/async/Event.cc:431
#8  0x00007f9dc803da60 in Worker::entry (this=0x7f9dd26c6c80) at msg/async/AsyncMessenger.cc:294
#9  0x00007f9dc6260182 in start_thread (arg=0x7f9dbe719700) at pthread_create.c:312
#10 0x00007f9dc438e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

waiting on osd_lock

Thread 5 (Thread 0x7f9dbef1a700 (LWP 28972)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f9dc6262657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f9dc6262480 in __GI___pthread_mutex_lock (mutex=0x7f9dd2860028) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f9dc7f4d7f8 in Mutex::Lock (this=this@entry=0x7f9dd2860018, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:110
#4  0x00007f9dc796b4b2 in OSD::ms_dispatch (this=0x7f9dd2860000, m=0x7f9dd408f680) at osd/OSD.cc:5721
#5  0x00007f9dc80bae27 in Messenger::ms_deliver_dispatch (this=0x7f9dd26ca000, m=0x7f9dd408f680) at ./msg/Messenger.h:584
#6  0x00007f9dc80bb581 in C_handle_dispatch::do_request (this=0x7f9dd41654c0, id=<optimized out>) at msg/async/AsyncConnection.cc:109
#7  0x00007f9dc805ce1d in EventCenter::process_events (this=this@entry=0x7f9dd26c62c8, timeout_microseconds=timeout_microseconds@entry=30000000) at msg/async/Event.cc:431
#8  0x00007f9dc803da60 in Worker::entry (this=0x7f9dd26c6280) at msg/async/AsyncMessenger.cc:294
#9  0x00007f9dc6260182 in start_thread (arg=0x7f9dbef1a700) at pthread_create.c:312
#10 0x00007f9dc438e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

waiting on osd_lock

Thread 4 (Thread 0x7f9dbf71b700 (LWP 28971)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f9dc6262657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f9dc6262480 in __GI___pthread_mutex_lock (mutex=0x7f9dd2860028) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f9dc7f4d7f8 in Mutex::Lock (this=this@entry=0x7f9dd2860018, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:110
#4  0x00007f9dc796b4b2 in OSD::ms_dispatch (this=0x7f9dd2860000, m=0x7f9dd3bc1cc0) at osd/OSD.cc:5721
#5  0x00007f9dc80bae27 in Messenger::ms_deliver_dispatch (this=0x7f9dd26ca000, m=0x7f9dd3bc1cc0) at ./msg/Messenger.h:584
#6  0x00007f9dc80bb581 in C_handle_dispatch::do_request (this=0x7f9dd29307a0, id=<optimized out>) at msg/async/AsyncConnection.cc:109
#7  0x00007f9dc805ce1d in EventCenter::process_events (this=this@entry=0x7f9dd26c6548, timeout_microseconds=timeout_microseconds@entry=30000000) at msg/async/Event.cc:431
#8  0x00007f9dc803da60 in Worker::entry (this=0x7f9dd26c6500) at msg/async/AsyncMessenger.cc:294
#9  0x00007f9dc6260182 in start_thread (arg=0x7f9dbf71b700) at pthread_create.c:312
#10 0x00007f9dc438e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

waiting on osd_lock

Thread 1 (Thread 0x7f9dc2282700 (LWP 28967)):
#0  0x00007f9dc626820b in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00007f9dc7ea6de5 in reraise_fatal (signum=6) at global/signal_handler.cc:71
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:133
#3  <signal handler called>
#4  0x00007f9dc42cacc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#5  0x00007f9dc42ce0d8 in __GI_abort () at abort.c:89
#6  0x00007f9dc7f9de65 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7f9dc816b8c7 "0 == \"hit suicide timeout\"", file=file@entry=0x7f9dc816b83e "common/HeartbeatMap.cc", line=line@entry=86, 
    func=func@entry=0x7f9dc816ba80 <ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)::__PRETTY_FUNCTION__> "bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)") at common/assert.cc:78
#7  0x00007f9dc7ee15a1 in ceph::HeartbeatMap::_check (this=this@entry=0x7f9dd25fc7e0, h=<optimized out>, who=who@entry=0x7f9dc816b956 "is_healthy", now=now@entry=1460015325) at common/HeartbeatMap.cc:86
#8  0x00007f9dc7ee1d66 in ceph::HeartbeatMap::is_healthy (this=this@entry=0x7f9dd25fc7e0) at common/HeartbeatMap.cc:139
#9  0x00007f9dc7ee2557 in ceph::HeartbeatMap::check_touch_file (this=0x7f9dd25fc7e0) at common/HeartbeatMap.cc:167
#10 0x00007f9dc7fb4e6b in CephContextServiceThread::entry (this=0x7f9dd276a000) at common/ceph_context.cc:111
#11 0x00007f9dc6260182 in start_thread (arg=0x7f9dc2282700) at pthread_create.c:312
#12 0x00007f9dc438e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Actions #2

Updated by Greg Farnum about 8 years ago

So this is in the replace code, when using delayed delivery. That makes sense...except I thought we'd already seen and fixed this exact bug. :/ (Also, note that it's not anything users can see, just us in testing.)

...okay, no, this doesn't make sense. The delay thread drops the delay_lock when fast_dispatch'ing, so stop_fast_dispatching() can execute. stop_and_wait() drops the pipe lock while waiting for the delay thread to shut down. The only way something gets blocked in stop_fast_dispatching() is if the delay thread gets blocked enqueueing a message for regular delivery in the Pipe, which I don't think should really be possible either.

Actions #3

Updated by Haomai Wang about 8 years ago

looks the fix should be we release pipe_lock before entering delay cond wait

Actions #4

Updated by Greg Farnum about 8 years ago

We can't drop the pipe_lock while going through the Pipe replacement process; the new one isn't fully set up and the old one isn't torn down. We could get another one trying to start up, or attempts to access one of these in their intermediate states.

Actions #5

Updated by Haomai Wang about 8 years ago

see the deadlock at(http://tracker.ceph.com/issues/15440). I don't have a perfect idea to solve this problem without too much invading. Because the lock dependence(we may own pipe_lock and pg_lock, and pg_lock then pipe_lock) can't be removed. One way is we modify delay impl, replace dedicated delay thread dispatch to reader thread delay dispatch.

Actions #6

Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to Messengers
  • Category deleted (msgr)
Actions #7

Updated by Greg Farnum about 5 years ago

  • Category set to SimpleMessenger
Actions #8

Updated by Sage Weil over 4 years ago

  • Status changed from 12 to Won't Fix

simplemessenger removed from code base

Actions

Also available in: Atom PDF