Bug #38537

mgr deadlock

Added by Sage Weil about 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: luminous, mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Thread 45 (Thread 0x7fbfa869b700 (LWP 1914003)):
#0  0x00007fbfd1667827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x149cf80) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x149cf80, abstime=0x0) at sem_waitcommon.c:111
#2  0x00007fbfd16678d4 in __new_sem_wait_slow (sem=0x149cf80, abstime=0x0) at sem_waitcommon.c:181
#3  0x00007fbfd166797a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
#4  0x00007fbfd1bbafe8 in PyThread_acquire_lock () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#5  0x00007fbfd1b8f926 in PyEval_RestoreThread () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#6  0x0000000000507408 in ActivePyModules::<lambda(const OSDMap&, const PGMap&)>::operator() (__closure=<optimized out>, __closure=<optimized out>, pg_map=..., osd_map=...) at /build/ceph-14.1.0-101-gdddb858/src/mgr/ActivePyModules.cc:333
#7  Objecter::with_osdmap<ActivePyModules::get_python(const string&)::<lambda(const OSDMap&, const PGMap&)>, const PGMap&> (cb=<optimized out>, this=<optimized out>) at /build/ceph-14.1.0-101-gdddb858/src/osdc/Objecter.h:2056
#8  ClusterState::with_osdmap_and_pgmap<ActivePyModules::get_python(const string&)::<lambda(const OSDMap&, const PGMap&)> > (cb=<optimized out>, this=0x4268368) at /build/ceph-14.1.0-101-gdddb858/src/mgr/ClusterState.h:138
#9  ActivePyModules::get_python (this=this@entry=0x1307de0, what=...) at /build/ceph-14.1.0-101-gdddb858/src/mgr/ActivePyModules.cc:329
#10 0x00000000005156e7 in ceph_state_get (self=<optimized out>, args=<optimized out>) at /build/ceph-14.1.0-101-gdddb858/src/mgr/BaseMgrModule.cc:344
#11 0x00007fbfd1b98971 in PyEval_EvalFrameEx () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#12 0x00007fbfd1b97044 in PyEval_EvalFrameEx () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#13 0x00007fbfd1b97044 in PyEval_EvalFrameEx () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#14 0x00007fbfd1b97044 in PyEval_EvalFrameEx () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0

i.e., it holds the ClusterState lock (and, via Objecter::with_osdmap in frame #7, the objecter rwlock for read) and is waiting for the GIL

Thread 26 (Thread 0x7fbfba1f9700 (LWP 1913869)):
#0  0x00007fbfd1664730 in futex_wait (private=<optimized out>, expected=2, futex_word=0x7ffebddb3acc) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=2, futex_word=0x7ffebddb3acc) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_rwlock_wrlock_slow (rwlock=0x7ffebddb3ac0) at pthread_rwlock_wrlock.c:67
#3  0x00007fbfd1664918 in __GI___pthread_rwlock_wrlock (rwlock=<optimized out>) at pthread_rwlock_wrlock.c:124
#4  0x00000000005d41b6 in std::__shared_mutex_pthread::lock (this=<optimized out>) at /usr/include/c++/7/shared_mutex:103
#5  std::shared_mutex::lock (this=<optimized out>) at /usr/include/c++/7/shared_mutex:329
#6  ceph::shunique_lock<std::shared_mutex>::lock (this=0x7fbfba1f5880) at /build/ceph-14.1.0-101-gdddb858/src/common/shunique_lock.h:157
#7  ceph::shunique_lock<std::shared_mutex>::shunique_lock (m=..., this=0x7fbfba1f5880) at /build/ceph-14.1.0-101-gdddb858/src/common/shunique_lock.h:65
#8  Objecter::submit_command (this=this@entry=0x7ffebddb39d8, c=c@entry=0xc234580, ptid=ptid@entry=0x7fbfba1f59b0) at /build/ceph-14.1.0-101-gdddb858/src/osdc/Objecter.cc:4751
#9  0x0000000000516d12 in Objecter::osd_command (onfinish=0x88d87e0, prs=<optimized out>, poutbl=0x88d8848, ptid=0x7fbfba1f59b0, inbl=..., cmd=..., osd=64, this=0x7ffebddb39d8) at /build/ceph-14.1.0-101-gdddb858/src/osdc/Objecter.h:2224
#10 ceph_send_command (self=<optimized out>, args=<optimized out>) at /build/ceph-14.1.0-101-gdddb858/src/mgr/BaseMgrModule.cc:178
#11 0x00007fbfd1b97772 in PyEval_EvalFrameEx () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#12 0x00007fbfd1cce05c in PyEval_EvalCodeEx () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#13 0x00007fbfd1b96f1d in PyEval_EvalFrameEx () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#14 0x00007fbfd1cce05c in PyEval_EvalCodeEx () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#15 0x00007fbfd1b96f1d in PyEval_EvalFrameEx () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#16 0x00007fbfd1b97044 in PyEval_EvalFrameEx () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#17 0x00007fbfd1cce05c in PyEval_EvalCodeEx () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#18 0x00007fbfd1c24370 in ?? () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#19 0x00007fbfd1bf7273 in PyObject_Call () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#20 0x00007fbfd1c6b3ac in ?? () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#21 0x00007fbfd1bf7273 in PyObject_Call () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#22 0x00007fbfd1bf8444 in PyObject_CallMethod () from target:/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
#23 0x00000000005ad99b in PyModuleRunner::serve (this=0x4472180) at /build/ceph-14.1.0-101-gdddb858/src/mgr/PyModuleRunner.cc:47
#24 0x00000000005adff5 in PyModuleRunner::PyModuleRunnerThread::entry (this=0x44721c8) at /build/ceph-14.1.0-101-gdddb858/src/mgr/PyModuleRunner.cc:106
#25 0x00007fbfd165f6ba in start_thread (arg=0x7fbfba1f9700) at pthread_create.c:333
#26 0x00007fbfd0e8841d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

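holds GIL, acquiring objecter rwlock for write! Thread 45 is inside Objecter::with_osdmap (one reader on the rwlock, see below) and waiting for the GIL, so the two threads deadlock.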
(gdb) p rwlock
$1 = {_M_impl = {_M_rwlock = {__data = {__lock = 0, __nr_readers = 1, __readers_wakeup = 3, __writer_wakeup = 2, __nr_readers_queued = 0, __nr_writers_queued = 1, __writer = 0, __shared = 0, __rwelision = 1 '\001', __pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 0}, 
      __size = "\000\000\000\000\001\000\000\000\003\000\000\000\002\000\000\000\000\000\000\000\001", '\000' <repeats 11 times>, "\001", '\000' <repeats 22 times>, __align = 4294967296}}}
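
The lock dependencies in these backtraces can be written down as a small wait-for graph. The following is only an illustrative Python sketch, not part of the mgr code: the `HOLDS` and `WAITS` tables and the helper names are hypothetical, and the shared hold on the objecter rwlock by Thread 45 is inferred from frame #7 and `__nr_readers = 1` rather than printed by gdb directly.

```python
from typing import Dict, List, Optional, Set

# Locks each thread is believed to hold, transcribed from the backtraces.
# Thread 45's hold on the Objecter rwlock is an inference (see lead-in).
# Threads with "holds ???" in the report are given an empty set.
HOLDS: Dict[str, Set[str]] = {
    "Thread 45": {"ClusterState lock", "Objecter rwlock"},
    "Thread 26": {"GIL"},
    "Thread 24": set(),
    "Thread 22": set(),
    "Thread 13": set(),
}

# The single lock each thread is blocked on.
WAITS: Dict[str, str] = {
    "Thread 45": "GIL",
    "Thread 26": "Objecter rwlock",
    "Thread 24": "ClusterState lock",
    "Thread 22": "ClusterState lock",
    "Thread 13": "ClusterState lock",
}


def owner_of(lock: str, holds: Dict[str, Set[str]]) -> Optional[str]:
    """Return the thread that holds `lock`, or None if no holder is known."""
    for thread, locks in holds.items():
        if lock in locks:
            return thread
    return None


def find_cycle(start: str,
               holds: Dict[str, Set[str]],
               waits: Dict[str, str]) -> Optional[List[str]]:
    """Follow waits-for edges from `start`; return the threads forming a
    cycle if one is reached, otherwise None."""
    path = [start]
    current = start
    while True:
        lock = waits.get(current)
        if lock is None:
            return None  # current thread is not blocked
        nxt = owner_of(lock, holds)
        if nxt is None:
            return None  # holder unknown, chain ends here
        if nxt in path:
            return path[path.index(nxt):]
        path.append(nxt)
        current = nxt


if __name__ == "__main__":
    for thread in WAITS:
        print(thread, "->", find_cycle(thread, HOLDS, WAITS))
```

Under these assumptions, starting from Thread 45 or Thread 26 walks straight into the two-thread cycle, and starting from Threads 24, 22, or 13 reaches the same cycle through the ClusterState lock, which matches the picture of everything else piling up behind the GIL/rwlock inversion.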

...

Thread 24 (Thread 0x7fbfbb9fc700 (LWP 1913866)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fbfd1661dbd in __GI___pthread_mutex_lock (mutex=0x4268568) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007fbfd23cf7b9 in Mutex::lock(bool) () from target:/usr/lib/ceph/libceph-common.so.0
#3  0x000000000055f741 in std::lock_guard<Mutex>::lock_guard (__m=..., this=<synthetic pointer>) at /usr/include/c++/7/bits/std_mutex.h:162
#4  ClusterState::with_mutable_pgmap<DaemonServer::send_report()::<lambda(PGMap&)> > (cb=<optimized out>, this=0x4268368) at /build/ceph-14.1.0-101-gdddb858/src/mgr/ClusterState.h:110
#5  DaemonServer::send_report (this=this@entry=0x4269238) at /build/ceph-14.1.0-101-gdddb858/src/mgr/DaemonServer.cc:2289
#6  0x0000000000560ebf in DaemonServer::tick (this=0x4269238) at /build/ceph-14.1.0-101-gdddb858/src/mgr/DaemonServer.cc:323
#7  0x00000000005116c9 in boost::function1<void, int>::operator() (a0=<optimized out>, this=<optimized out>) at /build/ceph-14.1.0-101-gdddb858/obj-x86_64-linux-gnu/boost/include/boost/function/function_template.hpp:768
#8  FunctionContext::finish (this=<optimized out>, r=<optimized out>) at /build/ceph-14.1.0-101-gdddb858/src/include/Context.h:487
#9  0x000000000050e4e9 in Context::complete (this=0x16245240, r=<optimized out>) at /build/ceph-14.1.0-101-gdddb858/src/include/Context.h:77
#10 0x00007fbfd23e6420 in SafeTimer::timer_thread() () from target:/usr/lib/ceph/libceph-common.so.0
#11 0x00007fbfd23e7ced in SafeTimerThread::entry() () from target:/usr/lib/ceph/libceph-common.so.0
#12 0x00007fbfd165f6ba in start_thread (arg=0x7fbfbb9fc700) at pthread_create.c:333

holds ???, acquiring ClusterState lock

Thread 22 (Thread 0x7fbfbc9fe700 (LWP 1913864)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fbfd1661dbd in __GI___pthread_mutex_lock (mutex=0x4268568) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007fbfd23cf7b9 in Mutex::lock(bool) () from target:/usr/lib/ceph/libceph-common.so.0
#3  0x0000000000532b87 in std::lock_guard<Mutex>::lock_guard (__m=..., this=<synthetic pointer>) at /usr/include/c++/7/bits/std_mutex.h:162
#4  ClusterState::ingest_pgstats (this=0x4268368, stats=0x12518340) at /build/ceph-14.1.0-101-gdddb858/src/mgr/ClusterState.cc:69
#5  0x0000000000560e1c in DaemonServer::ms_dispatch (this=0x4269238, m=0x12518340) at /build/ceph-14.1.0-101-gdddb858/src/mgr/DaemonServer.cc:266
#6  0x0000000000574c06 in Dispatcher::ms_dispatch2 (this=0x4269238, m=...) at /build/ceph-14.1.0-101-gdddb858/src/msg/Dispatcher.h:126
#7  0x00007fbfd2570809 in DispatchQueue::entry() () from target:/usr/lib/ceph/libceph-common.so.0
#8  0x00007fbfd261f84d in DispatchQueue::DispatchThread::entry() () from target:/usr/lib/ceph/libceph-common.so.0
#9  0x00007fbfd165f6ba in start_thread (arg=0x7fbfbc9fe700) at pthread_create.c:333
#10 0x00007fbfd0e8841d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

holds ???, acquiring ClusterState lock

Thread 12 (Thread 0x7fbfc7b58700 (LWP 1913663)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fbfd292965c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from target:/usr/lib/ceph/libceph-common.so.0
#2  0x00007fbfd239c9c5 in Finisher::finisher_thread_entry() () from target:/usr/lib/ceph/libceph-common.so.0
#3  0x00007fbfd165f6ba in start_thread (arg=0x7fbfc7b58700) at pthread_create.c:333
#4  0x00007fbfd0e8841d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

#2  0x00007fbfd23cf7b9 in Mutex::lock(bool) () from target:/usr/lib/ceph/libceph-common.so.0
#3  0x000000000057c2ec in std::lock_guard<Mutex>::lock_guard (__m=..., this=<synthetic pointer>) at /usr/include/c++/7/bits/std_mutex.h:162
#4  Mgr::get_services[abi:cxx11]() const (this=0x4268000) at /build/ceph-14.1.0-101-gdddb858/src/mgr/Mgr.cc:686
#5  0x000000000058ae14 in MgrStandby::send_beacon (this=this@entry=0x7ffebddb3270) at /build/ceph-14.1.0-101-gdddb858/src/mgr/MgrStandby.cc:244
#6  0x000000000058b362 in MgrStandby::tick (this=0x7ffebddb3270) at /build/ceph-14.1.0-101-gdddb858/src/mgr/MgrStandby.cc:253
#7  0x00000000005116c9 in boost::function1<void, int>::operator() (a0=<optimized out>, this=<optimized out>) at /build/ceph-14.1.0-101-gdddb858/obj-x86_64-linux-gnu/boost/include/boost/function/function_template.hpp:768
#8  FunctionContext::finish (this=<optimized out>, r=<optimized out>) at /build/ceph-14.1.0-101-gdddb858/src/include/Context.h:487
#9  0x000000000050e4e9 in Context::complete (this=0x1bbbfd10, r=<optimized out>) at /build/ceph-14.1.0-101-gdddb858/src/include/Context.h:77
#10 0x00007fbfd23e6420 in SafeTimer::timer_thread() () from target:/usr/lib/ceph/libceph-common.so.0
#11 0x00007fbfd23e7ced in SafeTimerThread::entry() () from target:/usr/lib/ceph/libceph-common.so.0
blocked trying to take the mgr lock

<pre>
Thread 13 (Thread 0x7fbfc7357700 (LWP 1913664)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fbfd1661dbd in __GI___pthread_mutex_lock (mutex=0x4268568) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007fbfd23cf7b9 in Mutex::lock(bool) () from target:/usr/lib/ceph/libceph-common.so.0
#3  0x0000000000532a7b in std::lock_guard<Mutex>::lock_guard (__m=..., this=<synthetic pointer>) at /usr/include/c++/7/bits/std_mutex.h:162
#4  ClusterState::set_service_map (this=0x4268368, new_service_map=...) at /build/ceph-14.1.0-101-gdddb858/src/mgr/ClusterState.cc:57
#5  0x0000000000582443 in Mgr::handle_service_map (this=this@entry=0x4268000, m=m@entry=0x4215400) at /build/ceph-14.1.0-101-gdddb858/src/mgr/Mgr.cc:509
#6  0x00000000005844fb in Mgr::ms_dispatch (this=this@entry=0x4268000, m=m@entry=0x4215400) at /build/ceph-14.1.0-101-gdddb858/src/mgr/Mgr.cc:556
#7  0x000000000058ccbe in MgrStandby::ms_dispatch (this=0x7ffebddb3270, m=0x4215400) at /build/ceph-14.1.0-101-gdddb858/src/mgr/MgrStandby.cc:436
#8  0x0000000000574c06 in Dispatcher::ms_dispatch2 (this=0x7ffebddb3270, m=...) at /build/ceph-14.1.0-101-gdddb858/src/msg/Dispatcher.h:126
#9  0x00007fbfd2570809 in DispatchQueue::entry() () from target:/usr/lib/ceph/libceph-common.so.0
#10 0x00007fbfd261f84d in DispatchQueue::DispatchThread::entry() () from target:/usr/lib/ceph/libceph-common.so.0
#11 0x00007fbfd165f6ba in start_thread (arg=0x7fbfc7357700) at pthread_create.c:333
</pre>

Related issues

Copied to RADOS - Backport #38561: mimic: mgr deadlock Resolved
Copied to RADOS - Backport #38562: luminous: mgr deadlock Resolved

History

#1 Updated by Sage Weil about 5 years ago

  • Status changed from In Progress to Fix Under Review

#2 Updated by Kefu Chai about 5 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to luminous, mimic

#3 Updated by Nathan Cutler about 5 years ago

#4 Updated by Nathan Cutler about 5 years ago

#5 Updated by Nathan Cutler almost 5 years ago

  • Status changed from Pending Backport to Resolved
