Project

General

Profile

Bug #55355

osd thread deadlock

Added by xueyong lu almost 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
% Done:

100%

Source:
Tags:
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
upgrade/nautilus-x
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

my ceph version is 14.2.22
After the network is abnormal, the osd cannot join the cluster.

Then I find the osd thread have deadlock

45   Thread 0x7fdf0ec7f700 (LWP 22855) "cfin" 0x00007fdf2620c54d in __lll_lock_wait () from /lib64/libpthread.so.0
71   Thread 0x7fdf20ca3700 (LWP 20084) "msgr-worker-2" 0x00007fdf2620c54d in __lll_lock_wait () from /lib64/libpthread.so.0

(gdb) t 45
[Switching to thread 45 (Thread 0x7fdf0ec7f700 (LWP 22855))]
#0  0x00007fdf2620c54d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007fdf2620c54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fdf26207e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007fdf26207d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000055f9d176317f in __gthread_mutex_lock (__mutex=0x55fa432cd268) at /opt/rh/devtoolset-8/root/usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:748
#4  lock (this=0x55fa432cd268) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_mutex.h:103
#5  AsyncConnection::stop(bool) () at /usr/src/debug/ceph-14.2.22-2/src/msg/async/AsyncConnection.cc:711
#6  0x000055f9d15a61c5 in AsyncMessenger::shutdown_connections(bool) () at /usr/src/debug/ceph-14.2.22-2/src/msg/async/AsyncMessenger.cc:806
#7  0x000055f9d15a7e36 in mark_down_all (this=0x55f9dca4b600) at /usr/src/debug/ceph-14.2.22-2/src/msg/async/AsyncMessenger.h:158
#8  AsyncMessenger::rebind(std::set<int, std::less<int>, std::allocator<int> > const&) () at /usr/src/debug/ceph-14.2.22-2/src/msg/async/AsyncMessenger.cc:428
#9  0x000055f9d0e782c2 in OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) () at /usr/src/debug/ceph-14.2.22-2/src/osd/OSD.cc:8750
#10 0x000055f9d0ecf7b7 in C_OnMapCommit::finish (this=0x55fa463b3ee0, r=<optimized out>) at /usr/src/debug/ceph-14.2.22-2/src/osd/OSD.cc:8204
#11 0x000055f9d0e80399 in Context::complete (this=0x55fa463b3ee0, r=<optimized out>) at /usr/src/debug/ceph-14.2.22-2/src/include/Context.h:77
#12 0x000055f9d13f454f in Finisher::finisher_thread_entry() () at /usr/src/debug/ceph-14.2.22-2/src/common/Finisher.cc:67
#13 0x00007fdf26205ea5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fdf250ca8cd in clone () from /lib64/libc.so.6
(gdb) 
(gdb) f 5
#5  AsyncConnection::stop(bool) () at /usr/src/debug/ceph-14.2.22-2/src/msg/async/AsyncConnection.cc:711
711      lock.lock()
(gdb) p lock
$5 = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 2, __count = 0, __owner = 20084, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, 
          __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000tN\000\000\001", '\000' <repeats 26 times>, __align = 2}}, <No data fields>}
(gdb) p &lock
$6 = (std::mutex *) 0x55fa432cd268
(gdb) 
(gdb) f 6
#6  0x000055f9d15a61c5 in AsyncMessenger::shutdown_connections(bool) () at /usr/src/debug/ceph-14.2.22-2/src/msg/async/AsyncMessenger.cc:806
806        p->stop(queue_reset);
(gdb) p lock
$17 = {name = "AsyncMessenger::lock", id = -1, recursive = false, lockdep = true, backtrace = false, _m = {__data = {__lock = 2, __count = 0, __owner = 22855, __nusers = 1, 
      __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = "\002\000\000\000\000\000\000\000GY\000\000\001\000\000\000\002", '\000' <repeats 22 times>, __align = 2}, nlock = 1, locked_by = 140596002420480}
(gdb) p &lock
$18 = (Mutex *) 0x55f9dca4bc80
(gdb) 

thread 45  holding lock A(0x55f9dca4bc80) and wait for lock B(0x55fa432cd268)=============

(gdb) t 71
[Switching to thread 71 (Thread 0x7fdf20ca3700 (LWP 20084))]
#0  0x00007fdf2620c54d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007fdf2620c54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fdf26207e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007fdf26207d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000055f9d1415429 in Mutex::lock (this=this@entry=0x55f9dca4bc80, no_lockdep=no_lockdep@entry=false) at /usr/src/debug/ceph-14.2.22-2/src/common/Mutex.cc:78
#4  0x000055f9d15a6dcc in lock_guard (__m=..., this=<synthetic pointer>) at /usr/src/debug/ceph-14.2.22-2/src/msg/async/AsyncMessenger.cc:855
#5  AsyncMessenger::accept_conn(boost::intrusive_ptr<AsyncConnection>) () at /usr/src/debug/ceph-14.2.22-2/src/msg/async/AsyncMessenger.cc:855
#6  0x000055f9d179777f in ProtocolV2::send_server_ident() () at /usr/src/debug/ceph-14.2.22-2/src/common/RefCountedObj.h:171
#7  0x000055f9d179c923 in ProtocolV2::handle_existing_connection(boost::intrusive_ptr<AsyncConnection>) () at /usr/src/debug/ceph-14.2.22-2/src/msg/async/ProtocolV2.cc:2543
#8  0x000055f9d179e183 in ProtocolV2::handle_client_ident(ceph::buffer::v14_2_0::list&) () at /usr/src/debug/ceph-14.2.22-2/src/common/RefCountedObj.h:171
#9  0x000055f9d179e64b in ProtocolV2::handle_frame_payload() () at /usr/src/debug/ceph-14.2.22-2/src/msg/async/ProtocolV2.cc:1217
#10 0x000055f9d179e8a0 in ProtocolV2::handle_read_frame_dispatch() () at /usr/src/debug/ceph-14.2.22-2/src/msg/async/ProtocolV2.cc:1126
#11 0x000055f9d179ea45 in ProtocolV2::_handle_read_frame_epilogue_main() () at /usr/src/debug/ceph-14.2.22-2/src/msg/async/ProtocolV2.cc:1316
#12 0x000055f9d179eaea in ProtocolV2::_handle_read_frame_segment() () at /usr/src/debug/ceph-14.2.22-2/src/msg/async/ProtocolV2.cc:1185
#13 0x000055f9d179f90d in ProtocolV2::handle_read_frame_segment(std::unique_ptr<ceph::buffer::v14_2_0::ptr_node, ceph::buffer::v14_2_0::ptr_node::disposer>&&, int) ()
    at /usr/src/debug/ceph-14.2.22-2/src/msg/async/ProtocolV2.cc:1177
#14 0x000055f9d178a3c4 in ProtocolV2::run_continuation(Ct<ProtocolV2>&) () at /usr/src/debug/ceph-14.2.22-2/src/msg/async/ProtocolV2.cc:47
#15 0x000055f9d1765bd6 in operator() (__args#1=<optimized out>, __args#0=<optimized out>, this=0x55fa2eed6fc8)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:260
#16 AsyncConnection::process() () at /usr/src/debug/ceph-14.2.22-2/src/msg/async/AsyncConnection.cc:450
#17 0x000055f9d15af0b5 in EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) ()
    at /usr/src/debug/ceph-14.2.22-2/src/common/StackStringStream.h:143
#18 0x000055f9d15b3817 in operator() (__closure=0x55f9dc93b2a8, __closure=0x55f9dc93b2a8) at /usr/src/debug/ceph-14.2.22-2/src/msg/async/Stack.cc:53
#19 std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:297
#20 0x000055f9d1a98fbf in execute_native_thread_routine ()
#21 0x00007fdf26205ea5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007fdf250ca8cd in clone () from /lib64/libc.so.6

(gdb) f 5
#5  AsyncMessenger::accept_conn(boost::intrusive_ptr<AsyncConnection>) () at /usr/src/debug/ceph-14.2.22-2/src/msg/async/AsyncMessenger.cc:855
855      Mutex::Locker l(lock);
(gdb) p lock
$9 = {name = "AsyncMessenger::lock", id = -1, recursive = false, lockdep = true, backtrace = false, _m = {__data = {__lock = 2, __count = 0, __owner = 22855, __nusers = 1, 
      __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = "\002\000\000\000\000\000\000\000GY\000\000\001\000\000\000\002", '\000' <repeats 22 times>, __align = 2}, nlock = 1, locked_by = 140596002420480}
(gdb) p &lock
$12 = (Mutex *) 0x55f9dca4bc80
(gdb) f 7
#7  0x000055f9d179c923 in ProtocolV2::handle_existing_connection(boost::intrusive_ptr<AsyncConnection>) () at /usr/src/debug/ceph-14.2.22-2/src/msg/async/ProtocolV2.cc:2543
2543        return send_server_ident();
(gdb) p existing
$5 = {px = 0x55fa432cd000}
(gdb) 
(gdb) p l
$65 = {_M_device = @0x55fa432cd268}
(gdb) p l._M_device
$66 = (std::lock_guard<std::mutex>::mutex_type &) @0x55fa432cd268: {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 2, __count = 0, __owner = 20084, __nusers = 1, 
        __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000tN\000\000\001", '\000' <repeats 26 times>, 
      __align = 2}}, <No data fields>}

thread 71  holding lock B(0x55fa432cd268) and wait for lock A(0x55f9dca4bc80)=============


Related issues

Copied to RADOS - Backport #56722: pacific: osd thread deadlock Resolved
Copied to RADOS - Backport #56723: quincy: osd thread deadlock Resolved

History

#1 Updated by xueyong lu almost 2 years ago

I find thread 45 wants to stop connection,but the connection has been stopped by thread 71


(gdb) f 5
#5 AsyncConnection::stop(bool) () at /usr/src/debug/ceph-14.2.22-2/src/msg/async/AsyncConnection.cc:711
711 lock.lock();
(gdb) p this
$38 = (AsyncConnection * const) 0x55fa432cd000
(gdb) p state
$39 = 5 //STATE_CLOSED

if need check the state value before lock.lock(),if the value is STATE_CLOSED,return directly.

void AsyncConnection::stop(bool queue_reset) {
lock.lock();
bool need_queue_reset = (state != STATE_CLOSED) && queue_reset;
protocol->stop();
lock.unlock();
if (need_queue_reset) dispatch_queue->queue_reset(this);
}

#2 Updated by Radoslaw Zarzynski almost 2 years ago

  • Description updated (diff)

#3 Updated by Radoslaw Zarzynski almost 2 years ago

Hello! 14.2.22 is out-of-live actually. Would you be able to verify the issue on a newer one?

#4 Updated by xueyong lu almost 2 years ago

Thanks for your reply @Radoslaw Zarzynski.
I checked the latest code and found that the code logic is the same. I think the new version has the same problem, but the problem occurs probabilistically.

#5 Updated by Radoslaw Zarzynski almost 2 years ago

  • Assignee set to Radoslaw Zarzynski

#6 Updated by jianwei zhang almost 2 years ago

ceph v15.2.13

I found an almost identical stack waiting for a lock

# file core.ceph-osd.167.022de30fa89a4d9a8a8a7903919ea8fb.908795.1653283610000000
core.ceph-osd.167.022de30fa89a4d9a8a8a7903919ea8fb.908795.1653283610000000: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/ceph-osd -f --cluster ceph --id 47 --setuser ceph --setgroup ceph', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/bin/ceph-osd', platform: 'x86_64'
[root@thost4 coredump]# gdb /usr/bin/ceph-osd core.ceph-osd.167.022de30fa89a4d9a8a8a7903919ea8fb.908795.1653283610000000
(gdb) t 35
[Switching to thread 35 (Thread 0x7f4e3ad5b700 (LWP 910429))]
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f4e506af979 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x000055f1a5c9e9da in __gthread_mutex_lock (__mutex=0x55f1da3b9e40) at /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:748
#3  0x000055f1a5ca8c4c in std::mutex::lock (this=0x55f1da3b9e40) at /usr/include/c++/8/bits/std_mutex.h:103
#4  0x000055f1a6dc6ff5 in AsyncConnection::stop (this=0x55f1da3b9c00, queue_reset=true) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncConnection.cc:729
#5  0x000055f1a6acee25 in AsyncMessenger::shutdown_connections (this=0x55f1b247a800, queue_reset=true) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.cc:792
#6  0x000055f1a6ad3117 in AsyncMessenger::mark_down_all (this=0x55f1b247a800) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.h:160
#7  0x000055f1a6acade0 in AsyncMessenger::rebind (this=0x55f1b247a800, avoid_ports=std::set with 2 elements = {...}) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.cc:431
#8  0x000055f1a5d186b9 in OSD::_committed_osd_maps (this=0x55f1b3078000, first=3957, last=3960, m=0x55f1b98c3c00) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/osd/OSD.cc:8284
#9  0x000055f1a5d66487 in C_OnMapCommit::finish (this=0x55f1c783fae0, r=0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/osd/OSD.cc:7704
#10 0x000055f1a5d4125f in Context::complete (this=0x55f1c783fae0, r=0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/include/Context.h:77
#11 0x000055f1a67eda3b in Finisher::finisher_thread_entry (this=0x55f1b2482478) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Finisher.cc:66
#12 0x000055f1a5d52ba2 in Finisher::FinisherThread::entry (this=0x55f1b2482568) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Finisher.h:63
#13 0x000055f1a6826dd4 in Thread::entry_wrapper (this=0x55f1b2482568) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Thread.cc:84
#14 0x000055f1a6826d4e in Thread::_entry_func (arg=0x55f1b2482568) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Thread.cc:71
#15 0x00007f4e506ad14a in start_thread () from /lib64/libpthread.so.0
#16 0x00007f4e4f817dc3 in clone () from /lib64/libc.so.6
(gdb) t 3
[Switching to thread 3 (Thread 0x7f4e4b837700 (LWP 908805))]
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f4e506af979 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x000055f1a5c9e9da in __gthread_mutex_lock (__mutex=0x55f1b247ae40) at /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:748
#3  0x000055f1a5ca8c4c in std::mutex::lock (this=0x55f1b247ae40) at /usr/include/c++/8/bits/std_mutex.h:103
#4  0x000055f1a5cab1e2 in std::lock_guard<std::mutex>::lock_guard (this=0x7f4e4b8336f8, __m=...) at /usr/include/c++/8/bits/std_mutex.h:162
#5  0x000055f1a6acf76f in AsyncMessenger::accept_conn (this=0x55f1b247a800, conn=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.cc:847
#6  0x000055f1a6e147ef in ProtocolV2::send_server_ident (this=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:2807
#7  0x000055f1a6e12234 in ProtocolV2::handle_existing_connection (this=0x55f1e95e7700, existing=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:2538
#8  0x000055f1a6e0f456 in ProtocolV2::handle_client_ident (this=0x55f1e95e7700, payload=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:2356
#9  0x000055f1a6e01a2d in ProtocolV2::handle_frame_payload (this=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:1226
#10 0x000055f1a6e00a3c in ProtocolV2::handle_read_frame_dispatch (this=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:1135
#11 0x000055f1a6e02719 in ProtocolV2::_handle_read_frame_epilogue_main (this=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:1325
#12 0x000055f1a6e0169e in ProtocolV2::_handle_read_frame_segment (this=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:1194
#13 0x000055f1a6e015ac in ProtocolV2::handle_read_frame_segment (this=0x55f1e95e7700, rx_buffer=..., r=0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:1186
#14 0x000055f1a6e2c47a in CtRxNode<ProtocolV2>::call (this=0x55f1e95e7a88, foo=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/Protocol.h:67
#15 0x000055f1a6df4ac0 in ProtocolV2::run_continuation (this=0x55f1e95e7700, continuation=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:47
#16 0x000055f1a6dfbd30 in ProtocolV2::<lambda(char*, int)>::operator()(char *, int) const (__closure=0x55f1f1309fa0, buffer=0x55f1c22e0b00 "\b\001\236", r=0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:755
#17 0x000055f1a6e167c7 in std::_Function_handler<void(char*, long int), ProtocolV2::read(CONTINUATION_RXBPTR_TYPE<ProtocolV2>&, rx_buffer_t&&)::<lambda(char*, int)> >::_M_invoke(const std::_Any_data &, char *&&, long &&) (__functor=..., __args#0=@0x7f4e4b8341c0: 0x55f1c22e0b00 "\b\001\236", __args#1=@0x7f4e4b8341b8: 0) at /usr/include/c++/8/bits/std_function.h:297
#18 0x000055f1a6dca24d in std::function<void (char*, long)>::operator()(char*, long) const (this=0x55f1f1309fa0, __args#0=0x55f1c22e0b00 "\b\001\236", __args#1=0) at /usr/include/c++/8/bits/std_function.h:687
#19 0x000055f1a6dc4b27 in AsyncConnection::process (this=0x55f1f1309c00) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncConnection.cc:457
#20 0x000055f1a6dc9644 in C_handle_read::do_request (this=0x55f1cfa0d070, fd_or_id=232) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncConnection.cc:71
#21 0x000055f1a6ae7e3a in EventCenter::process_events (this=0x55f1b241d280, timeout_microseconds=30000000, working_dur=0x7f4e4b834578) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/Event.cc:406
#22 0x000055f1a6af0826 in NetworkStack::<lambda()>::operator()(void) const (__closure=0x55f1b243bbf8) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/Stack.cc:53
#23 0x000055f1a6af1d41 in std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/8/bits/std_function.h:297
#24 0x000055f1a5ff0dfa in std::function<void ()>::operator()() const (this=0x55f1b243bbf8) at /usr/include/c++/8/bits/std_function.h:687
#25 0x000055f1a6e2e9e1 in std::__invoke_impl<void, std::function<void ()>>(std::__invoke_other, std::function<void ()>&&) (__f=...) at /usr/include/c++/8/bits/invoke.h:60
#26 0x000055f1a6e2e1c5 in std::__invoke<std::function<void ()>>(std::function<void ()>&&) (__fn=...) at /usr/include/c++/8/bits/invoke.h:95
#27 0x000055f1a6e2ffce in std::thread::_Invoker<std::tuple<std::function<void ()> > >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x55f1b243bbf8) at /usr/include/c++/8/thread:244
#28 0x000055f1a6e2ffa0 in std::thread::_Invoker<std::tuple<std::function<void ()> > >::operator()() (this=0x55f1b243bbf8) at /usr/include/c++/8/thread:253
#29 0x000055f1a6e2ff80 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void ()> > > >::_M_run() (this=0x55f1b243bbf0) at /usr/include/c++/8/thread:196
#30 0x00007f4e5013cba3 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#31 0x00007f4e506ad14a in start_thread () from /lib64/libpthread.so.0
#32 0x00007f4e4f817dc3 in clone () from /lib64/libc.so.6

#7 Updated by jianwei zhang almost 2 years ago

(gdb) t 35
[Switching to thread 35 (Thread 0x7f4e3ad5b700 (LWP 910429))]
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f4e506af979 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x000055f1a5c9e9da in __gthread_mutex_lock (__mutex=0x55f1da3b9e40) at /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:748
#3  0x000055f1a5ca8c4c in std::mutex::lock (this=0x55f1da3b9e40) at /usr/include/c++/8/bits/std_mutex.h:103
#4  0x000055f1a6dc6ff5 in AsyncConnection::stop (this=0x55f1da3b9c00, queue_reset=true) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncConnection.cc:729

void AsyncConnection::stop(bool queue_reset) {
  lock.lock();                    ///0x55f1da3b9e40
  bool need_queue_reset = (state != STATE_CLOSED) && queue_reset;
  protocol->stop();
  lock.unlock();
  if (need_queue_reset) dispatch_queue->queue_reset(this);
}

#5  0x000055f1a6acee25 in AsyncMessenger::shutdown_connections (this=0x55f1b247a800, queue_reset=true) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.cc:792
void AsyncMessenger::shutdown_connections(bool queue_reset)
{
  ldout(cct,1) << __func__ << " " << dendl;
  std::lock_guard l{lock};        ///0x55f1b247ae40
  for (const auto& c : accepting_conns) {
    ldout(cct, 5) << __func__ << " accepting_conn " << c << dendl;
    c->stop(queue_reset);
  }
  accepting_conns.clear();

  for (const auto& [e, c] : conns) {
    ldout(cct, 5) << __func__ << " mark down " << e << " " << c << dendl;
    c->get_perf_counter()->dec(l_msgr_active_connections);
    c->stop(queue_reset);
  }
  ... ...
}

#6  0x000055f1a6ad3117 in AsyncMessenger::mark_down_all (this=0x55f1b247a800) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.h:160
#7  0x000055f1a6acade0 in AsyncMessenger::rebind (this=0x55f1b247a800, avoid_ports=std::set with 2 elements = {...}) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.cc:431
#8  0x000055f1a5d186b9 in OSD::_committed_osd_maps (this=0x55f1b3078000, first=3957, last=3960, m=0x55f1b98c3c00) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/osd/OSD.cc:8284
#9  0x000055f1a5d66487 in C_OnMapCommit::finish (this=0x55f1c783fae0, r=0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/osd/OSD.cc:7704
#10 0x000055f1a5d4125f in Context::complete (this=0x55f1c783fae0, r=0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/include/Context.h:77
#11 0x000055f1a67eda3b in Finisher::finisher_thread_entry (this=0x55f1b2482478) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Finisher.cc:66
#12 0x000055f1a5d52ba2 in Finisher::FinisherThread::entry (this=0x55f1b2482568) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Finisher.h:63
#13 0x000055f1a6826dd4 in Thread::entry_wrapper (this=0x55f1b2482568) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Thread.cc:84
#14 0x000055f1a6826d4e in Thread::_entry_func (arg=0x55f1b2482568) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Thread.cc:71
#15 0x00007f4e506ad14a in start_thread () from /lib64/libpthread.so.0
#16 0x00007f4e4f817dc3 in clone () from /lib64/libc.so.6
(gdb) frame 0
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) f 5
#5  0x000055f1a6acee25 in AsyncMessenger::shutdown_connections (this=0x55f1b247a800, queue_reset=true) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.cc:792
792        c->stop(queue_reset);
(gdb) f 4
#4  0x000055f1a6dc6ff5 in AsyncConnection::stop (this=0x55f1da3b9c00, queue_reset=true) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncConnection.cc:729
729      lock.lock();
(gdb) p lock
$13 = {
  <std::__mutex_base> = {
    _M_mutex = {
      __data = {
        __lock = 2, 
        __count = 0, 
        __owner = 908805, 
        __nusers = 1, 
        __kind = 0, 
        __spins = 0, 
        __elision = 0, 
        __list = {
          __prev = 0x0, 
          __next = 0x0
        }
      }, 
      __size = "\002\000\000\000\000\000\000\000\005\336\r\000\001", '\000' <repeats 26 times>, 
      __align = 2
    }
  }, <No data fields>}

(gdb) p &lock
$3 = (std::mutex *) 0x55f1da3b9e40
(gdb) f 5
#5  0x000055f1a6acee25 in AsyncMessenger::shutdown_connections (this=0x55f1b247a800, queue_reset=true) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.cc:792
792        c->stop(queue_reset);
(gdb) p lock
$4 = {
  <std::__mutex_base> = {
    _M_mutex = {
      __data = {
        __lock = 2, 
        __count = 0, 
        __owner = 910429, 
        __nusers = 1, 
        __kind = 0, 
        __spins = 0, 
        __elision = 0, 
        __list = {
          __prev = 0x0, 
          __next = 0x0
        }
      }, 
      __size = "\002\000\000\000\000\000\000\000]\344\r\000\001", '\000' <repeats 26 times>, 
      __align = 2
    }
  }, <No data fields>}
(gdb) p &lock
$5 = (ceph::mutex *) 0x55f1b247ae40

thread 35(910429) holding 0x55f1b247ae40 lock, waiting for thread 3(908805) holding 0x55f1da3b9e40 lock

(gdb) t 3
[Switching to thread 3 (Thread 0x7f4e4b837700 (LWP 908805))]
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f4e506af979 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x000055f1a5c9e9da in __gthread_mutex_lock (__mutex=0x55f1b247ae40) at /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:748
#3  0x000055f1a5ca8c4c in std::mutex::lock (this=0x55f1b247ae40) at /usr/include/c++/8/bits/std_mutex.h:103
#4  0x000055f1a5cab1e2 in std::lock_guard<std::mutex>::lock_guard (this=0x7f4e4b8336f8, __m=...) at /usr/include/c++/8/bits/std_mutex.h:162
#5  0x000055f1a6acf76f in AsyncMessenger::accept_conn (this=0x55f1b247a800, conn=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.cc:847
int AsyncMessenger::accept_conn(const AsyncConnectionRef& conn)
{
  std::lock_guard l{lock};            ///0x55f1b247ae40
  if (conn->policy.server &&
      conn->policy.lossy &&
      !conn->policy.register_lossy_clients) {
    anon_conns.insert(conn);
    conn->get_perf_counter()->inc(l_msgr_active_connections);
    return 0;
  }
  auto it = conns.find(*conn->peer_addrs);
  if (it != conns.end()) {
    auto& existing = it->second;

    // lazy delete, see "deleted_conns" 
    // If conn already in, we will return 0
    std::lock_guard l{deleted_lock};
    if (deleted_conns.erase(existing)) {
      conns.erase(it);
    } else if (conn != existing) {
    ... ...
}
#6  0x000055f1a6e147ef in ProtocolV2::send_server_ident (this=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:2807
#7  0x000055f1a6e12234 in ProtocolV2::handle_existing_connection (this=0x55f1e95e7700, existing=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:2538
CtPtr ProtocolV2::handle_existing_connection(const AsyncConnectionRef& existing) {
  ldout(cct, 20) << __func__ << " existing=" << existing << dendl;

  std::lock_guard<std::mutex> l(existing->lock);

  ProtocolV2 *exproto = dynamic_cast<ProtocolV2 *>(existing->protocol.get());
  if (!exproto) {
    ldout(cct, 1) << __func__ << " existing=" << existing << dendl;
    ceph_assert(false);
  }

  if (exproto->state == CLOSED) {
    ldout(cct, 1) << __func__ << " existing " << existing << " already closed." 
                  << dendl;
    return send_server_ident();
  }
  ... ...
}
#8  0x000055f1a6e0f456 in ProtocolV2::handle_client_ident (this=0x55f1e95e7700, payload=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:2356
#9  0x000055f1a6e01a2d in ProtocolV2::handle_frame_payload (this=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:1226
#10 0x000055f1a6e00a3c in ProtocolV2::handle_read_frame_dispatch (this=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:1135
#11 0x000055f1a6e02719 in ProtocolV2::_handle_read_frame_epilogue_main (this=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:1325
#12 0x000055f1a6e0169e in ProtocolV2::_handle_read_frame_segment (this=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:1194
#13 0x000055f1a6e015ac in ProtocolV2::handle_read_frame_segment (this=0x55f1e95e7700, rx_buffer=..., r=0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:1186
#14 0x000055f1a6e2c47a in CtRxNode<ProtocolV2>::call (this=0x55f1e95e7a88, foo=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/Protocol.h:67
#15 0x000055f1a6df4ac0 in ProtocolV2::run_continuation (this=0x55f1e95e7700, continuation=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:47
#16 0x000055f1a6dfbd30 in ProtocolV2::<lambda(char*, int)>::operator()(char *, int) const (__closure=0x55f1f1309fa0, buffer=0x55f1c22e0b00 "\b\001\236", r=0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:755
#17 0x000055f1a6e167c7 in std::_Function_handler<void(char*, long int), ProtocolV2::read(CONTINUATION_RXBPTR_TYPE<ProtocolV2>&, rx_buffer_t&&)::<lambda(char*, int)> >::_M_invoke(const std::_Any_data &, char *&&, long &&) (__functor=..., __args#0=@0x7f4e4b8341c0: 0x55f1c22e0b00 "\b\001\236", __args#1=@0x7f4e4b8341b8: 0) at /usr/include/c++/8/bits/std_function.h:297
#18 0x000055f1a6dca24d in std::function<void (char*, long)>::operator()(char*, long) const (this=0x55f1f1309fa0, __args#0=0x55f1c22e0b00 "\b\001\236", __args#1=0) at /usr/include/c++/8/bits/std_function.h:687
#19 0x000055f1a6dc4b27 in AsyncConnection::process (this=0x55f1f1309c00) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncConnection.cc:457
#20 0x000055f1a6dc9644 in C_handle_read::do_request (this=0x55f1cfa0d070, fd_or_id=232) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncConnection.cc:71
#21 0x000055f1a6ae7e3a in EventCenter::process_events (this=0x55f1b241d280, timeout_microseconds=30000000, working_dur=0x7f4e4b834578) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/Event.cc:406
#22 0x000055f1a6af0826 in NetworkStack::<lambda()>::operator()(void) const (__closure=0x55f1b243bbf8) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/Stack.cc:53
#23 0x000055f1a6af1d41 in std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/8/bits/std_function.h:297
#24 0x000055f1a5ff0dfa in std::function<void ()>::operator()() const (this=0x55f1b243bbf8) at /usr/include/c++/8/bits/std_function.h:687
#25 0x000055f1a6e2e9e1 in std::__invoke_impl<void, std::function<void ()>>(std::__invoke_other, std::function<void ()>&&) (__f=...) at /usr/include/c++/8/bits/invoke.h:60
#26 0x000055f1a6e2e1c5 in std::__invoke<std::function<void ()>>(std::function<void ()>&&) (__fn=...) at /usr/include/c++/8/bits/invoke.h:95
#27 0x000055f1a6e2ffce in std::thread::_Invoker<std::tuple<std::function<void ()> > >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x55f1b243bbf8) at /usr/include/c++/8/thread:244
#28 0x000055f1a6e2ffa0 in std::thread::_Invoker<std::tuple<std::function<void ()> > >::operator()() (this=0x55f1b243bbf8) at /usr/include/c++/8/thread:253
#29 0x000055f1a6e2ff80 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void ()> > > >::_M_run() (this=0x55f1b243bbf0) at /usr/include/c++/8/thread:196
#30 0x00007f4e5013cba3 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#31 0x00007f4e506ad14a in start_thread () from /lib64/libpthread.so.0
#32 0x00007f4e4f817dc3 in clone () from /lib64/libc.so.6
(gdb) f 5
#5  0x000055f1a6acf76f in AsyncMessenger::accept_conn (this=0x55f1b247a800, conn=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.cc:847
847      std::lock_guard l{lock};
(gdb) p lock
$6 = {
  <std::__mutex_base> = {
    _M_mutex = {
      __data = {
        __lock = 2, 
        __count = 0, 
        __owner = 910429, 
        __nusers = 1, 
        __kind = 0, 
        __spins = 0, 
        __elision = 0, 
        __list = {
          __prev = 0x0, 
          __next = 0x0
        }
      }, 
      __size = "\002\000\000\000\000\000\000\000]\344\r\000\001", '\000' <repeats 26 times>, 
      __align = 2
    }
  }, <No data fields>}
(gdb) p &lock
$7 = (ceph::mutex *) 0x55f1b247ae40
(gdb) f 7
#7  0x000055f1a6e12234 in ProtocolV2::handle_existing_connection (this=0x55f1e95e7700, existing=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:2538
2538        return send_server_ident();
(gdb) p existing
$8 = (const AsyncConnectionRef &) @0x7f4e4b833af8: {
  px = 0x55f1da3b9c00
}
(gdb) p l
$9 = {
  _M_device = @0x55f1da3b9e40
}
(gdb) p l._M_device
$10 = (std::lock_guard<std::mutex>::mutex_type &) @0x55f1da3b9e40: {
  <std::__mutex_base> = {
    _M_mutex = {
      __data = {
        __lock = 2, 
        __count = 0, 
        __owner = 908805, 
        __nusers = 1, 
        __kind = 0, 
        __spins = 0, 
        __elision = 0, 
        __list = {
          __prev = 0x0, 
          __next = 0x0
        }
      }, 
      __size = "\002\000\000\000\000\000\000\000\005\336\r\000\001", '\000' <repeats 26 times>, 
      __align = 2
    }
  }, <No data fields>}
(gdb) f 7
#7  0x000055f1a6e12234 in ProtocolV2::handle_existing_connection (this=0x55f1e95e7700, existing=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:2538
2538        return send_server_ident();
(gdb) p existing.px->lock
$15 = {
  <std::__mutex_base> = {
    _M_mutex = {
      __data = {
        __lock = 2, 
        __count = 0, 
        __owner = 908805, 
        __nusers = 1, 
        __kind = 0, 
        __spins = 0, 
        __elision = 0, 
        __list = {
          __prev = 0x0, 
          __next = 0x0
        }
      }, 
      __size = "\002\000\000\000\000\000\000\000\005\336\r\000\001", '\000' <repeats 26 times>, 
      __align = 2
    }
  }, <No data fields>}
(gdb) p &existing.px->lock
$16 = (std::mutex *) 0x55f1da3b9e40

thread thread 3(908805) holding 0x55f1da3b9e40 lock, waiting for 35(910429) holding 0x55f1b247ae40 lock

#8 Updated by jianwei zhang almost 2 years ago

thread-35 : holding AsyncMessenger::lock, waiting AsyncConnection::lock
thread-3: holding AsyncConnection::lock, waiting AsyncMessenger::lock

#9 Updated by jianwei zhang almost 2 years ago

910583--wait-->910587: (gdb) t 32 wait MgrClient::lock (owner=910587)
910587--wait-->910429: (gdb) t 62 hold MgrClient::lock, wait OSD::map_lock(owner=910429)
910429--wait-->908805: (gdb) t 35 hold OSD::osd_lock, hold OSD::map_lock, hold AsyncMessenger::lock, wait AsyncConnection::lock(owner=908805)
908805--wait-->910429: (gdb) t 3 hold AsyncConnection::lock, wait AsyncMessenger::lock(owner=910429)
908910--wait-->910429: (gdb) t 9 wait OSD::osd_lock(owner=910429)
908811--wait-->910429: (gdb) t 7 wait OSD::osd_lock(owner=910429)

#10 Updated by jianwei zhang almost 2 years ago

910583: (gdb) t 32 wait MgrClient::lock (owner=910587)
[Switching to thread 32 (Thread 0x7f4e3c6e3700 (LWP 910583))]
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f4e506af979 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x000055f1a5c9e9da in __gthread_mutex_lock (__mutex=0x55f1b30783b0) at /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:748
#3  0x000055f1a5ca8c4c in std::mutex::lock (this=0x55f1b30783b0) at /usr/include/c++/8/bits/std_mutex.h:103
#4  0x000055f1a5cab1e2 in std::lock_guard<std::mutex>::lock_guard (this=0x7f4e3c6e0418, __m=...) at /usr/include/c++/8/bits/std_mutex.h:162
#5  0x000055f1a6ba68db in MgrClient::ms_dispatch2 (this=0x55f1b3078230, m=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/mgr/MgrClient.cc:97
#6  0x000055f1a6dbcf31 in Messenger::ms_deliver_dispatch (this=0x55f1b247b800, m=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/Messenger.h:704
#7  0x000055f1a6dbbfaf in DispatchQueue::entry (this=0x55f1b247bb28) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/DispatchQueue.cc:199
#8  0x000055f1a6ad24b2 in DispatchQueue::DispatchThread::entry (this=0x55f1b247bc78) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/DispatchQueue.h:101
#9  0x000055f1a6826dd4 in Thread::entry_wrapper (this=0x55f1b247bc78) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Thread.cc:84
#10 0x000055f1a6826d4e in Thread::_entry_func (arg=0x55f1b247bc78) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Thread.cc:71
#11 0x00007f4e506ad14a in start_thread () from /lib64/libpthread.so.0
#12 0x00007f4e4f817dc3 in clone () from /lib64/libc.so.6
(gdb) f 5
#5  0x000055f1a6ba68db in MgrClient::ms_dispatch2 (this=0x55f1b3078230, m=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/mgr/MgrClient.cc:97
97      std::lock_guard l(lock);
(gdb) p lock
$1 = {
  <std::__mutex_base> = {
    _M_mutex = {
      __data = {
        __lock = 2, 
        __count = 0, 
        __owner = 910587, 
        __nusers = 1, 
        __kind = 0, 
        __spins = 0, 
        __elision = 0, 
        __list = {
          __prev = 0x0, 
          __next = 0x0
        }
      }, 
      __size = "\002\000\000\000\000\000\000\000\373\344\r\000\001", '\000' <repeats 26 times>, 
      __align = 2
    }
  }, <No data fields>}
(gdb) p &lock
$2 = (ceph::mutex *) 0x55f1b30783b0

class MgrClient : public Dispatcher
{
  ceph::mutex lock = ceph::make_mutex("MgrClient::lock");
};

910587: (gdb) t 62 hold MgrClient::lock, wait OSD::map_lock(owner=910429)
[Switching to thread 62 (Thread 0x7f4e37f64700 (LWP 910587))]
#0  0x00007f4e506b1ec2 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f4e506b1ec2 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1  0x000055f1a5d3b0da in std::__shared_mutex_pthread::lock_shared (this=0x55f1b3078f68) at /usr/include/c++/8/shared_mutex:139
#2  0x000055f1a5d3b19a in std::shared_mutex::lock_shared (this=0x55f1b3078f68) at /usr/include/c++/8/shared_mutex:335
#3  0x000055f1a5d784f2 in std::shared_lock<std::shared_mutex>::shared_lock (this=0x7f4e37f61230, __m=...) at /usr/include/c++/8/shared_mutex:553
#4  0x000055f1a5d10712 in OSD::collect_pg_stats (this=0x55f1b3078000) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/osd/OSD.cc:7555
#5  0x000055f1a5cdfd29 in OSD::<lambda()>::operator()(void) const (__closure=0x55f1b3078528) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/osd/OSD.cc:3523
#6  0x000055f1a5d36435 in std::_Function_handler<MPGStats*(), OSD::init()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/8/bits/std_function.h:282
#7  0x000055f1a6bae67c in std::function<MPGStats* ()>::operator()() const (this=0x55f1b3078528) at /usr/include/c++/8/bits/std_function.h:687
#8  0x000055f1a6ba9c6c in MgrClient::_send_pgstats (this=0x55f1b3078230) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/mgr/MgrClient.cc:421
#9  0x000055f1a6ba88a8 in MgrClient::_send_stats (this=0x55f1b3078230) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/mgr/MgrClient.cc:285
#10 0x000055f1a6ba8868 in MgrClient::<lambda(int)>::operator()(int) const (__closure=0x55f1d06cd8e8) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/mgr/MgrClient.cc:290
#11 0x000055f1a6bac498 in LambdaContext<MgrClient::_send_stats()::<lambda(int)> >::finish(int) (this=0x55f1d06cd8e0, r=0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/include/Context.h:129
#12 0x000055f1a5d4125f in Context::complete (this=0x55f1d06cd8e0, r=0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/include/Context.h:77
#13 0x000055f1a6832edd in SafeTimer::timer_thread (this=0x55f1b3078410) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Timer.cc:96
#14 0x000055f1a68344aa in SafeTimerThread::entry (this=0x55f1b439cf30) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Timer.cc:30
#15 0x000055f1a6826dd4 in Thread::entry_wrapper (this=0x55f1b439cf30) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Thread.cc:84
#16 0x000055f1a6826d4e in Thread::_entry_func (arg=0x55f1b439cf30) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Thread.cc:71
#17 0x00007f4e506ad14a in start_thread () from /lib64/libpthread.so.0
#18 0x00007f4e4f817dc3 in clone () from /lib64/libc.so.6
(gdb) f 4
#4  0x000055f1a5d10712 in OSD::collect_pg_stats (this=0x55f1b3078000) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/osd/OSD.cc:7555
7555      std::shared_lock l{map_lock};
(gdb) p map_lock
$3 = {
  _M_impl = {
    _M_rwlock = {
      __data = {
        __readers = 19, 
        __writers = 0, 
        __wrphase_futex = 3, 
        __writers_futex = 1, 
        __pad3 = 0, 
        __pad4 = 0, 
        __cur_writer = 910429, 
        __shared = 0, 
        __rwelision = 0 '\000', 
        __pad1 = "\000\000\000\000\000\000", 
        __pad2 = 0, 
        __flags = 0
      }, 
      __size = "\023\000\000\000\000\000\000\000\003\000\000\000\001", '\000' <repeats 11 times>, "]\344\r", '\000' <repeats 28 times>, 
      __align = 19
    }
  }
}
(gdb) p &map_lock
$4 = (ceph::shared_mutex *) 0x55f1b3078f68

MPGStats* OSD::collect_pg_stats()
{
  // This implementation unconditionally sends every is_primary PG's
  // stats every time we're called.  This has equivalent cost to the
  // previous implementation's worst case where all PGs are busy and
  // their stats are always enqueued for sending.
  std::shared_lock l{map_lock};
  ... ...
}

class OSD : public Dispatcher, public md_config_obs_t {
  ceph::shared_mutex map_lock = ceph::make_shared_mutex("OSD::map_lock");
};

(gdb) f 13
#13 0x000055f1a6832edd in SafeTimer::timer_thread (this=0x55f1b3078410) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Timer.cc:96
96        callback->complete(0);
(gdb) p lock
$13 = (ceph::mutex &) @0x55f1b30783b0: {
  <std::__mutex_base> = {
    _M_mutex = {
      __data = {
        __lock = 2, 
        __count = 0, 
        __owner = 910587, 
        __nusers = 1, 
        __kind = 0, 
        __spins = 0, 
        __elision = 0, 
        __list = {
          __prev = 0x0, 
          __next = 0x0
        }
      }, 
      __size = "\002\000\000\000\000\000\000\000\373\344\r\000\001", '\000' <repeats 26 times>, 
      __align = 2
    }
  }, <No data fields>}
(gdb) p &lock
$14 = (ceph::mutex *) 0x55f1b30783b0

MgrClient::MgrClient(CephContext *cct_, Messenger *msgr_, MonMap *monmap_)
  : Dispatcher(cct_), cct(cct_), msgr(msgr_), monmap(monmap_), timer(cct_, lock)
{
  ceph_assert(cct != nullptr);
}

class MgrClient : public Dispatcher
{
  ceph::mutex lock = ceph::make_mutex("MgrClient::lock");
};

910429: (gdb) t 35 hold OSD::osd_lock, hold OSD::map_lock, hold AsyncMessenger::lock, wait AsyncConnection::lock(owner=908805)
[Switching to thread 35 (Thread 0x7f4e3ad5b700 (LWP 910429))]
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f4e506af979 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x000055f1a5c9e9da in __gthread_mutex_lock (__mutex=0x55f1da3b9e40) at /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:748
#3  0x000055f1a5ca8c4c in std::mutex::lock (this=0x55f1da3b9e40) at /usr/include/c++/8/bits/std_mutex.h:103
#4  0x000055f1a6dc6ff5 in AsyncConnection::stop (this=0x55f1da3b9c00, queue_reset=true) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncConnection.cc:729
#5  0x000055f1a6acee25 in AsyncMessenger::shutdown_connections (this=0x55f1b247a800, queue_reset=true) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.cc:792
#6  0x000055f1a6ad3117 in AsyncMessenger::mark_down_all (this=0x55f1b247a800) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.h:160
#7  0x000055f1a6acade0 in AsyncMessenger::rebind (this=0x55f1b247a800, avoid_ports=std::set with 2 elements = {...}) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.cc:431
#8  0x000055f1a5d186b9 in OSD::_committed_osd_maps (this=0x55f1b3078000, first=3957, last=3960, m=0x55f1b98c3c00) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/osd/OSD.cc:8284
#9  0x000055f1a5d66487 in C_OnMapCommit::finish (this=0x55f1c783fae0, r=0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/osd/OSD.cc:7704
#10 0x000055f1a5d4125f in Context::complete (this=0x55f1c783fae0, r=0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/include/Context.h:77
#11 0x000055f1a67eda3b in Finisher::finisher_thread_entry (this=0x55f1b2482478) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Finisher.cc:66
#12 0x000055f1a5d52ba2 in Finisher::FinisherThread::entry (this=0x55f1b2482568) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Finisher.h:63
#13 0x000055f1a6826dd4 in Thread::entry_wrapper (this=0x55f1b2482568) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Thread.cc:84
#14 0x000055f1a6826d4e in Thread::_entry_func (arg=0x55f1b2482568) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Thread.cc:71
#15 0x00007f4e506ad14a in start_thread () from /lib64/libpthread.so.0
#16 0x00007f4e4f817dc3 in clone () from /lib64/libc.so.6
(gdb) f 4
#4  0x000055f1a6dc6ff5 in AsyncConnection::stop (this=0x55f1da3b9c00, queue_reset=true) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncConnection.cc:729
729      lock.lock();
(gdb) p lock
$13 = {
  <std::__mutex_base> = {
    _M_mutex = {
      __data = {
        __lock = 2, 
        __count = 0, 
        __owner = 908805, 
        __nusers = 1, 
        __kind = 0, 
        __spins = 0, 
        __elision = 0, 
        __list = {
          __prev = 0x0, 
          __next = 0x0
        }
      }, 
      __size = "\002\000\000\000\000\000\000\000\005\336\r\000\001", '\000' <repeats 26 times>, 
      __align = 2
    }
  }, <No data fields>}

(gdb) p &lock
$3 = (std::mutex *) 0x55f1da3b9e40

void AsyncConnection::stop(bool queue_reset) {
  lock.lock();                    ///0x55f1da3b9e40
  bool need_queue_reset = (state != STATE_CLOSED) && queue_reset;
  protocol->stop();
  lock.unlock();
  if (need_queue_reset) dispatch_queue->queue_reset(this);
}

class AsyncConnection : public Connection {
  std::mutex lock;
};

(gdb) f 5
#5  0x000055f1a6acee25 in AsyncMessenger::shutdown_connections (this=0x55f1b247a800, queue_reset=true) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.cc:792
792        c->stop(queue_reset);
(gdb) p lock
$4 = {
  <std::__mutex_base> = {
    _M_mutex = {
      __data = {
        __lock = 2, 
        __count = 0, 
        __owner = 910429, 
        __nusers = 1, 
        __kind = 0, 
        __spins = 0, 
        __elision = 0, 
        __list = {
          __prev = 0x0, 
          __next = 0x0
        }
      }, 
      __size = "\002\000\000\000\000\000\000\000]\344\r\000\001", '\000' <repeats 26 times>, 
      __align = 2
    }
  }, <No data fields>}
(gdb) p &lock
$5 = (ceph::mutex *) 0x55f1b247ae40

void AsyncMessenger::shutdown_connections(bool queue_reset)
{
  ldout(cct,1) << __func__ << " " << dendl;
  std::lock_guard l{lock};        ///0x55f1b247ae40
  for (const auto& c : accepting_conns) {
    ldout(cct, 5) << __func__ << " accepting_conn " << c << dendl;
    c->stop(queue_reset);
  }
  accepting_conns.clear();

  for (const auto& [e, c] : conns) {
    ldout(cct, 5) << __func__ << " mark down " << e << " " << c << dendl;
    c->get_perf_counter()->dec(l_msgr_active_connections);
    c->stop(queue_reset);
  }
  ... ...
}

class AsyncMessenger : public SimplePolicyMessenger {
  ceph::mutex lock = ceph::make_mutex("AsyncMessenger::lock");
};

(gdb) f 8
#8  0x000055f1a5d186b9 in OSD::_committed_osd_maps (this=0x55f1b3078000, first=3957, last=3960, m=0x55f1b98c3c00) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/osd/OSD.cc:8284
8284        int r = cluster_messenger->rebind(avoid_ports);
(gdb) p map_lock
$5 = {
  _M_impl = {
    _M_rwlock = {
      __data = {
        __readers = 19, 
        __writers = 0, 
        __wrphase_futex = 3, 
        __writers_futex = 1, 
        __pad3 = 0, 
        __pad4 = 0, 
        __cur_writer = 910429, 
        __shared = 0, 
        __rwelision = 0 '\000', 
        __pad1 = "\000\000\000\000\000\000", 
        __pad2 = 0, 
        __flags = 0
      }, 
      __size = "\023\000\000\000\000\000\000\000\003\000\000\000\001", '\000' <repeats 11 times>, "]\344\r", '\000' <repeats 28 times>, 
      __align = 19
    }
  }
}
(gdb) p &map_lock
$6 = (ceph::shared_mutex *) 0x55f1b3078f68

void OSD::_committed_osd_maps(epoch_t first, epoch_t last, MOSDMap *m) {
    dout(10) << __func__ << " " << first << ".." << last << dendl;
    if (is_stopping()) {
        dout(10) << __func__ << " bailing, we are shutting down" << dendl;
        return;
    }
    std::lock_guard l(osd_lock);
    if (is_stopping()) {
        dout(10) << __func__ << " bailing, we are shutting down" << dendl;
        return;
    }
    map_lock.lock();
    ... ...
}

908805: (gdb) t 3 hold AsyncConnection::lock, wait AsyncMessenger::lock(owner=910429)
[Switching to thread 3 (Thread 0x7f4e4b837700 (LWP 908805))]
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f4e506af979 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x000055f1a5c9e9da in __gthread_mutex_lock (__mutex=0x55f1b247ae40) at /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:748
#3  0x000055f1a5ca8c4c in std::mutex::lock (this=0x55f1b247ae40) at /usr/include/c++/8/bits/std_mutex.h:103
#4  0x000055f1a5cab1e2 in std::lock_guard<std::mutex>::lock_guard (this=0x7f4e4b8336f8, __m=...) at /usr/include/c++/8/bits/std_mutex.h:162
#5  0x000055f1a6acf76f in AsyncMessenger::accept_conn (this=0x55f1b247a800, conn=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.cc:847
#6  0x000055f1a6e147ef in ProtocolV2::send_server_ident (this=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:2807
#7  0x000055f1a6e12234 in ProtocolV2::handle_existing_connection (this=0x55f1e95e7700, existing=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:2538
#8  0x000055f1a6e0f456 in ProtocolV2::handle_client_ident (this=0x55f1e95e7700, payload=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:2356
#9  0x000055f1a6e01a2d in ProtocolV2::handle_frame_payload (this=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:1226
#10 0x000055f1a6e00a3c in ProtocolV2::handle_read_frame_dispatch (this=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:1135
#11 0x000055f1a6e02719 in ProtocolV2::_handle_read_frame_epilogue_main (this=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:1325
#12 0x000055f1a6e0169e in ProtocolV2::_handle_read_frame_segment (this=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:1194
#13 0x000055f1a6e015ac in ProtocolV2::handle_read_frame_segment (this=0x55f1e95e7700, rx_buffer=..., r=0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:1186
#14 0x000055f1a6e2c47a in CtRxNode<ProtocolV2>::call (this=0x55f1e95e7a88, foo=0x55f1e95e7700) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/Protocol.h:67
#15 0x000055f1a6df4ac0 in ProtocolV2::run_continuation (this=0x55f1e95e7700, continuation=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:47
#16 0x000055f1a6dfbd30 in ProtocolV2::<lambda(char*, int)>::operator()(char *, int) const (__closure=0x55f1f1309fa0, buffer=0x55f1c22e0b00 "\b\001\236", r=0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:755
#17 0x000055f1a6e167c7 in std::_Function_handler<void(char*, long int), ProtocolV2::read(CONTINUATION_RXBPTR_TYPE<ProtocolV2>&, rx_buffer_t&&)::<lambda(char*, int)> >::_M_invoke(const std::_Any_data &, char *&&, long &&) (__functor=..., __args#0=@0x7f4e4b8341c0: 0x55f1c22e0b00 "\b\001\236", __args#1=@0x7f4e4b8341b8: 0) at /usr/include/c++/8/bits/std_function.h:297
#18 0x000055f1a6dca24d in std::function<void (char*, long)>::operator()(char*, long) const (this=0x55f1f1309fa0, __args#0=0x55f1c22e0b00 "\b\001\236", __args#1=0) at /usr/include/c++/8/bits/std_function.h:687
#19 0x000055f1a6dc4b27 in AsyncConnection::process (this=0x55f1f1309c00) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncConnection.cc:457
#20 0x000055f1a6dc9644 in C_handle_read::do_request (this=0x55f1cfa0d070, fd_or_id=232) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncConnection.cc:71
#21 0x000055f1a6ae7e3a in EventCenter::process_events (this=0x55f1b241d280, timeout_microseconds=30000000, working_dur=0x7f4e4b834578) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/Event.cc:406
#22 0x000055f1a6af0826 in NetworkStack::<lambda()>::operator()(void) const (__closure=0x55f1b243bbf8) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/Stack.cc:53
#23 0x000055f1a6af1d41 in std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/8/bits/std_function.h:297
#24 0x000055f1a5ff0dfa in std::function<void ()>::operator()() const (this=0x55f1b243bbf8) at /usr/include/c++/8/bits/std_function.h:687
#25 0x000055f1a6e2e9e1 in std::__invoke_impl<void, std::function<void ()>>(std::__invoke_other, std::function<void ()>&&) (__f=...) at /usr/include/c++/8/bits/invoke.h:60
#26 0x000055f1a6e2e1c5 in std::__invoke<std::function<void ()>>(std::function<void ()>&&) (__fn=...) at /usr/include/c++/8/bits/invoke.h:95
#27 0x000055f1a6e2ffce in std::thread::_Invoker<std::tuple<std::function<void ()> > >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x55f1b243bbf8) at /usr/include/c++/8/thread:244
#28 0x000055f1a6e2ffa0 in std::thread::_Invoker<std::tuple<std::function<void ()> > >::operator()() (this=0x55f1b243bbf8) at /usr/include/c++/8/thread:253
#29 0x000055f1a6e2ff80 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void ()> > > >::_M_run() (this=0x55f1b243bbf0) at /usr/include/c++/8/thread:196
#30 0x00007f4e5013cba3 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#31 0x00007f4e506ad14a in start_thread () from /lib64/libpthread.so.0
#32 0x00007f4e4f817dc3 in clone () from /lib64/libc.so.6
(gdb) f 5
#5  0x000055f1a6acf76f in AsyncMessenger::accept_conn (this=0x55f1b247a800, conn=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/AsyncMessenger.cc:847
847      std::lock_guard l{lock};
(gdb) p lock
$6 = {
  <std::__mutex_base> = {
    _M_mutex = {
      __data = {
        __lock = 2, 
        __count = 0, 
        __owner = 910429, 
        __nusers = 1, 
        __kind = 0, 
        __spins = 0, 
        __elision = 0, 
        __list = {
          __prev = 0x0, 
          __next = 0x0
        }
      }, 
      __size = "\002\000\000\000\000\000\000\000]\344\r\000\001", '\000' <repeats 26 times>, 
      __align = 2
    }
  }, <No data fields>}
(gdb) p &lock
$7 = (ceph::mutex *) 0x55f1b247ae40

int AsyncMessenger::accept_conn(const AsyncConnectionRef& conn)
{
  std::lock_guard l{lock};            ///0x55f1b247ae40 
  if (conn->policy.server &&
      conn->policy.lossy &&
      !conn->policy.register_lossy_clients) {
    anon_conns.insert(conn);
    conn->get_perf_counter()->inc(l_msgr_active_connections);
    return 0;
  }
  auto it = conns.find(*conn->peer_addrs);
  if (it != conns.end()) {
    auto& existing = it->second;

    // lazy delete, see "deleted_conns" 
    // If conn already in, we will return 0
    std::lock_guard l{deleted_lock};
    if (deleted_conns.erase(existing)) {
      conns.erase(it);
    } else if (conn != existing) {
    ... ...
}

class AsyncMessenger : public SimplePolicyMessenger {
  ceph::mutex lock = ceph::make_mutex("AsyncMessenger::lock");
};

(gdb) f 7
#7  0x000055f1a6e12234 in ProtocolV2::handle_existing_connection (this=0x55f1e95e7700, existing=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:2538
2538        return send_server_ident();
(gdb) p existing
$8 = (const AsyncConnectionRef &) @0x7f4e4b833af8: {
  px = 0x55f1da3b9c00
}
(gdb) p l
$9 = {
  _M_device = @0x55f1da3b9e40
}
(gdb) p l._M_device
$10 = (std::lock_guard<std::mutex>::mutex_type &) @0x55f1da3b9e40: {
  <std::__mutex_base> = {
    _M_mutex = {
      __data = {
        __lock = 2, 
        __count = 0, 
        __owner = 908805, 
        __nusers = 1, 
        __kind = 0, 
        __spins = 0, 
        __elision = 0, 
        __list = {
          __prev = 0x0, 
          __next = 0x0
        }
      }, 
      __size = "\002\000\000\000\000\000\000\000\005\336\r\000\001", '\000' <repeats 26 times>, 
      __align = 2
    }
  }, <No data fields>}
(gdb) f 7
#7  0x000055f1a6e12234 in ProtocolV2::handle_existing_connection (this=0x55f1e95e7700, existing=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/msg/async/ProtocolV2.cc:2538
2538        return send_server_ident();
(gdb) p existing.px->lock
$15 = {
  <std::__mutex_base> = {
    _M_mutex = {
      __data = {
        __lock = 2, 
        __count = 0, 
        __owner = 908805, 
        __nusers = 1, 
        __kind = 0, 
        __spins = 0, 
        __elision = 0, 
        __list = {
          __prev = 0x0, 
          __next = 0x0
        }
      }, 
      __size = "\002\000\000\000\000\000\000\000\005\336\r\000\001", '\000' <repeats 26 times>, 
      __align = 2
    }
  }, <No data fields>}
(gdb) p &existing.px->lock
$16 = (std::mutex *) 0x55f1da3b9e40

CtPtr ProtocolV2::handle_existing_connection(const AsyncConnectionRef& existing) {
  ldout(cct, 20) << __func__ << " existing=" << existing << dendl;

  std::lock_guard<std::mutex> l(existing->lock);  //0x55f1da3b9e40

  ProtocolV2 *exproto = dynamic_cast<ProtocolV2 *>(existing->protocol.get());
  if (!exproto) {
    ldout(cct, 1) << __func__ << " existing=" << existing << dendl;
    ceph_assert(false);
  }

  if (exproto->state == CLOSED) {
    ldout(cct, 1) << __func__ << " existing " << existing << " already closed." 
                  << dendl;
    return send_server_ident();
  }
  ... ...
}

class AsyncConnection : public Connection {
  std::mutex lock;
};

908910: (gdb) t 9 wait OSD::osd_lock(owner=910429)
[Switching to thread 9 (Thread 0x7f4e49032700 (LWP 908910))]
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f4e506b18e7 in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#2  0x00007f4e506b34fc in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#3  0x000055f1a5cbdaa1 in __gthread_cond_timedwait (__cond=0x55f1b3078058, __mutex=0x55f1b3078018, __abs_timeout=0x7f4e4902f530) at /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:871
#4  0x000055f1a5dff05e in std::condition_variable::__wait_until_impl<std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > (this=0x55f1b3078058, __lock=..., __atime=...) at /usr/include/c++/8/condition_variable:178
#5  0x000055f1a68349e8 in std::condition_variable::wait_until<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > (this=0x55f1b3078058, __lock=..., __atime=...) at /usr/include/c++/8/condition_variable:119
#6  0x000055f1a6833075 in SafeTimer::timer_thread (this=0x55f1b3078040) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Timer.cc:108
#7  0x000055f1a68344aa in SafeTimerThread::entry (this=0x55f1b24e05d0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Timer.cc:30
#8  0x000055f1a6826dd4 in Thread::entry_wrapper (this=0x55f1b24e05d0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Thread.cc:84
#9  0x000055f1a6826d4e in Thread::_entry_func (arg=0x55f1b24e05d0) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Thread.cc:71
#10 0x00007f4e506ad14a in start_thread () from /lib64/libpthread.so.0
#11 0x00007f4e4f817dc3 in clone () from /lib64/libc.so.6
(gdb) f 6
#6  0x000055f1a6833075 in SafeTimer::timer_thread (this=0x55f1b3078040) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Timer.cc:108
108          cond.wait_until(l, schedule.begin()->first);
(gdb) p lock
$7 = (ceph::mutex &) @0x55f1b3078018: {
  <std::__mutex_base> = {
    _M_mutex = {
      __data = {
        __lock = 2, 
        __count = 0, 
        __owner = 910429, 
        __nusers = 2, 
        __kind = 0, 
        __spins = 0, 
        __elision = 0, 
        __list = {
          __prev = 0x0, 
          __next = 0x0
        }
      }, 
      __size = "\002\000\000\000\000\000\000\000]\344\r\000\002", '\000' <repeats 26 times>, 
      __align = 2
    }
  }, <No data fields>}
(gdb) p &lock
$8 = (ceph::mutex *) 0x55f1b3078018

OSD::OSD(CephContext *cct_, ObjectStore *store_, int id, Messenger *internal_messenger, Messenger *external_messenger,
         Messenger *hb_client_front, Messenger *hb_client_back, Messenger *hb_front_serverm, Messenger *hb_back_serverm,
         Messenger *osdc_messenger, MonClient *mc, const std::string &dev, const std::string &jdev)
    : Dispatcher(cct_),
      tick_timer(cct, osd_lock),

class OSD : public Dispatcher, public md_config_obs_t {
  /** OSD **/
  // global lock
  ceph::mutex osd_lock = ceph::make_mutex("OSD::osd_lock");
  SafeTimer tick_timer;    // safe timer (osd_lock)
};

SafeTimer::SafeTimer(CephContext *cct_, ceph::mutex &l, bool safe_callbacks)
  : cct(cct_), lock(l), safe_callbacks(safe_callbacks), thread(NULL), stopping(false) { }

908811: (gdb) t 7 wait OSD::osd_lock(owner=910429)
[Switching to thread 7 (Thread 0x7f4e4a835700 (LWP 908811))]
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f4e506b665d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f4e506af979 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x000055f1a5c9e9da in __gthread_mutex_lock (__mutex=0x55f1b3078018) at /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:748
#3  0x000055f1a5ca8c4c in std::mutex::lock (this=0x55f1b3078018) at /usr/include/c++/8/bits/std_mutex.h:103
#4  0x000055f1a5cab1e2 in std::lock_guard<std::mutex>::lock_guard (this=0x7f4e4a831020, __m=...) at /usr/include/c++/8/bits/std_mutex.h:162
#5  0x000055f1a5cdbaa8 in OSD::asok_command(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, ceph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>) (this=0x55f1b3078000, prefix="dump_pg_recovery_stats", cmdmap=std::map with 1 element = {...}, f=0x55f1f028ee00, inbl=..., on_finish=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/osd/OSD.cc:2918
#6  0x000055f1a5d64ed2 in OSDSocketHook::call_async(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, ceph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>) (this=0x55f1b4b4c510, prefix="dump_pg_recovery_stats", cmdmap=std::map with 1 element = {...}, f=0x55f1f028ee00, inbl=..., on_finish=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/osd/OSD.cc:2363
#7  0x000055f1a684437c in AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ceph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>) (this=0x55f1b2404a00, cmdvec=std::vector of length 1, capacity 1 = {...}, inbl=..., on_finish=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/admin_socket.cc:509
#8  0x000055f1a6843865 in AdminSocket::execute_command (this=0x55f1b2404a00, cmd=std::vector of length 1, capacity 1 = {...}, inbl=..., errss=..., outbl=0x7f4e4a831c10) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/admin_socket.cc:454
#9  0x000055f1a6842780 in AdminSocket::do_accept (this=0x55f1b2404a00) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/admin_socket.cc:367
#10 0x000055f1a68416bf in AdminSocket::entry (this=0x55f1b2404a00) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/admin_socket.cc:255
#11 0x000055f1a684e2d5 in std::__invoke_impl<void, void (AdminSocket::*)() noexcept, AdminSocket*>(std::__invoke_memfun_deref, void (AdminSocket::*&&)() noexcept, AdminSocket*&&) (__f=@0x55f1b30b0090: (void (AdminSocket::*)(AdminSocket * const)) 0x55f1a68411be <AdminSocket::entry()>, __t=@0x55f1b30b0088: 0x55f1b2404a00) at /usr/include/c++/8/bits/invoke.h:73
#12 0x000055f1a684daf6 in std::__invoke<void (AdminSocket::*)() noexcept, AdminSocket*>(void (AdminSocket::*&&)() noexcept, AdminSocket*&&) (__fn=@0x55f1b30b0090: (void (AdminSocket::*)(AdminSocket * const)) 0x55f1a68411be <AdminSocket::entry()>) at /usr/include/c++/8/bits/invoke.h:95
#13 0x000055f1a684cd29 in std::invoke<void (AdminSocket::*)() noexcept, AdminSocket*>(void (AdminSocket::*&&)() noexcept, AdminSocket*&&) (__fn=@0x55f1b30b0090: (void (AdminSocket::*)(AdminSocket * const)) 0x55f1a68411be <AdminSocket::entry()>) at /usr/include/c++/8/functional:81
#14 0x000055f1a684b8eb in _ZZ17make_named_threadIM11AdminSocketDoFvvEJPS0_EESt6threadSt17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_ENKUlSA_SD_E_clIS2_JS3_EEEDaSA_SD_ (this=0x55f1b30b00a0, fun=@0x55f1b30b0090: (void (AdminSocket::*)(AdminSocket * const)) 0x55f1a68411be <AdminSocket::entry()>) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/common/Thread.h:79
#15 0x000055f1a684cf97 in _ZSt13__invoke_implIvZ17make_named_threadIM11AdminSocketDoFvvEJPS1_EESt6threadSt17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_EUlSB_SE_E_JS3_S4_EESA_St14__invoke_otherOT0_DpOT1_ (__f=...) at /usr/include/c++/8/bits/invoke.h:60
#16 0x000055f1a684b96c in _ZSt8__invokeIZ17make_named_threadIM11AdminSocketDoFvvEJPS1_EESt6threadSt17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_EUlSB_SE_E_JS3_S4_EENSt15__invoke_resultISA_JDpSC_EE4typeESB_SE_ (__fn=...) at /usr/include/c++/8/bits/invoke.h:95
#17 0x000055f1a684efab in _ZNSt6thread8_InvokerISt5tupleIJZ17make_named_threadIM11AdminSocketDoFvvEJPS3_EES_St17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_EUlSC_SF_E_S5_S6_EEE9_M_invokeIJLm0ELm1ELm2EEEEDTcl8__invokespcl10_S_declvalIXT_EEEEESt12_Index_tupleIJXspT_EEE (this=0x55f1b30b0088) at /usr/include/c++/8/thread:244
#18 0x000055f1a684ef46 in _ZNSt6thread8_InvokerISt5tupleIJZ17make_named_threadIM11AdminSocketDoFvvEJPS3_EES_St17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_EUlSC_SF_E_S5_S6_EEEclEv (this=0x55f1b30b0088) at /usr/include/c++/8/thread:253
#19 0x000055f1a684ef26 in _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZ17make_named_threadIM11AdminSocketDoFvvEJPS4_EES_St17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_EUlSD_SG_E_S6_S7_EEEEE6_M_runEv (this=0x55f1b30b0080) at /usr/include/c++/8/thread:196
#20 0x00007f4e5013cba3 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#21 0x00007f4e506ad14a in start_thread () from /lib64/libpthread.so.0
#22 0x00007f4e4f817dc3 in clone () from /lib64/libc.so.6
(gdb) f 5
#5  0x000055f1a5cdbaa8 in OSD::asok_command(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, ceph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>) (this=0x55f1b3078000, prefix="dump_pg_recovery_stats", cmdmap=std::map with 1 element = {...}, f=0x55f1f028ee00, inbl=..., on_finish=...) at /usr/src/debug/ceph-15.2.13-tag_GP2.0_202205132154.el8.x86_64/src/osd/OSD.cc:2918
2918        lock_guard l(osd_lock);
(gdb) p osd_lock
$9 = {
  <std::__mutex_base> = {
    _M_mutex = {
      __data = {
        __lock = 2, 
        __count = 0, 
        __owner = 910429, 
        __nusers = 2, 
        __kind = 0, 
        __spins = 0, 
        __elision = 0, 
        __list = {
          __prev = 0x0, 
          __next = 0x0
        }
      }, 
      __size = "\002\000\000\000\000\000\000\000]\344\r\000\002", '\000' <repeats 26 times>, 
      __align = 2
    }
  }, <No data fields>}
(gdb) p &osd_lock
$10 = (ceph::mutex *) 0x55f1b3078018

void OSD::asok_command(std::string_view prefix, const cmdmap_t &cmdmap, Formatter *f, const bufferlist &inbl, std::function<void(int, const std::string &, bufferlist &)> on_finish)
{
  ... ...
  else if (prefix == "dump_pg_recovery_stats") {
    lock_guard l(osd_lock);
    pg_recovery_stats.dump_formatted(f);
  }
  ... ...
}

#11 Updated by jianwei zhang almost 2 years ago

I think this problem may be a problem with ProtocolV2

thread-35  
AsyncMessenger::shutdown_connections         hold             AsyncMessenger::lock            std::lock_guard l{lock}
AsyncConnection::stop                         wait                AsyncConnection::lock            lock.lock()

thread-3
ProtocolV2::handle_existing_connection        hold                AsyncConnection::lock            std::lock_guard<std::mutex> l(existing->lock)
AsyncMessenger::accept_conn                wait                AsyncMessenger::lock            std::lock_guard l{lock}

#12 Updated by Radoslaw Zarzynski almost 2 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 46732

#13 Updated by Radoslaw Zarzynski almost 2 years ago

Big, big thanks, jianwei zhang, for your analysis. It was extremely helpful!

#14 Updated by jianwei zhang almost 2 years ago

Radoslaw Zarzynski wrote:

Big, big thanks, jianwei zhang, for your analysis. It was extremely helpful!

Good news, the problem has a solution, thank you

#15 Updated by Radoslaw Zarzynski over 1 year ago

  • Backport set to pacific,quincy

#16 Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from Fix Under Review to Pending Backport

#17 Updated by Backport Bot over 1 year ago

#18 Updated by Backport Bot over 1 year ago

#19 Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed

#20 Updated by Konstantin Shalygin over 1 year ago

  • Status changed from Pending Backport to Resolved
  • Target version set to v18.0.0
  • % Done changed from 0 to 100
  • Tags deleted (backport_processed)

Also available in: Atom PDF