Bug #15440

closed

msg/async: deadlock on delayed delivery?

Added by Sage Weil about 8 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Thread 2 (Thread 0x7fa196ba9700 (LWP 6954)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fa19d6f2657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fa19d6f2480 in __GI___pthread_mutex_lock (mutex=0x7fa1ab41daf0) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007fa19f3ddf68 in Mutex::Lock(bool) ()
#4  0x00007fa19f53f07b in AsyncConnection::handle_connect_msg(ceph_msg_connect&, ceph::buffer::list&, ceph::buffer::list&) ()
#5  0x00007fa19f542b7c in AsyncConnection::_process_connection() ()
#6  0x00007fa19f548cd8 in AsyncConnection::process() ()
#7  0x00007fa19f4ed0a5 in EventCenter::process_events(int) ()
#8  0x00007fa19f4ce350 in Worker::entry() ()
#9  0x00007fa19d6f0182 in start_thread (arg=0x7fa196ba9700) at pthread_create.c:312
#10 0x00007fa19b81e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7fa1973aa700 (LWP 6953)):
#0  0x00007fa19d6f166b in pthread_join (threadid=140331299120896, thread_return=0x0) at pthread_join.c:92
#1  0x00007fa19f411c80 in Thread::join(void**) ()
#2  0x00007fa19f539c44 in AsyncConnection::_stop() ()
#3  0x00007fa19f53c8c6 in AsyncConnection::fault() ()
#4  0x00007fa19f548abd in AsyncConnection::process() ()
#5  0x00007fa19f4ed0a5 in EventCenter::process_events(int) ()
#6  0x00007fa19f4ce350 in Worker::entry() ()
#7  0x00007fa19d6f0182 in start_thread (arg=0x7fa1973aa700) at pthread_create.c:312
#8  0x00007fa19b81e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 8 (Thread 0x7fa1963a8700 (LWP 6955)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fa19d6f2657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fa19d6f2480 in __GI___pthread_mutex_lock (mutex=0x7fa1ab2d6028) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007fa19f3ddf68 in Mutex::Lock(bool) ()
#4  0x00007fa19edfb8b2 in OSD::ms_dispatch(Message*) ()
#5  0x00007fa19f54c8b7 in Messenger::ms_deliver_dispatch(Message*) ()
#6  0x00007fa19f54d011 in C_handle_dispatch::do_request(int) ()
#7  0x00007fa19f4ed76d in EventCenter::process_events(int) ()
#8  0x00007fa19f4ce350 in Worker::entry() ()
#9  0x00007fa19d6f0182 in start_thread (arg=0x7fa1963a8700) at pthread_create.c:312
#10 0x00007fa19b81e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 14 (Thread 0x7fa18fb9b700 (LWP 6968)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fa19d6f2657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fa19d6f2480 in __GI___pthread_mutex_lock (mutex=0x7fa1ab6da530) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007fa19f3ddf68 in Mutex::Lock(bool) ()
#4  0x00007fa19ee67cc0 in PG::lock(bool) const ()
#5  0x00007fa19edef8ce in OSD::consume_map() ()
#6  0x00007fa19edf2d7e in OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) ()
#7  0x00007fa19ee03009 in Context::complete(int) ()
#8  0x00007fa19f3608b6 in Finisher::finisher_thread_entry() ()
#9  0x00007fa19d6f0182 in start_thread (arg=0x7fa18fb9b700) at pthread_create.c:312
#10 0x00007fa19b81e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 28 (Thread 0x7fa184203700 (LWP 7032)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fa19d6f2657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fa19d6f2480 in __GI___pthread_mutex_lock (mutex=0x7fa1ab41daf0) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007fa19f3ddf68 in Mutex::Lock(bool) ()
#4  0x00007fa19f54c2aa in AsyncConnection::is_connected() ()
#5  0x00007fa19edaf636 in OSD::op_is_discardable(MOSDOp*) ()
#6  0x00007fa19ee7b809 in PG::can_discard_op(std::shared_ptr<OpRequest>&) ()
#7  0x00007fa19ee7bef5 in PG::can_discard_request(std::shared_ptr<OpRequest>&) ()
#8  0x00007fa19ef15766 in ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&) ()
#9  0x00007fa19edd6415 in OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&) ()
#10 0x00007fa19edd663d in PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&) ()
#11 0x00007fa19eddb049 in OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) ()
#12 0x00007fa19f41e8f7 in ShardedThreadPool::shardedthreadpool_worker(unsigned int) ()
#13 0x00007fa19f420820 in ShardedThreadPool::WorkThreadSharded::entry() ()
#14 0x00007fa19d6f0182 in start_thread (arg=0x7fa184203700) at pthread_create.c:312
#15 0x00007fa19b81e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 32 (Thread 0x7fa16d3c2700 (LWP 8876)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fa19d6f2657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fa19d6f2480 in __GI___pthread_mutex_lock (mutex=0x7fa1ab41daf0) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007fa19f3ddf68 in Mutex::Lock(bool) ()
#4  0x00007fa19f54c2aa in AsyncConnection::is_connected() ()
#5  0x00007fa19edaf636 in OSD::op_is_discardable(MOSDOp*) ()
#6  0x00007fa19ede2eba in OSD::handle_op(std::shared_ptr<OpRequest>&, std::shared_ptr<OSDMap const>&) ()
#7  0x00007fa19ede40de in OSD::dispatch_op_fast(std::shared_ptr<OpRequest>&, std::shared_ptr<OSDMap const>&) ()
#8  0x00007fa19ede43d8 in OSD::dispatch_session_waiting(OSD::Session*, std::shared_ptr<OSDMap const>) ()
#9  0x00007fa19ede4724 in OSD::ms_fast_dispatch(Message*) ()
#10 0x00007fa19f535f55 in AsyncConnection::DelayedDelivery::entry() ()
#11 0x00007fa19d6f0182 in start_thread (arg=0x7fa16d3c2700) at pthread_create.c:312
#12 0x00007fa19b81e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 102 (Thread 0x7fa186a08700 (LWP 7027)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fa19d6f2657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fa19d6f2480 in __GI___pthread_mutex_lock (mutex=0x7fa1ab6da530) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007fa19f3ddf68 in Mutex::Lock(bool) ()
#4  0x00007fa19ee67cc0 in PG::lock(bool) const ()
#5  0x00007fa19ee67e4e in PG::lock_suspend_timeout(ThreadPool::TPHandle&) ()
#6  0x00007fa19eddaabe in OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) ()
#7  0x00007fa19f41e8f7 in ShardedThreadPool::shardedthreadpool_worker(unsigned int) ()
#8  0x00007fa19f420820 in ShardedThreadPool::WorkThreadSharded::entry() ()
#9  0x00007fa19d6f0182 in start_thread (arg=0x7fa186a08700) at pthread_create.c:312
#10 0x00007fa19b81e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

/a/sage-2016-04-07_14:41:26-rados-wip-sage-testing-distro-basic-smithi/114688

Actions #1

Updated by Sage Weil about 8 years ago

Actions #2

Updated by Haomai Wang about 8 years ago

This bug is somewhat similar to http://tracker.ceph.com/issues/15412 .

The deadlock case is:

1. conn 1 holds its own connection lock and needs to acquire conn 2's lock while replacing the connection.
2. conn 2, which is dispatching a normal (not fast) message, already holds its own connection lock and is waiting for the PG lock.
3. The DelayedDelivery thread for fast dispatch holds the PG lock and wants to acquire conn 1's connection lock.

These three waits form a cycle, so the deadlock occurs. It could not happen before, because each connection was owned by a dedicated thread that did the dispatching, so no other thread would acquire the connection lock.
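
The three-way wait described above can be checked mechanically with a small wait-for graph. The following is only an illustrative sketch, not Ceph code: the `Actor` type, the function names, and the lock labels (`conn1_lock`, `conn2_lock`, `pg_lock`) are hypothetical stand-ins for the connection and PG mutexes mentioned in the comment.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical model: each actor holds some locks and waits for at most one.
struct Actor {
  std::string name;
  std::set<std::string> holds;
  std::string waits_for;  // empty string means "not waiting"
};

// Returns true if some actor, by following "waits for a lock held by"
// edges, eventually reaches itself again (a deadlock cycle).
bool has_lock_cycle(const std::vector<Actor>& actors) {
  // Map each held lock to the index of the actor holding it.
  std::map<std::string, std::size_t> owner;
  for (std::size_t i = 0; i < actors.size(); ++i)
    for (const auto& l : actors[i].holds) owner[l] = i;

  for (std::size_t start = 0; start < actors.size(); ++start) {
    std::set<std::size_t> seen;
    std::size_t cur = start;
    while (true) {
      auto it = owner.find(actors[cur].waits_for);
      if (it == owner.end()) break;         // awaited lock is free (or none)
      cur = it->second;
      if (cur == start) return true;        // walked back to the start
      if (!seen.insert(cur).second) break;  // loop that excludes start
    }
  }
  return false;
}

// The scenario from the numbered list above, with assumed lock labels.
std::vector<Actor> reported_scenario() {
  return {
      {"conn 1 worker", {"conn1_lock"}, "conn2_lock"},
      {"conn 2 worker", {"conn2_lock"}, "pg_lock"},
      {"delayed delivery", {"pg_lock"}, "conn1_lock"},
  };
}

// Same scenario, assuming the delayed delivery thread no longer needs
// conn 1's lock (for example, if is_connected stopped taking it).
std::vector<Actor> scenario_without_conn_lock_wait() {
  std::vector<Actor> a = reported_scenario();
  a[2].waits_for.clear();
  return a;
}
```

Calling `has_lock_cycle(reported_scenario())` reports a cycle, while breaking any one of the three edges, as in `scenario_without_conn_lock_wait()`, removes it.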

In theory, the perfect fix is to not have any other thread that can deliver messages. This could also apply to the simple messenger.
The incomplete fix is to remove the lock acquisition in is_connected, so that the callee no longer takes the upper-level lock (the connection's lock). Of course, the mark_down API still exists, and it still requires the lock.
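
One way a lock-free is_connected could look is sketched below. This is an assumption about the general shape of such a change, not the actual Ceph patch: `SketchConnection`, `set_connected`, and `query_while_locked` are hypothetical names, and the real AsyncConnection tracks considerably more state than a single flag.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a connection; not Ceph's AsyncConnection.
class SketchConnection {
 public:
  // State changes still happen under the connection lock.
  void set_connected(bool v) {
    std::lock_guard<std::mutex> g(lock_);
    connected_.store(v, std::memory_order_release);
  }

  // The query reads an atomic flag and never takes the connection lock,
  // so a caller that already holds a PG lock cannot block here.
  bool is_connected() const {
    return connected_.load(std::memory_order_acquire);
  }

  std::mutex& lock() { return lock_; }

 private:
  std::mutex lock_;
  std::atomic<bool> connected_{false};
};

// Demonstrates that the query completes while the connection lock is held.
// If is_connected() locked lock_ itself, this call would never return.
bool query_while_locked(SketchConnection& c) {
  std::lock_guard<std::mutex> g(c.lock());
  return c.is_connected();
}
```

As the comment notes, this only removes one edge of the cycle: any other call made from the dispatch path that still takes the connection lock, such as mark_down, can reintroduce it.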

Actions #4

Updated by Haomai Wang almost 8 years ago

  • Status changed from 12 to Resolved