Bug #15758
closedmsgr/async: Messenger thread long time lock hold risk
0%
Description
Thread 46 (Thread 0x7fa7365a2700 (LWP 17842)):
#0 0x00007fa73b888705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000009e135d in FileStore::op_queue_reserve_throttle(FileStore::Op*, ThreadPool::TPHandle*) ()
#2 0x00000000009f16e5 in FileStore::queue_transactions(ObjectStore::Sequencer*, std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, std::tr1::shared_ptr<TrackedOp>, ThreadPool::TPHandle*) ()
#3 0x00000000006e9b19 in ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction*, Context*, Context*, Context*, std::tr1::shared_ptr<TrackedOp>, ThreadPool::TPHandle*) ()
#4 0x00000000006a4f42 in OSD::dispatch_context(PG::RecoveryCtx&, PG*, std::tr1::shared_ptr<OSDMap const>, ThreadPool::TPHandle*) ()
#5 0x00000000006c4833 in OSD::handle_pg_peering_evt(spg_t, pg_info_t const&, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > >&, unsigned int, pg_shard_t, bool, std::tr1::shared_ptr<PG::CephPeeringEvt>) ()
#6 0x00000000006c67b3 in OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>) ()
#7 0x00000000006c8cc0 in OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>) ()
#8 0x00000000006c9ede in OSD::_dispatch(Message*) ()
#9 0x00000000006ca5c7 in OSD::ms_dispatch(Message*) ()
#10 0x0000000000d5aa97 in Messenger::ms_deliver_dispatch(Message*) ()
#11 0x0000000000d5b111 in C_handle_dispatch::do_request(int) ()
#12 0x0000000000d109a5 in EventCenter::process_events(int) ()
#13 0x0000000000cece68 in Worker::entry() ()
#14 0x00007fa73b884df5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fa739c411ad in clone () from /lib64/libc.so.6
Thread 45 (Thread 0x7fa735da1700 (LWP 17843)):
#0 0x00007fa73b88af7d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fa73b886d32 in _L_lock_791 () from /lib64/libpthread.so.0
#2 0x00007fa73b886c38 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000bfdbe8 in Mutex::Lock(bool) ()
#4 0x00000000006ca3a5 in OSD::ms_dispatch(Message*) ()
#5 0x0000000000d5aa97 in Messenger::ms_deliver_dispatch(Message*) ()
#6 0x0000000000d5b111 in C_handle_dispatch::do_request(int) ()
#7 0x0000000000d109a5 in EventCenter::process_events(int) ()
#8 0x0000000000cece68 in Worker::entry() ()
#9 0x00007fa73b884df5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fa739c411ad in clone () from /lib64/libc.so.6
osd_lock is main problem, in heavy io pressure. It's easy to block messenger thread dispatch message for seconds. If we create a big pool under heavy io pressure, it's easy to let osd down/in because of heartbeat message blocking. It will be more easy to show up in async messenger than simple messenger.
Updated by Haomai Wang almost 8 years ago
The lock time depends on the ObjectStore process time from 1s-5s even longer. It's easy to cause IO shake when exists non-fast dispatch message in.
Updated by Haomai Wang almost 8 years ago
It looks like async messenger need a extra DispatchQueue to handle nonfast message which may block thread totally.
Updated by Haomai Wang almost 8 years ago
- Subject changed from Messenger thread long time lock hold risk to msgr/async: Messenger thread long time lock hold risk
- Category set to msgr
Updated by Greg Farnum almost 8 years ago
Is this something that newly blocks for a long time?
Or is the problem that AsyncMessenger doesn't have per-connection threads and so all the other message processing gets blocked up? (And so this has been a problem for a long time, but SimpleMessenger masked it.)
Updated by Haomai Wang almost 8 years ago
yes, async messenger always easy to expose problem than simple. actually this pr(https://github.com/ceph/ceph/pull/8808) help a lot. If FileStore is busy in sync, a nonfast message like pg_log(from create pool action) will hold osd_lock and stuck into filestore condition wait. After we have OSDPing message fast dispatch, it could avoid osd down because of heartbeat timeout while ping message in the queue.
Updated by Sage Weil almost 8 years ago
- Status changed from New to Pending Backport
- Backport set to jewel
let's take our time backporting this... it should bake in master for a while first!
Updated by Nathan Cutler almost 8 years ago
- Copied to Backport #16377: jewel: msgr/async: Messenger thread long time lock hold risk added
Updated by Loïc Dachary over 7 years ago
- Status changed from Pending Backport to Resolved
Updated by Greg Farnum about 5 years ago
- Project changed from Ceph to Messengers
- Category deleted (
msgr)