Project

General

Profile

Bug #15758

msgr/async: Messenger thread long time lock hold risk

Added by Haomai Wang 10 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
msgr
Target version:
-
Start date:
05/06/2016
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:

Description

Thread 46 (Thread 0x7fa7365a2700 (LWP 17842)):
#0 0x00007fa73b888705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000009e135d in FileStore::op_queue_reserve_throttle(FileStore::Op*, ThreadPool::TPHandle*) ()
#2 0x00000000009f16e5 in FileStore::queue_transactions(ObjectStore::Sequencer*, std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, std::tr1::shared_ptr<TrackedOp>, ThreadPool::TPHandle*) ()
#3 0x00000000006e9b19 in ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction*, Context*, Context*, Context*, std::tr1::shared_ptr<TrackedOp>, ThreadPool::TPHandle*) ()
#4 0x00000000006a4f42 in OSD::dispatch_context(PG::RecoveryCtx&, PG*, std::tr1::shared_ptr<OSDMap const>, ThreadPool::TPHandle*) ()
#5 0x00000000006c4833 in OSD::handle_pg_peering_evt(spg_t, pg_info_t const&, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > >&, unsigned int, pg_shard_t, bool, std::tr1::shared_ptr<PG::CephPeeringEvt>) ()
#6 0x00000000006c67b3 in OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>) ()
#7 0x00000000006c8cc0 in OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>) ()
#8 0x00000000006c9ede in OSD::_dispatch(Message*) ()
#9 0x00000000006ca5c7 in OSD::ms_dispatch(Message*) ()
#10 0x0000000000d5aa97 in Messenger::ms_deliver_dispatch(Message*) ()
#11 0x0000000000d5b111 in C_handle_dispatch::do_request(int) ()
#12 0x0000000000d109a5 in EventCenter::process_events(int) ()
#13 0x0000000000cece68 in Worker::entry() ()
#14 0x00007fa73b884df5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fa739c411ad in clone () from /lib64/libc.so.6
Thread 45 (Thread 0x7fa735da1700 (LWP 17843)):
#0 0x00007fa73b88af7d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fa73b886d32 in _L_lock_791 () from /lib64/libpthread.so.0
#2 0x00007fa73b886c38 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000bfdbe8 in Mutex::Lock(bool) ()
#4 0x00000000006ca3a5 in OSD::ms_dispatch(Message*) ()
#5 0x0000000000d5aa97 in Messenger::ms_deliver_dispatch(Message*) ()
#6 0x0000000000d5b111 in C_handle_dispatch::do_request(int) ()
#7 0x0000000000d109a5 in EventCenter::process_events(int) ()
#8 0x0000000000cece68 in Worker::entry() ()
#9 0x00007fa73b884df5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fa739c411ad in clone () from /lib64/libc.so.6

osd_lock is main problem, in heavy io pressure. It's easy to block messenger thread dispatch message for seconds. If we create a big pool under heavy io pressure, it's easy to let osd down/in because of heartbeat message blocking. It will be more easy to show up in async messenger than simple messenger.


Related issues

Copied to Backport #16377: jewel: msgr/async: Messenger thread long time lock hold risk Resolved

History

#1 Updated by Haomai Wang 10 months ago

The lock time depends on the ObjectStore process time from 1s-5s even longer. It's easy to cause IO shake when exists non-fast dispatch message in.

#2 Updated by Haomai Wang 10 months ago

It looks like async messenger need a extra DispatchQueue to handle nonfast message which may block thread totally.

#3 Updated by Haomai Wang 10 months ago

  • Subject changed from Messenger thread long time lock hold risk to msgr/async: Messenger thread long time lock hold risk
  • Category set to msgr

#4 Updated by Greg Farnum 10 months ago

Is this something that newly blocks for a long time?

Or is the problem that AsyncMessenger doesn't have per-connection threads and so all the other message processing gets blocked up? (And so this has been a problem for a long time, but SimpleMessenger masked it.)

#5 Updated by Haomai Wang 10 months ago

yes, async messenger always easy to expose problem than simple. actually this pr(https://github.com/ceph/ceph/pull/8808) help a lot. If FileStore is busy in sync, a nonfast message like pg_log(from create pool action) will hold osd_lock and stuck into filestore condition wait. After we have OSDPing message fast dispatch, it could avoid osd down because of heartbeat timeout while ping message in the queue.

#6 Updated by Sage Weil 8 months ago

  • Status changed from New to Pending Backport
  • Backport set to jewel

let's take our time backporting this... it should bake in master for a while first!

#7 Updated by Nathan Cutler 8 months ago

  • Copied to Backport #16377: jewel: msgr/async: Messenger thread long time lock hold risk added

#8 Updated by Loic Dachary 5 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF