Bug #15758: msgr/async: Messenger thread long time lock hold risk - Messengers - Ceph

Actions

Copy link

Bug #15758

closed

msgr/async: Messenger thread long time lock hold risk

Added by Haomai Wang almost 8 years ago. Updated about 5 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Category:

Target version:

% Done:

Source:

other

Tags:

Backport:

jewel

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Thread 46 (Thread 0x7fa7365a2700 (LWP 17842)):
#0 0x00007fa73b888705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000009e135d in FileStore::op_queue_reserve_throttle(FileStore::Op*, ThreadPool::TPHandle*) ()
#2 0x00000000009f16e5 in FileStore::queue_transactions(ObjectStore::Sequencer*, std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, std::tr1::shared_ptr<TrackedOp>, ThreadPool::TPHandle*) ()
#3 0x00000000006e9b19 in ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction*, Context*, Context*, Context*, std::tr1::shared_ptr<TrackedOp>, ThreadPool::TPHandle*) ()
#4 0x00000000006a4f42 in OSD::dispatch_context(PG::RecoveryCtx&, PG*, std::tr1::shared_ptr<OSDMap const>, ThreadPool::TPHandle*) ()
#5 0x00000000006c4833 in OSD::handle_pg_peering_evt(spg_t, pg_info_t const&, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > >&, unsigned int, pg_shard_t, bool, std::tr1::shared_ptr<PG::CephPeeringEvt>) ()
#6 0x00000000006c67b3 in OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>) ()
#7 0x00000000006c8cc0 in OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>) ()
#8 0x00000000006c9ede in OSD::_dispatch(Message*) ()
#9 0x00000000006ca5c7 in OSD::ms_dispatch(Message*) ()
#10 0x0000000000d5aa97 in Messenger::ms_deliver_dispatch(Message*) ()
#11 0x0000000000d5b111 in C_handle_dispatch::do_request(int) ()
#12 0x0000000000d109a5 in EventCenter::process_events(int) ()
#13 0x0000000000cece68 in Worker::entry() ()
#14 0x00007fa73b884df5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fa739c411ad in clone () from /lib64/libc.so.6
Thread 45 (Thread 0x7fa735da1700 (LWP 17843)):
#0 0x00007fa73b88af7d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fa73b886d32 in _L_lock_791 () from /lib64/libpthread.so.0
#2 0x00007fa73b886c38 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000bfdbe8 in Mutex::Lock(bool) ()
#4 0x00000000006ca3a5 in OSD::ms_dispatch(Message*) ()
#5 0x0000000000d5aa97 in Messenger::ms_deliver_dispatch(Message*) ()
#6 0x0000000000d5b111 in C_handle_dispatch::do_request(int) ()
#7 0x0000000000d109a5 in EventCenter::process_events(int) ()
#8 0x0000000000cece68 in Worker::entry() ()
#9 0x00007fa73b884df5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fa739c411ad in clone () from /lib64/libc.so.6

osd_lock is main problem, in heavy io pressure. It's easy to block messenger thread dispatch message for seconds. If we create a big pool under heavy io pressure, it's easy to let osd down/in because of heartbeat message blocking. It will be more easy to show up in async messenger than simple messenger.

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Haomai Wang almost 8 years ago

The lock time depends on the ObjectStore process time from 1s-5s even longer. It's easy to cause IO shake when exists non-fast dispatch message in.

Actions

Copy link

Updated by Haomai Wang almost 8 years ago

It looks like async messenger need a extra DispatchQueue to handle nonfast message which may block thread totally.

Actions

Copy link

Updated by Haomai Wang almost 8 years ago

Subject changed from Messenger thread long time lock hold risk to msgr/async: Messenger thread long time lock hold risk
Category set to msgr

Actions

Copy link

Updated by Greg Farnum almost 8 years ago

Is this something that newly blocks for a long time?

Or is the problem that AsyncMessenger doesn't have per-connection threads and so all the other message processing gets blocked up? (And so this has been a problem for a long time, but SimpleMessenger masked it.)

Actions

Copy link

Updated by Haomai Wang almost 8 years ago

yes, async messenger always easy to expose problem than simple. actually this pr(https://github.com/ceph/ceph/pull/8808) help a lot. If FileStore is busy in sync, a nonfast message like pg_log(from create pool action) will hold osd_lock and stuck into filestore condition wait. After we have OSDPing message fast dispatch, it could avoid osd down because of heartbeat timeout while ping message in the queue.

Actions

Copy link