Bug #15758

closed

msgr/async: Messenger thread long time lock hold risk

Added by Haomai Wang almost 8 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
other
Tags:
Backport:
jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Thread 46 (Thread 0x7fa7365a2700 (LWP 17842)):
#0 0x00007fa73b888705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000009e135d in FileStore::op_queue_reserve_throttle(FileStore::Op*, ThreadPool::TPHandle*) ()
#2 0x00000000009f16e5 in FileStore::queue_transactions(ObjectStore::Sequencer*, std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, std::tr1::shared_ptr<TrackedOp>, ThreadPool::TPHandle*) ()
#3 0x00000000006e9b19 in ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction*, Context*, Context*, Context*, std::tr1::shared_ptr<TrackedOp>, ThreadPool::TPHandle*) ()
#4 0x00000000006a4f42 in OSD::dispatch_context(PG::RecoveryCtx&, PG*, std::tr1::shared_ptr<OSDMap const>, ThreadPool::TPHandle*) ()
#5 0x00000000006c4833 in OSD::handle_pg_peering_evt(spg_t, pg_info_t const&, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > >&, unsigned int, pg_shard_t, bool, std::tr1::shared_ptr<PG::CephPeeringEvt>) ()
#6 0x00000000006c67b3 in OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>) ()
#7 0x00000000006c8cc0 in OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>) ()
#8 0x00000000006c9ede in OSD::_dispatch(Message*) ()
#9 0x00000000006ca5c7 in OSD::ms_dispatch(Message*) ()
#10 0x0000000000d5aa97 in Messenger::ms_deliver_dispatch(Message*) ()
#11 0x0000000000d5b111 in C_handle_dispatch::do_request(int) ()
#12 0x0000000000d109a5 in EventCenter::process_events(int) ()
#13 0x0000000000cece68 in Worker::entry() ()
#14 0x00007fa73b884df5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fa739c411ad in clone () from /lib64/libc.so.6
Thread 45 (Thread 0x7fa735da1700 (LWP 17843)):
#0 0x00007fa73b88af7d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fa73b886d32 in _L_lock_791 () from /lib64/libpthread.so.0
#2 0x00007fa73b886c38 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000bfdbe8 in Mutex::Lock(bool) ()
#4 0x00000000006ca3a5 in OSD::ms_dispatch(Message*) ()
#5 0x0000000000d5aa97 in Messenger::ms_deliver_dispatch(Message*) ()
#6 0x0000000000d5b111 in C_handle_dispatch::do_request(int) ()
#7 0x0000000000d109a5 in EventCenter::process_events(int) ()
#8 0x0000000000cece68 in Worker::entry() ()
#9 0x00007fa73b884df5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fa739c411ad in clone () from /lib64/libc.so.6

The osd_lock is the main problem under heavy IO pressure: it can block the messenger thread from dispatching messages for seconds. If we create a big pool while the cluster is under heavy IO load, it is easy for OSDs to be marked down and then back in because heartbeat messages are blocked. The problem shows up more readily with the async messenger than with the simple messenger.
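
The call chain in the backtrace can be condensed to the following sketch. This is a minimal illustration with simplified, hypothetical names (worker_dispatch, the throttle globals), not the actual Ceph code; it only shows why one slow, non-fast-dispatch message can stall an async messenger worker for seconds:

// Minimal sketch of the blocking pattern in the backtrace
// (names simplified and hypothetical; not the actual Ceph implementation).
#include <condition_variable>
#include <mutex>

std::mutex osd_lock;                  // coarse OSD-wide lock (Thread 45 waits on it)
std::mutex throttle_lock;
std::condition_variable throttle_cv;
int queued_ops = 0;
const int max_queued_ops = 50;        // stand-in for the FileStore op queue limit

// Runs inside an AsyncMessenger worker thread (EventCenter::process_events).
void worker_dispatch(/* Message* m */) {
  std::lock_guard<std::mutex> g(osd_lock);   // OSD::ms_dispatch takes osd_lock
  // Non-fast-dispatch path: a peering event ends up queuing a FileStore
  // transaction, which waits on the op queue throttle.
  std::unique_lock<std::mutex> tl(throttle_lock);
  throttle_cv.wait(tl, [] { return queued_ops < max_queued_ops; });
  // While this thread waits (seconds under heavy IO), every connection served
  // by this worker is stalled, and any other worker that needs osd_lock
  // (Thread 45 above) is stalled too.
}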


Related issues 1 (0 open, 1 closed)

Copied to Ceph - Backport #16377: jewel: msgr/async: Messenger thread long time lock hold risk (Resolved, Loïc Dachary)
#1

Updated by Haomai Wang almost 8 years ago

The lock hold time depends on the ObjectStore processing time, anywhere from 1s to 5s or even longer. It easily causes IO jitter when a non-fast-dispatch message comes in.

#2

Updated by Haomai Wang almost 8 years ago

It looks like the async messenger needs an extra DispatchQueue to handle non-fast-dispatch messages, which may otherwise block a worker thread completely.
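
A rough sketch of that idea follows: the messenger worker only enqueues non-fast messages and returns immediately, and a dedicated dispatch thread takes the slow path. This is an illustration under assumed names (DispatchQueue here is a simplified class, slow_dispatch stands in for OSD::ms_dispatch), not the actual interface:

// Illustrative sketch: offload non-fast-dispatch messages to a separate
// thread so the event-loop worker never blocks on osd_lock / FileStore.
#include <condition_variable>
#include <deque>
#include <mutex>

struct Message {};

// Stand-in for the slow path (e.g. OSD::ms_dispatch); may block for seconds.
void slow_dispatch(Message*) {}

class DispatchQueue {
  std::deque<Message*> q;
  std::mutex lock;
  std::condition_variable cond;
public:
  // Called from the messenger worker thread: O(1), never touches the OSD.
  void enqueue(Message* m) {
    { std::lock_guard<std::mutex> g(lock); q.push_back(m); }
    cond.notify_one();
  }
  // Runs in its own thread; only this thread can get stuck in FileStore waits.
  void entry() {
    for (;;) {
      Message* m;
      {
        std::unique_lock<std::mutex> g(lock);
        cond.wait(g, [this] { return !q.empty(); });
        m = q.front();
        q.pop_front();
      }
      slow_dispatch(m);
    }
  }
};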

#3

Updated by Haomai Wang almost 8 years ago

  • Subject changed from Messenger thread long time lock hold risk to msgr/async: Messenger thread long time lock hold risk
  • Category set to msgr
#4

Updated by Greg Farnum almost 8 years ago

Is this something that newly blocks for a long time?

Or is the problem that AsyncMessenger doesn't have per-connection threads and so all the other message processing gets blocked up? (And so this has been a problem for a long time, but SimpleMessenger masked it.)

#5

Updated by Haomai Wang almost 8 years ago

Yes, the async messenger exposes the problem more easily than the simple messenger. Actually, this PR (https://github.com/ceph/ceph/pull/8808) helps a lot. If the FileStore is busy syncing, a non-fast-dispatch message such as pg_log (from a create-pool action) will hold osd_lock and get stuck in the FileStore condition wait. Now that OSDPing messages are fast-dispatched, the OSD can avoid being marked down due to heartbeat timeout while a ping message sits in the queue.
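
For reference, the split that behavior relies on looks roughly like this. The real Dispatcher hooks are ms_can_fast_dispatch/ms_fast_dispatch; the free functions and message types below are simplified stand-ins. Heartbeat-style messages are handled directly on the messenger thread without osd_lock, while everything else takes the slow path:

// Sketch of the fast-dispatch split (simplified, not the actual Ceph code).
struct Message { int type; };
enum { MSG_OSD_PING = 1, MSG_OSD_PG_LOG = 2 };

// Heartbeats take the lock-free path; bounded work, no osd_lock.
bool can_fast_dispatch(const Message* m) { return m->type == MSG_OSD_PING; }

void fast_dispatch(Message*) { /* reply to the ping immediately */ }
void queue_for_slow_dispatch(Message*) { /* may wait on osd_lock / FileStore */ }

void deliver(Message* m) {
  if (can_fast_dispatch(m))
    fast_dispatch(m);            // runs inline in the messenger worker thread
  else
    queue_for_slow_dispatch(m);  // handled elsewhere; heartbeats keep flowing
}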

#6

Updated by Sage Weil almost 8 years ago

  • Status changed from New to Pending Backport
  • Backport set to jewel

let's take our time backporting this... it should bake in master for a while first!

#7

Updated by Nathan Cutler almost 8 years ago

  • Copied to Backport #16377: jewel: msgr/async: Messenger thread long time lock hold risk added
#8

Updated by Loïc Dachary over 7 years ago

  • Status changed from Pending Backport to Resolved
#9

Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to Messengers
  • Category deleted (msgr)