Bug #15758


msgr/async: Messenger thread long time lock hold risk

Added by Haomai Wang almost 8 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Thread 46 (Thread 0x7fa7365a2700 (LWP 17842)):
#0 0x00007fa73b888705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000009e135d in FileStore::op_queue_reserve_throttle(FileStore::Op*, ThreadPool::TPHandle*) ()
#2 0x00000000009f16e5 in FileStore::queue_transactions(ObjectStore::Sequencer*, std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, std::tr1::shared_ptr<TrackedOp>, ThreadPool::TPHandle*) ()
#3 0x00000000006e9b19 in ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction*, Context*, Context*, Context*, std::tr1::shared_ptr<TrackedOp>, ThreadPool::TPHandle*) ()
#4 0x00000000006a4f42 in OSD::dispatch_context(PG::RecoveryCtx&, PG*, std::tr1::shared_ptr<OSDMap const>, ThreadPool::TPHandle*) ()
#5 0x00000000006c4833 in OSD::handle_pg_peering_evt(spg_t, pg_info_t const&, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > >&, unsigned int, pg_shard_t, bool, std::tr1::shared_ptr<PG::CephPeeringEvt>) ()
#6 0x00000000006c67b3 in OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>) ()
#7 0x00000000006c8cc0 in OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>) ()
#8 0x00000000006c9ede in OSD::_dispatch(Message*) ()
#9 0x00000000006ca5c7 in OSD::ms_dispatch(Message*) ()
#10 0x0000000000d5aa97 in Messenger::ms_deliver_dispatch(Message*) ()
#11 0x0000000000d5b111 in C_handle_dispatch::do_request(int) ()
#12 0x0000000000d109a5 in EventCenter::process_events(int) ()
#13 0x0000000000cece68 in Worker::entry() ()
#14 0x00007fa73b884df5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fa739c411ad in clone () from /lib64/libc.so.6
Thread 45 (Thread 0x7fa735da1700 (LWP 17843)):
#0 0x00007fa73b88af7d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fa73b886d32 in _L_lock_791 () from /lib64/libpthread.so.0
#2 0x00007fa73b886c38 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000bfdbe8 in Mutex::Lock(bool) ()
#4 0x00000000006ca3a5 in OSD::ms_dispatch(Message*) ()
#5 0x0000000000d5aa97 in Messenger::ms_deliver_dispatch(Message*) ()
#6 0x0000000000d5b111 in C_handle_dispatch::do_request(int) ()
#7 0x0000000000d109a5 in EventCenter::process_events(int) ()
#8 0x0000000000cece68 in Worker::entry() ()
#9 0x00007fa73b884df5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fa739c411ad in clone () from /lib64/libc.so.6

The main problem is osd_lock under heavy IO pressure: it can easily block the messenger thread from dispatching messages for seconds. If we create a big pool under heavy IO pressure, the OSD can easily flap down and back in because heartbeat messages are blocked behind the lock. This is more likely to show up with the async messenger than with the simple messenger.
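For illustration only, here is a minimal C++ sketch of the pattern behind the traces above. It is not Ceph's actual implementation; the names ms_dispatch_inline and DispatchQueue, and the lock/throttle details, are simplified assumptions. It shows why calling the dispatch callback inline from the messenger worker stalls everything on that worker (including heartbeats) whenever the callback blocks on a contended lock or a full throttle, and one common mitigation: hand the message to a dedicated dispatch thread so the messenger worker returns to its event loop immediately.

#include <mutex>
#include <condition_variable>
#include <queue>
#include <functional>
#include <utility>

// Hypothetical sketch: the async messenger worker runs an event loop and
// calls the dispatcher inline, so any blocking inside the dispatch path
// stalls the whole worker, including heartbeat traffic on the same thread.
std::mutex osd_lock;  // coarse OSD-wide lock, as in the trace above

void ms_dispatch_inline(const std::function<void()>& handle_message) {
    std::lock_guard<std::mutex> l(osd_lock);  // may wait for seconds under IO load
    handle_message();  // e.g. queue_transactions() blocking on a full throttle
}

// One possible mitigation (hedged, not necessarily the fix adopted upstream):
// enqueue the message and let a separate thread take osd_lock, so the
// messenger worker only pays the cost of an enqueue.
class DispatchQueue {
    std::queue<std::function<void()>> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void enqueue(std::function<void()> fn) {
        { std::lock_guard<std::mutex> l(m); q.push(std::move(fn)); }
        cv.notify_one();
    }
    void worker_loop() {  // runs in its own thread, not the messenger worker
        for (;;) {
            std::unique_lock<std::mutex> l(m);
            cv.wait(l, [this] { return !q.empty(); });
            auto fn = std::move(q.front());
            q.pop();
            l.unlock();
            fn();  // blocking on osd_lock here does not stall the event loop
        }
    }
};

In the inline pattern, the long wait in FileStore::op_queue_reserve_throttle() is taken on the messenger worker itself, which is why Thread 45 sits in __lll_lock_wait on osd_lock; with a separate dispatch thread, the worker never holds or waits on osd_lock while processing events.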


Related issues 1 (0 open, 1 closed)

Copied to Ceph - Backport #16377: jewel: msgr/async: Messenger thread long time lock hold risk (Resolved, Loïc Dachary)
