Bug #42780
closedrecursive lock of OpTracker::lock (70)
0%
Description
I was testing 2 cephfs clients vs. a vstart cluster and the osd crashed.
ceph version 15.0.0-7148-g54b2868fc7d5 (54b2868fc7d5de5b7f554a7e57af478ec6bde93b) octopus (dev) 1: (()+0x2da4e9e) [0x5628243c8e9e] 2: (()+0x14b20) [0x7fa464ccdb20] 3: (gsignal()+0x145) [0x7fa464769625] 4: (abort()+0x12b) [0x7fa4647528d9] 5: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x37a) [0x562824481e62] 6: (lockdep_will_lock(char const*, int, bool, bool)+0x423) [0x56282452747f] 7: (ceph::mutex_debug_detail::mutex_debugging_base::_will_lock(bool)+0x4a) [0x562824531f0c] 8: (ceph::shared_mutex_debug::lock_shared()+0x30) [0x5628245e0eb6] 9: (std::shared_lock<ceph::shared_mutex_debug>::shared_lock(ceph::shared_mutex_debug&)+0x37) [0x562823a91d61] 10: (OpTracker::record_history_op(boost::intrusive_ptr<TrackedOp>&&)+0x3a) [0x562823f861d8] 11: (TrackedOp::put()+0x21c) [0x562823a5594a] 12: (intrusive_ptr_release(TrackedOp*)+0x18) [0x562823a55d96] 13: (boost::intrusive_ptr<TrackedOp>::~intrusive_ptr()+0x27) [0x562823a7841b] 14: (boost::intrusive_ptr<TrackedOp>::operator=(TrackedOp*)+0x56) [0x562823a93832] 15: (()+0x23fcaa1) [0x562823a20aa1] 16: (()+0x2422368) [0x562823a46368] 17: (std::function<bool (TrackedOp&)>::operator()(TrackedOp&) const+0x49) [0x562823f89a2d] 18: (OpTracker::visit_ops_in_flight(utime_t*, std::function<bool (TrackedOp&)>&&)+0x43d) [0x562823f866e3] 19: (OSD::get_health_metrics()+0x135) [0x562823a20c49] 20: (OSD::tick_without_osd_lock()+0x7fe) [0x562823a0ceac] 21: (OSD::C_Tick_WithoutOSDLock::finish(int)+0x1f) [0x562823a714ed] 22: (Context::complete(int)+0x27) [0x562823a4f977] 23: (SafeTimer::timer_thread()+0x38c) [0x562824465faa] 24: (SafeTimerThread::entry()+0x1c) [0x562824467594] 25: (Thread::entry_wrapper()+0x78) [0x56282445a158] 26: (Thread::_entry_func(void*)+0x18) [0x56282445a0d6] 27: (()+0x94e2) [0x7fa464cc24e2] 28: (clone()+0x43) [0x7fa46482e643]
I'll attach the log. This was a custom build based on 72c63fe228b5b, with some MDS bits on top (nothing that should affect the OSD).
Files
Updated by Sage Weil over 4 years ago
- Status changed from New to 12
THe problem comes from OSD::get_health_metrics(), where the visitor lambda is holding the lock and also drops a reference:
-2> 2019-11-12T15:08:02.385-0500 7fa45e619700 0 recursive lock of OpTracker::lock (70) ceph version 15.0.0-7148-g54b2868fc7d5 (54b2868fc7d5de5b7f554a7e57af478ec6bde93b) octopus (dev) 1: (ceph::mutex_debug_detail::mutex_debugging_base::_will_lock(bool)+0x4a) [0x562824531f0c] 2: (ceph::shared_mutex_debug::lock_shared()+0x30) [0x5628245e0eb6] 3: (std::shared_lock<ceph::shared_mutex_debug>::shared_lock(ceph::shared_mutex_debug&)+0x37) [0x562823a91d61] 4: (OpTracker::record_history_op(boost::intrusive_ptr<TrackedOp>&&)+0x3a) [0x562823f861d8] 5: (TrackedOp::put()+0x21c) [0x562823a5594a] 6: (intrusive_ptr_release(TrackedOp*)+0x18) [0x562823a55d96] 7: (boost::intrusive_ptr<TrackedOp>::~intrusive_ptr()+0x27) [0x562823a7841b] 8: (boost::intrusive_ptr<TrackedOp>::operator=(TrackedOp*)+0x56) [0x562823a93832] 9: (()+0x23fcaa1) [0x562823a20aa1] 10: (()+0x2422368) [0x562823a46368] 11: (std::function<bool (TrackedOp&)>::operator()(TrackedOp&) const+0x49) [0x562823f89a2d] 12: (OpTracker::visit_ops_in_flight(utime_t*, std::function<bool (TrackedOp&)>&&)+0x43d) [0x562823f866e3] 13: (OSD::get_health_metrics()+0x135) [0x562823a20c49] 14: (OSD::tick_without_osd_lock()+0x7fe) [0x562823a0ceac] 15: (OSD::C_Tick_WithoutOSDLock::finish(int)+0x1f) [0x562823a714ed] 16: (Context::complete(int)+0x27) [0x562823a4f977] 17: (SafeTimer::timer_thread()+0x38c) [0x562824465faa] 18: (SafeTimerThread::entry()+0x1c) [0x562824467594] 19: (Thread::entry_wrapper()+0x78) [0x56282445a158] 20: (Thread::_entry_func(void*)+0x18) [0x56282445a0d6] 21: (()+0x94e2) [0x7fa464cc24e2] 22: (clone()+0x43) [0x7fa46482e643]
(The reg gets one op assigned to it, and then later operator= drops the ref to assign a new value.)
I'm not sure what the right fix is here. We could have some kludge where the visitor sets a value saying this particular thread is visiting and exempt it from the ref tracking... but that feels very fragile. This whole subsystem feels very fragile.
Updated by Radoslaw Zarzynski over 4 years ago
- Assignee set to Radoslaw Zarzynski
Updated by Radoslaw Zarzynski over 4 years ago
I will be working on this bug after returning from PTO (ETA: 16 Dec 2019).
Updated by Radoslaw Zarzynski over 4 years ago
- Status changed from New to In Progress
Updated by Radoslaw Zarzynski over 4 years ago
- Status changed from In Progress to Fix Under Review
- Backport set to nautilus
Updated by Kefu Chai over 4 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #43473: nautilus: recursive lock of OpTracker::lock (70) added
Updated by Nathan Cutler about 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".