Bug #42780
closed
recursive lock of OpTracker::lock (70)
Added by Jeff Layton over 4 years ago.
Updated about 4 years ago.
Description
I was testing 2 cephfs clients vs. a vstart cluster and the osd crashed.
ceph version 15.0.0-7148-g54b2868fc7d5 (54b2868fc7d5de5b7f554a7e57af478ec6bde93b) octopus (dev)
1: (()+0x2da4e9e) [0x5628243c8e9e]
2: (()+0x14b20) [0x7fa464ccdb20]
3: (gsignal()+0x145) [0x7fa464769625]
4: (abort()+0x12b) [0x7fa4647528d9]
5: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x37a) [0x562824481e62]
6: (lockdep_will_lock(char const*, int, bool, bool)+0x423) [0x56282452747f]
7: (ceph::mutex_debug_detail::mutex_debugging_base::_will_lock(bool)+0x4a) [0x562824531f0c]
8: (ceph::shared_mutex_debug::lock_shared()+0x30) [0x5628245e0eb6]
9: (std::shared_lock<ceph::shared_mutex_debug>::shared_lock(ceph::shared_mutex_debug&)+0x37) [0x562823a91d61]
10: (OpTracker::record_history_op(boost::intrusive_ptr<TrackedOp>&&)+0x3a) [0x562823f861d8]
11: (TrackedOp::put()+0x21c) [0x562823a5594a]
12: (intrusive_ptr_release(TrackedOp*)+0x18) [0x562823a55d96]
13: (boost::intrusive_ptr<TrackedOp>::~intrusive_ptr()+0x27) [0x562823a7841b]
14: (boost::intrusive_ptr<TrackedOp>::operator=(TrackedOp*)+0x56) [0x562823a93832]
15: (()+0x23fcaa1) [0x562823a20aa1]
16: (()+0x2422368) [0x562823a46368]
17: (std::function<bool (TrackedOp&)>::operator()(TrackedOp&) const+0x49) [0x562823f89a2d]
18: (OpTracker::visit_ops_in_flight(utime_t*, std::function<bool (TrackedOp&)>&&)+0x43d) [0x562823f866e3]
19: (OSD::get_health_metrics()+0x135) [0x562823a20c49]
20: (OSD::tick_without_osd_lock()+0x7fe) [0x562823a0ceac]
21: (OSD::C_Tick_WithoutOSDLock::finish(int)+0x1f) [0x562823a714ed]
22: (Context::complete(int)+0x27) [0x562823a4f977]
23: (SafeTimer::timer_thread()+0x38c) [0x562824465faa]
24: (SafeTimerThread::entry()+0x1c) [0x562824467594]
25: (Thread::entry_wrapper()+0x78) [0x56282445a158]
26: (Thread::_entry_func(void*)+0x18) [0x56282445a0d6]
27: (()+0x94e2) [0x7fa464cc24e2]
28: (clone()+0x43) [0x7fa46482e643]
I'll attach the log. This was a custom build based on 72c63fe228b5b, with some MDS bits on top (nothing that should affect the OSD).
Files
- Project changed from Ceph to RADOS
- Priority changed from Normal to High
- Status changed from New to 12
THe problem comes from OSD::get_health_metrics(), where the visitor lambda is holding the lock and also drops a reference:
-2> 2019-11-12T15:08:02.385-0500 7fa45e619700 0
recursive lock of OpTracker::lock (70)
ceph version 15.0.0-7148-g54b2868fc7d5 (54b2868fc7d5de5b7f554a7e57af478ec6bde93b) octopus (dev)
1: (ceph::mutex_debug_detail::mutex_debugging_base::_will_lock(bool)+0x4a) [0x562824531f0c]
2: (ceph::shared_mutex_debug::lock_shared()+0x30) [0x5628245e0eb6]
3: (std::shared_lock<ceph::shared_mutex_debug>::shared_lock(ceph::shared_mutex_debug&)+0x37) [0x562823a91d61]
4: (OpTracker::record_history_op(boost::intrusive_ptr<TrackedOp>&&)+0x3a) [0x562823f861d8]
5: (TrackedOp::put()+0x21c) [0x562823a5594a]
6: (intrusive_ptr_release(TrackedOp*)+0x18) [0x562823a55d96]
7: (boost::intrusive_ptr<TrackedOp>::~intrusive_ptr()+0x27) [0x562823a7841b]
8: (boost::intrusive_ptr<TrackedOp>::operator=(TrackedOp*)+0x56) [0x562823a93832]
9: (()+0x23fcaa1) [0x562823a20aa1]
10: (()+0x2422368) [0x562823a46368]
11: (std::function<bool (TrackedOp&)>::operator()(TrackedOp&) const+0x49) [0x562823f89a2d]
12: (OpTracker::visit_ops_in_flight(utime_t*, std::function<bool (TrackedOp&)>&&)+0x43d) [0x562823f866e3]
13: (OSD::get_health_metrics()+0x135) [0x562823a20c49]
14: (OSD::tick_without_osd_lock()+0x7fe) [0x562823a0ceac]
15: (OSD::C_Tick_WithoutOSDLock::finish(int)+0x1f) [0x562823a714ed]
16: (Context::complete(int)+0x27) [0x562823a4f977]
17: (SafeTimer::timer_thread()+0x38c) [0x562824465faa]
18: (SafeTimerThread::entry()+0x1c) [0x562824467594]
19: (Thread::entry_wrapper()+0x78) [0x56282445a158]
20: (Thread::_entry_func(void*)+0x18) [0x56282445a0d6]
21: (()+0x94e2) [0x7fa464cc24e2]
22: (clone()+0x43) [0x7fa46482e643]
(The reg gets one op assigned to it, and then later operator= drops the ref to assign a new value.)
I'm not sure what the right fix is here. We could have some kludge where the visitor sets a value saying this particular thread is visiting and exempt it from the ref tracking... but that feels very fragile. This whole subsystem feels very fragile.
- Assignee set to Radoslaw Zarzynski
I will be working on this bug after returning from PTO (ETA: 16 Dec 2019).
- Status changed from 12 to New
- Status changed from New to In Progress
- Status changed from In Progress to Fix Under Review
- Backport set to nautilus
- Status changed from Fix Under Review to Pending Backport
- Copied to Backport #43473: nautilus: recursive lock of OpTracker::lock (70) added
- Pull request ID set to 32364
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
Also available in: Atom
PDF