Bug #42780

recursive lock of OpTracker::lock (70)

Added by Jeff Layton over 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
High
Category:
-
Target version:
-
% Done:
0%

Backport:
nautilus
Regression:
No
Severity:
3 - minor

Description

I was testing two cephfs clients against a vstart cluster when the OSD crashed.

ceph version 15.0.0-7148-g54b2868fc7d5 (54b2868fc7d5de5b7f554a7e57af478ec6bde93b) octopus (dev)
 1: (()+0x2da4e9e) [0x5628243c8e9e]
 2: (()+0x14b20) [0x7fa464ccdb20]
 3: (gsignal()+0x145) [0x7fa464769625]
 4: (abort()+0x12b) [0x7fa4647528d9]
 5: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x37a) [0x562824481e62]
 6: (lockdep_will_lock(char const*, int, bool, bool)+0x423) [0x56282452747f]
 7: (ceph::mutex_debug_detail::mutex_debugging_base::_will_lock(bool)+0x4a) [0x562824531f0c]
 8: (ceph::shared_mutex_debug::lock_shared()+0x30) [0x5628245e0eb6]
 9: (std::shared_lock<ceph::shared_mutex_debug>::shared_lock(ceph::shared_mutex_debug&)+0x37) [0x562823a91d61]
 10: (OpTracker::record_history_op(boost::intrusive_ptr<TrackedOp>&&)+0x3a) [0x562823f861d8]
 11: (TrackedOp::put()+0x21c) [0x562823a5594a]
 12: (intrusive_ptr_release(TrackedOp*)+0x18) [0x562823a55d96]
 13: (boost::intrusive_ptr<TrackedOp>::~intrusive_ptr()+0x27) [0x562823a7841b]
 14: (boost::intrusive_ptr<TrackedOp>::operator=(TrackedOp*)+0x56) [0x562823a93832]
 15: (()+0x23fcaa1) [0x562823a20aa1]
 16: (()+0x2422368) [0x562823a46368]
 17: (std::function<bool (TrackedOp&)>::operator()(TrackedOp&) const+0x49) [0x562823f89a2d]
 18: (OpTracker::visit_ops_in_flight(utime_t*, std::function<bool (TrackedOp&)>&&)+0x43d) [0x562823f866e3]
 19: (OSD::get_health_metrics()+0x135) [0x562823a20c49]
 20: (OSD::tick_without_osd_lock()+0x7fe) [0x562823a0ceac]
 21: (OSD::C_Tick_WithoutOSDLock::finish(int)+0x1f) [0x562823a714ed]
 22: (Context::complete(int)+0x27) [0x562823a4f977]
 23: (SafeTimer::timer_thread()+0x38c) [0x562824465faa]
 24: (SafeTimerThread::entry()+0x1c) [0x562824467594]
 25: (Thread::entry_wrapper()+0x78) [0x56282445a158]
 26: (Thread::_entry_func(void*)+0x18) [0x56282445a0d6]
 27: (()+0x94e2) [0x7fa464cc24e2]
 28: (clone()+0x43) [0x7fa46482e643]

I'll attach the log. This was a custom build based on 72c63fe228b5b, with some MDS bits on top (nothing that should affect the OSD).


Files

osd.0.log.xz (100 KB), Jeff Layton, 11/12/2019 08:22 PM

Related issues: 1 (0 open, 1 closed)

Copied to RADOS - Backport #43473: nautilus: recursive lock of OpTracker::lock (70) (Resolved, Nathan Cutler)
Actions #1

Updated by Neha Ojha over 4 years ago

  • Project changed from Ceph to RADOS
Actions #2

Updated by Neha Ojha over 4 years ago

  • Priority changed from Normal to High
Actions #3

Updated by Sage Weil over 4 years ago

  • Status changed from New to 12

The problem comes from OSD::get_health_metrics(), where the visitor lambda holds the lock and also drops a reference:

    -2> 2019-11-12T15:08:02.385-0500 7fa45e619700  0 
recursive lock of OpTracker::lock (70)
 ceph version 15.0.0-7148-g54b2868fc7d5 (54b2868fc7d5de5b7f554a7e57af478ec6bde93b) octopus (dev)
 1: (ceph::mutex_debug_detail::mutex_debugging_base::_will_lock(bool)+0x4a) [0x562824531f0c]
 2: (ceph::shared_mutex_debug::lock_shared()+0x30) [0x5628245e0eb6]
 3: (std::shared_lock<ceph::shared_mutex_debug>::shared_lock(ceph::shared_mutex_debug&)+0x37) [0x562823a91d61]
 4: (OpTracker::record_history_op(boost::intrusive_ptr<TrackedOp>&&)+0x3a) [0x562823f861d8]
 5: (TrackedOp::put()+0x21c) [0x562823a5594a]
 6: (intrusive_ptr_release(TrackedOp*)+0x18) [0x562823a55d96]
 7: (boost::intrusive_ptr<TrackedOp>::~intrusive_ptr()+0x27) [0x562823a7841b]
 8: (boost::intrusive_ptr<TrackedOp>::operator=(TrackedOp*)+0x56) [0x562823a93832]
 9: (()+0x23fcaa1) [0x562823a20aa1]
 10: (()+0x2422368) [0x562823a46368]
 11: (std::function<bool (TrackedOp&)>::operator()(TrackedOp&) const+0x49) [0x562823f89a2d]
 12: (OpTracker::visit_ops_in_flight(utime_t*, std::function<bool (TrackedOp&)>&&)+0x43d) [0x562823f866e3]
 13: (OSD::get_health_metrics()+0x135) [0x562823a20c49]
 14: (OSD::tick_without_osd_lock()+0x7fe) [0x562823a0ceac]
 15: (OSD::C_Tick_WithoutOSDLock::finish(int)+0x1f) [0x562823a714ed]
 16: (Context::complete(int)+0x27) [0x562823a4f977]
 17: (SafeTimer::timer_thread()+0x38c) [0x562824465faa]
 18: (SafeTimerThread::entry()+0x1c) [0x562824467594]
 19: (Thread::entry_wrapper()+0x78) [0x56282445a158]
 20: (Thread::_entry_func(void*)+0x18) [0x56282445a0d6]
 21: (()+0x94e2) [0x7fa464cc24e2]
 22: (clone()+0x43) [0x7fa46482e643]

(The intrusive_ptr gets one op assigned to it, and then operator= drops that ref when a new value is assigned.)

I'm not sure what the right fix is here. We could have some kludge where the visitor sets a value saying this particular thread is visiting and exempt it from the ref tracking... but that feels very fragile. This whole subsystem feels very fragile.

Actions #4

Updated by Radoslaw Zarzynski over 4 years ago

  • Assignee set to Radoslaw Zarzynski
Actions #5

Updated by Radoslaw Zarzynski over 4 years ago

I will be working on this bug after returning from PTO (ETA: 16 Dec 2019).

Actions #6

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New
Actions #7

Updated by Radoslaw Zarzynski over 4 years ago

  • Status changed from New to In Progress
Actions #8

Updated by Radoslaw Zarzynski over 4 years ago

  • Status changed from In Progress to Fix Under Review
  • Backport set to nautilus
Actions #9

Updated by Kefu Chai over 4 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #10

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #43473: nautilus: recursive lock of OpTracker::lock (70) added
Actions #11

Updated by Nathan Cutler about 4 years ago

  • Pull request ID set to 32364
Actions #12

Updated by Nathan Cutler about 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
