Bug #22882

Objecter deadlocked on op budget while holding rwlock in ms_handle_reset()

Added by Jason Dillaman almost 5 years ago. Updated over 4 years ago.

Target version:
% Done:


3 - minor
Affected Versions:
Pull request ID:
Crash signature (v1):
Crash signature (v2):


Thread 50 (Thread 0x7fbd75ff3700 (LWP 26238)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fbdc5fe08ad in Cond::Wait (mutex=..., this=0x7fbd64041ea0) at /build/ceph-12.2.2-558-g5445703/src/common/Cond.h:48
#2  Throttle::_wait (this=this@entry=0x55683eaa6250, c=c@entry=1) at /build/ceph-12.2.2-558-g5445703/src/common/
#3  0x00007fbdc5fe16d1 in Throttle::get (this=this@entry=0x55683eaa6250, c=c@entry=1, m=m@entry=0) at /build/ceph-12.2.2-558-g5445703/src/common/
#4  0x00007fbdceb2f977 in Objecter::_throttle_op (this=this@entry=0x55683eaa5c60, op=op@entry=0x7fbd6413e090, sul=..., op_budget=op_budget@entry=0)
    at /build/ceph-12.2.2-558-g5445703/src/osdc/
---Type <return> to continue, or q <return> to quit---
#5  0x00007fbdceb4c7d8 in Objecter::_take_op_budget (sul=..., op=0x7fbd6413e090, this=0x55683eaa5c60) at /build/ceph-12.2.2-558-g5445703/src/osdc/Objecter.h:1980
#6  Objecter::_op_submit_with_budget (this=this@entry=0x55683eaa5c60, op=op@entry=0x7fbd6413e090, sul=..., ptid=ptid@entry=0x7fbd5c053dc0, ctx_budget=ctx_budget@entry=0x0)
    at /build/ceph-12.2.2-558-g5445703/src/osdc/
#7  0x00007fbdceb4f444 in Objecter::_send_linger (this=this@entry=0x55683eaa5c60, info=info@entry=0x7fbd5c0539d0, sul=...) at /build/ceph-12.2.2-558-g5445703/src/osdc/
#8  0x00007fbdceb4fdff in Objecter::_linger_ops_resend (this=this@entry=0x55683eaa5c60, lresend=std::map with 1 elements = {...}, ul=...)
    at /build/ceph-12.2.2-558-g5445703/src/osdc/
#9  0x00007fbdceb5047d in Objecter::ms_handle_reset (this=0x55683eaa5c60, con=<optimized out>) at /build/ceph-12.2.2-558-g5445703/src/osdc/
#10 0x00007fbdc6061c6d in Messenger::ms_deliver_handle_reset (con=0x7fbd6401d3f0, this=0x55683ea1a0f0) at /build/ceph-12.2.2-558-g5445703/src/msg/Messenger.h:741
#11 DispatchQueue::entry (this=0x55683ea1a270) at /build/ceph-12.2.2-558-g5445703/src/msg/
#12 0x00007fbdc615907d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /build/ceph-12.2.2-558-g5445703/src/msg/DispatchQueue.h:101
#13 0x00007fbdd93f16ba in start_thread (arg=0x7fbd75ff3700) at pthread_create.c:333
#14 0x00007fbdd91273dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:74
#15 0x0000000000000000 in ?? ()

The objecter in this case did not have 1024 in-flight ops, so it does appear that there is potentially a leak in tracking the op budget (potentially when socket failures are introduced).


#2 Updated by Greg Farnum almost 5 years ago

It's not quite that simple; ops on a failed OSD or closed session get moved into the homeless_session and at a quick skim that is at least doing work.

Do you have any other context about what's happened here, Jason? We do lots and lots of session killing and message pass pausing and things so I suspect there must be a confounding factor here.

#3 Updated by Jason Dillaman almost 5 years ago

When I saw the test running for 4 hours my first thought was that the cluster was unhealthy -- but all OSDs were up and "ceph health" reported OK. After installing debug packages in the environment, I attached and noticed that the dispatch thread was fully deadlocked and making no progress due to that call chain (probably shouldn't ever block the dispatch thread but instead handle re-registering linger ops on a separate thread).

#4 Updated by Greg Farnum almost 5 years ago

  • Assignee set to Greg Farnum

#5 Updated by Greg Farnum almost 5 years ago

  • Subject changed from Objecter deadlocked waiting for op budget after socket failure to Objecter deadlocked on op budget while holding rwlock in ms_handle_reset()

Okay, so presumably on resend you shouldn't need to grab op budget again, since it's already budgeted, right?

And indeed there is logic in _send_linger() that tried to avoid doing that, but it's checking whether the LingerOp::register_tid field is set, and if not it takes budget again.

Meanwhile, in _take_op_budget() it marks the Op::budgeted bool when budget is taken. But that's only read in Objecter::_finish_op() (not even in _take_op_budget!?!?)

Looking at how LingerOp::register_tid gets set, it's only used in _send_linger(), and that's when passing it by pointer to the op_submit functions (as their ptid), and nothing clears it out.

So setting register_tid is separated from taking the budget by "OSDSession::unique_lock sl(s->lock);", which I suppose blocks when the session is getting failed. But there's not a check once it is grabbed that the state is still valid, so it's going to go ahead and do the _session_op_assign() and _send_op() in that thread. Meanwhile the reset thread is invoking _cancel_linger_op (and thus _finish_op) via _kick_requests(). That...might not conflict? But it sure is ugly.

...oh, that also means that the dispatch loop thread definitely has zapped the LingerOp's budget, because it happened in _finish_op() before we got to actually resending the LingerOp.

So that whole sequence is deeply unpleasant, but doesn't appear to leak anything to me. Which means that if we're blocking, it's because something else is holding budget that it can't drop. This is probably because we're holding the global rwlock?

This is a long way of saying, yeah, we shouldn't do anything that can block in the dispatch thread. Especially since it can block while holding a global write lock! But I wish I could understand how we'd gotten that budget stolen away from us.

#6 Updated by Greg Farnum almost 5 years ago

Jason, how did you establish the number of in-flight ops? I wonder if maybe it did have them but they weren't able to register for output with the rwlock being held by the dispatch thread.

#7 Updated by Greg Farnum almost 5 years ago

Oh, and I had the LingerOp and Op conflated in my head a bit when looking at that before, but they are different.

Oh, and of course any linger ops we have registered to a session might not actually have budget yet when they got killed! So just trying to keep the budget across failed incarnations using the ctx_budget mechanism wouldn't do us any good. That's unpleasant; I'm not sure if we have a convenient thread available for these. Maybe we can change things so we grab budget before we associate a LingerOp with a Session so that it gets grabbed up in _kick_requests()?

But the locking here makes that really gross. I'll have to think about this and probably talk about it a bit.

#8 Updated by Greg Farnum almost 5 years ago

  • Status changed from New to In Progress

I finally realized that the op throttler does drop the global rwlock while waiting for throttle, so it at least doesn't immediately deadlock that way (which I thought it could). Which again narrows the window for how this problem happened.

Right now I'm pulling the op budget up to the top level for LingerOps using the "context budget" mechanism that got added for listing. That'll definitely help with things, at the cost of making every Watch take up a permanent op budget slot. That makes sense to me though since they're still doing stuff like sending out ping messages and such...

#9 Updated by Greg Farnum almost 5 years ago

  • Status changed from In Progress to Fix Under Review

#10 Updated by Greg Farnum over 4 years ago

  • Status changed from Fix Under Review to Resolved

Whoops, this merged way back then with a slightly different plan than discussed here (see PR discussion).

Also available in: Atom PDF