Bug #3937

closed

krbd: crash in rbd_assert(osd_req == obj_request->osd_req)

Added by Alex Elder over 11 years ago. Updated about 11 years ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Looking at a crash this morning in the new request code due
to this failed assertion in rbd_osd_req_callback():
rbd_assert(osd_req == obj_request->osd_req);

When an osd request is submitted we stash a pointer to the
object request structure in its r_priv field, and assign
rbd_osd_req_callback() as its r_callback function.

This assertion fires when an osd request completes with an
apparently valid r_priv pointer, but the object request it
refers to does not point back at that osd request.

This suggests either the osd request or the object
request was used after being freed, or something more
insidious is going on.
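
For readers unfamiliar with the two-way linkage, here is a minimal
user-space sketch of the invariant the assertion checks. The struct
layouts are cut down to the two pointers discussed above, and
link_requests() and back_pointer_ok() are hypothetical helper names
invented for illustration; they are not taken from the rbd source.

```c
#include <stdbool.h>
#include <stddef.h>

struct osd_request;

/* Cut-down stand-in for the rbd object request. */
struct obj_request {
    struct osd_request *osd_req;    /* forward pointer to the osd request */
};

/* Cut-down stand-in for the osd request. */
struct osd_request {
    void (*r_callback)(struct osd_request *);
    void *r_priv;                   /* back pointer to the object request */
};

/* Hypothetical helper: link the two structures at submit time. */
static void link_requests(struct obj_request *obj, struct osd_request *osd)
{
    obj->osd_req = osd;
    osd->r_priv = obj;
}

/*
 * Hypothetical helper: the invariant behind
 * rbd_assert(osd_req == obj_request->osd_req).
 * Note that it cannot detect a freed object request whose memory
 * happens to still hold the old pointer value.
 */
static bool back_pointer_ok(const struct osd_request *osd)
{
    const struct obj_request *obj = osd->r_priv;

    return obj != NULL && obj->osd_req == osd;
}
```

If the object request is freed and its memory reused, r_priv still
holds a non-NULL address, so the check typically fails only because
the reused memory no longer contains the matching osd_req value.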

Actions #1

Updated by Alex Elder over 11 years ago

Adding two things:
- this occurred during test 190 of the third consecutive pass
of xfstests with this in the teuthology yaml definition:

- workunit:
    clients:
      all:
        - rbd/map-unmap.sh
        - rbd/kernel.sh
        - misc/trivial_sync.sh
        - suites/blogbench.sh
        - suites/dbench.sh
        - suites/tiobench.sh
        - suites/fsstress.sh
        - kernel_untar_build.sh
- rbd.xfstests:
    client.0:
        count: 3

- At some point--possibly at the point this failure
occurred--a VPN server was restarted. I don't have
evidence that they're related but the failure and
the VPN restart happened within the same one hour
time window, so they could be.

Actions #2

Updated by Alex Elder over 11 years ago

I've decoded the osd request that was passed to
rbd_osd_req_callback(). Its contents look completely
legitimate. Its r_priv pointer, which should be a
(struct obj_request *), does not point to a valid object
request, however. This means the object request has
probably been freed.

This osd request is marked for linger. That means the
osd client will hang onto a copy of it until rbd
unregisters it.

So I think that may point out the problem, and it could
well have arisen because of the network blip.

(I'm not completely sure about this, but here's the
theory...)

I think that the rbd device needs to take an extra
reference to this osd linger request to make sure it
hangs around until it gets unregistered. Otherwise
as soon as the request marked for linger completes,
it will be destroyed, and if the osd client indicates
it completes again the r_priv pointer will be invalid.

The thing I'm not sure about is whether or when or why
the osd client would complete the same osd request
more than once.

I'm going to look at that a bit now.

Actions #3

Updated by Alex Elder over 11 years ago

I have confirmed that every time a request registered to linger
is re-submitted the osd client will call the callback function
for that request. This was not an issue with the previous
request code, because watch requests were synchronous (with
respect to the osd client) and thus had no osd client callback
function.
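
A rough user-space model of that behavior follows. It is only a
sketch: the real osd client re-sends lingering requests over the
network and completes them asynchronously, while here the
re-submission completes immediately. complete_request(),
resubmit_lingering(), count_completion() and the linger flag are
hypothetical names made up for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

struct osd_request {
    void (*r_callback)(struct osd_request *);
    void *r_priv;
    bool linger;    /* hypothetical flag: request is registered to linger */
};

/* Hypothetical stand-in for the osd client completing a request. */
static void complete_request(struct osd_request *req)
{
    if (req->r_callback)
        req->r_callback(req);
}

/*
 * Hypothetical stand-in for what happens after a connection reset:
 * every lingering request is re-submitted, and each re-submission
 * runs the callback again. Non-lingering requests are left alone.
 */
static void resubmit_lingering(struct osd_request *reqs[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (reqs[i]->linger)
            complete_request(reqs[i]);
}

/* Example callback: r_priv points at an int completion counter. */
static void count_completion(struct osd_request *req)
{
    int *count = req->r_priv;

    (*count)++;
}
```

In this model a lingering request sees its callback run once for the
original submission and once more for every re-submission, which is
the case a callback must be prepared for.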

The new code always has a callback, so we need to handle it.

The fix is to add a new reference to the object request every
time a CEPH_OSD_OP_WATCH request completes, except if the
request was issued to cancel the watch request. That way
the object request will stay around, but when the watch
request gets unregistered it will go away.
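
The reference counting described above can be sketched in user space
as follows. This is not the posted patch: the opcode enum, obj_get(),
obj_put(), on_complete() and the destroyed flag (used here in place
of actually freeing memory) are all hypothetical, and the sketch
assumes the completion path normally drops one reference.

```c
#include <stdbool.h>

/* Hypothetical opcodes; only the watch start/cancel split matters here. */
enum opcode { OP_WATCH_START, OP_WATCH_CANCEL, OP_OTHER };

struct obj_request {
    int refcount;
    enum opcode op;
    bool destroyed;     /* set instead of freeing, so it can be inspected */
};

static void obj_get(struct obj_request *req)
{
    req->refcount++;
}

static void obj_put(struct obj_request *req)
{
    if (--req->refcount == 0)
        req->destroyed = true;
}

/*
 * Runs every time the osd client reports completion. A request that
 * starts a watch takes an extra reference first, so the drop that
 * follows leaves the count unchanged and the object request survives
 * for the next completion. A cancel request (or any other request)
 * takes no extra reference and is dropped as usual.
 */
static void on_complete(struct obj_request *req)
{
    if (req->op == OP_WATCH_START)
        obj_get(req);
    obj_put(req);
}
```

With this scheme the watch's object request goes away only when the
owner drops its reference at unregister time, no matter how many
times the lingering request has completed in between.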

I have implemented this (complete with a big comment that
explains the situation) and now need to test it.

Actions #4

Updated by Alex Elder over 11 years ago

OK, with Josh's help I finally managed to reproduce the
problem intentionally to check my fix.

I'm building it now (along with a few other patches to
make the change easier to follow).

I guess I'll post it for review with a caveat that it
follows on the new request code.

Actions #5

Updated by Alex Elder over 11 years ago

  • Status changed from New to Fix Under Review

A patch resolving this has been posted for review.

[PATCH 4/4] rbd: don't drop watch requests on completion

Actions #6

Updated by Alex Elder about 11 years ago

I've opened a new issue that has symptoms similar to this
but not identical:
http://tracker.ceph.com/issues/3950

In the current case, the osd request seems fine, but its
r_priv pointer does not seem to point to a valid object
request. That suggests the object request has been freed.

In that (3950) case, the object request pointer seems to
be valid, but its osd_req pointer does not refer back to
the osd request.

This doesn't rule out that these are slightly different
symptoms of the same problem. (But I'm running the test
with what's supposed to be a fix in place for the present
problem...)

Actions #7

Updated by Alex Elder about 11 years ago

  • Status changed from Fix Under Review to 7

The patch is reviewed and ready to push to the testing
branch, and I will do that in a day or so.

I'm going to leave http://tracker.ceph.com/issues/3950
open for the time being, on the assumption it was due
to a different problem.

Actions #8

Updated by Alex Elder about 11 years ago

  • Status changed from 7 to Resolved

commit 8d93192992301f8c3a288c8cf4dc8598ac4b8427
Author: Alex Elder <>
Date: Fri Jan 25 17:08:55 2013 -0600

rbd: don't drop watch requests on completion