Project

General

Profile

Actions

Bug #3858

closed

osd_client: ceph_osdc_wait_request() seems wrong

Added by Alex Elder over 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The only error wait_for_completion_interruptible() will
return is ERESTARTSYS. So if that gets returned inside
ceph_osdc_wait_request() it indicates something got
interrupted (not "canceled/timed out").

If such an error occurs, the code cancels, unregisters,
and completes the request. I don't think that makes
sense. The caller, which has set up the request, should
be responsible for cleaning up that stuff if the wait
returns an error.

As it stands, in the rbd code the caller puts the request
reference it has (from having allocated it), and I suspect
this is an extra reference drop.

It was too much to just fix on the fly, so I'm documenting
it here.

Actions #1

Updated by Ian Colle about 11 years ago

  • Project changed from Linux kernel client to rbd
Actions #2

Updated by Alex Elder almost 11 years ago

  • Priority changed from Normal to High
Actions #3

Updated by Alex Elder almost 11 years ago

  • Assignee set to Alex Elder
Actions #4

Updated by Alex Elder almost 11 years ago

I honestly don't know when it happened, but I now find
that rbd is not susceptible to this problem. All rbd
requests now supply a callback function, and if a
request is to be synchronous, code within rbd initiates
the completion to signal that event at the appropriate
time.

That being said I'm still looking at the other users of
this interface to make sure they're all good.

(As an aside, I used to get a crash when I'd try to
interrupt blocked rbd tests; I haven't seen them for
some time and I'm fairly sure that's because this
bug is no longer an issue for rbd.)

Actions #5

Updated by Alex Elder almost 11 years ago

  • Status changed from New to Resolved

tl;dr: This is no longer a bug, so marking it resolved.

First of all, based on how it's used, if an
error occurs within ceph_osdc_wait_request(),
it should clean up whatever state changes were
made as a result of the preceding call to
ceph_osdc_start_request().

The state changes in function are due to calls
to _register_request() and __send_queued().
Reversing what's done by __register_request()
is handled by __unregister_request(). And
_cancel_request() takes care of revoking the
request message from the messenger.

So if wait_for_completion_interruptible() fails
the block of code executed seems to be doing the
right thing.

There are three callers of ceph_osd_wait_request():

ceph_sync_write()
ceph_osdc_readpages()
ceph_osdc_writepages()

For all three, the calling function is the creator of
the osd request, and as such each has its own reference
to it. So the only thing that needs to be done after
return from ceph_osdc_wait_request() is to drop that
reference to the osd request.

All three of these callers do that, so it looks to
me like rbd was the only one with the problem.

Actions

Also available in: Atom PDF