Bug #3858
closedosd_client: ceph_osdc_wait_request() seems wrong
0%
Description
The only error wait_for_completion_interruptible() will
return is ERESTARTSYS. So if that gets returned inside
ceph_osdc_wait_request() it indicates something got
interrupted (not "canceled/timed out").
If such an error occurs, the code cancels, unregisters,
and completes the request. I don't think that makes
sense. The caller, which has set up the request, should
be responsible for cleaning up that stuff if the wait
returns an error.
As it stands, in the rbd code the caller puts the request
reference it has (from having allocated it), and I suspect
this is an extra reference drop.
It was too much to just fix on the fly, so I'm documenting
it here.
Updated by Ian Colle about 11 years ago
- Project changed from Linux kernel client to rbd
Updated by Alex Elder almost 11 years ago
I honestly don't know when it happened, but I now find
that rbd is not susceptible to this problem. All rbd
requests now supply a callback function, and if a
request is to be synchronous, code within rbd initiates
the completion to signal that event at the appropriate
time.
That being said I'm still looking at the other users of
this interface to make sure they're all good.
(As an aside, I used to get a crash when I'd try to
interrupt blocked rbd tests; I haven't seen them for
some time and I'm fairly sure that's because this
bug is no longer an issue for rbd.)
Updated by Alex Elder almost 11 years ago
- Status changed from New to Resolved
tl;dr: This is no longer a bug, so marking it resolved.
First of all, based on how it's used, if an
error occurs within ceph_osdc_wait_request(),
it should clean up whatever state changes were
made as a result of the preceding call to
ceph_osdc_start_request().
The state changes in function are due to calls
to _register_request() and __send_queued().
Reversing what's done by __register_request()
is handled by __unregister_request(). And
_cancel_request() takes care of revoking the
request message from the messenger.
So if wait_for_completion_interruptible() fails
the block of code executed seems to be doing the
right thing.
There are three callers of ceph_osd_wait_request():
ceph_sync_write()
ceph_osdc_readpages()
ceph_osdc_writepages()
For all three, the calling function is the creator of
the osd request, and as such each has its own reference
to it. So the only thing that needs to be done after
return from ceph_osdc_wait_request() is to drop that
reference to the osd request.
All three of these callers do that, so it looks to
me like rbd was the only one with the problem.