Bug #111

closed

handle EAGAIN from osd

Added by Sage Weil almost 14 years ago. Updated over 13 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version:
% Done: 0%

Description

Currently the client just returns EAGAIN to the caller, when it should retry the request.

Not that the OSD returns this very often (ever?).
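
The retry behaviour asked for here could look roughly like the sketch below. It is illustrative only and is not the actual client code: `submit_with_retry`, `submit_op`, and `max_attempts` are hypothetical names, and the bounded attempt count is an assumption that the ticket does not specify.

```cpp
#include <cerrno>
#include <functional>

// Hypothetical helper: resubmit an operation while it reports -EAGAIN,
// up to max_attempts tries. Any other result is returned immediately.
int submit_with_retry(const std::function<int()>& submit_op, int max_attempts) {
  int r = -EAGAIN;
  for (int attempt = 0; attempt < max_attempts && r == -EAGAIN; ++attempt) {
    r = submit_op();
  }
  return r;  // still -EAGAIN only if every attempt was refused
}
```

A real client would presumably wait or back off between attempts rather than resubmit immediately; that is left out to keep the sketch short.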

Actions #1

Updated by Sage Weil almost 14 years ago

  • Target version changed from v2.6.35 to v2.6.36

Actions #2

Updated by Yehuda Sadeh almost 14 years ago

We should make the client handle it, but we should also try to make sure that the osd doesn't ever return it (at least shouldn't propagate EAGAIN returned from a syscall).

Actions #3

Updated by Sage Weil almost 14 years ago

Yehuda Sadeh wrote:

We should make the client handle it, but we should also try to make sure that the osd doesn't ever return it (at least shouldn't propagate EAGAIN returned from a syscall).

Currently it looks like the only place that happens is

ReplicatedPG.cc:2406:    return -EAGAIN;

which could be changed to put the request on the waiting_for_object queue instead.

I wonder, though, if that general approach will be problematic in the long run when we try to fix the unbounded memory usage. It might be better to queue up a "reply with -EAGAIN when we're ready to handle this" type of thing to avoid keeping lots of pending writes sitting around. (Of course, if the client has the osd timeout enabled, that will just mean a periodic reconnect and resend, too.)
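
A toy sketch of the first option, parking the request on a waiting_for_object queue instead of returning -EAGAIN, might look like this. Only the name `waiting_for_object` comes from the discussion above. `ToyPG`, `Op`, `available`, `completed`, and `object_ready` are hypothetical stand-ins, not the real ReplicatedPG interface.

```cpp
#include <deque>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical request: an id plus the name of the object it targets.
struct Op { int id; std::string oid; };

// Toy stand-in for a placement group; not the real ReplicatedPG.
struct ToyPG {
  std::set<std::string> available;                           // objects ready to serve
  std::map<std::string, std::deque<Op>> waiting_for_object;  // parked ops per object
  std::vector<int> completed;                                // ids of handled ops, in order

  void do_op(const Op& op) {
    if (!available.count(op.oid)) {
      waiting_for_object[op.oid].push_back(op);  // park instead of returning -EAGAIN
      return;
    }
    completed.push_back(op.id);
  }

  // Called once the object can be served: replay parked ops in arrival order.
  void object_ready(const std::string& oid) {
    available.insert(oid);
    auto it = waiting_for_object.find(oid);
    if (it == waiting_for_object.end()) return;
    std::deque<Op> ops = std::move(it->second);
    waiting_for_object.erase(it);
    for (const Op& op : ops) do_op(op);
  }
};
```

Every parked `Op` stays in memory until its object is ready, which is the unbounded-memory concern raised above. The alternative of replying with -EAGAIN later would store only a small marker per request instead of the full pending write.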

Actions #4

Updated by Yehuda Sadeh almost 14 years ago

I agree, though we should differentiate between two cases. One is that we initiate the EAGAIN ourselves (e.g., when we reach a limit); the other is that we inherit it from a syscall we made, although I'm not sure that's actually possible, as we don't really do async I/O in that context. In the former case we should just make sure that the EAGAIN is worth the RTT to the client. We need to avoid the latter.

Actions #5

Updated by Greg Farnum almost 14 years ago

  • Status changed from New to Resolved

Looks to me like this can't actually happen. The function ReplicatedPG::find_object_context can return EAGAIN, and any find_object_context return codes are immediately sent to the client in ReplicatedPG::do_op, but as best I can see all the paths that lead to do_op go through OSD::handle_op, which makes sure the needed objects exist (for find_object_context to not return EAGAIN).
I threw an assert(r != -EAGAIN) in there to catch it if it does happen, though, in which case we can revisit this if needed.
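
The shape of that guard, as a minimal sketch: `find_object_context_stub` and `do_op_stub` are hypothetical stand-ins for the real functions, whose signatures are not shown in this ticket.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical stand-in: reports -EAGAIN when the needed object is not there yet.
int find_object_context_stub(bool object_exists) {
  return object_exists ? 0 : -EAGAIN;
}

// Hypothetical stand-in for the op path: callers are expected to have made sure
// the object exists, so -EAGAIN here means that assumption was violated.
int do_op_stub(bool object_exists) {
  int r = find_object_context_stub(object_exists);
  assert(r != -EAGAIN);  // catch the "can't happen" case loudly instead of replying
  return r;              // any other code would be sent on to the client
}
```

Note that `assert` compiles to nothing when `NDEBUG` is defined, so a guard like this only catches the case in builds that keep assertions enabled.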
