Bug #5146

krbd: wait for safe callback for writes

Added by Alex Elder about 7 years ago. Updated about 7 years ago.


3 - minor

Right now rbd only waits for the acknowledgement callback
for all osd requests. This means that an rbd client may
have assumed written data is durable when it is not (in
the event the acknowledgement got sent just before an
osd disappeared).

Change rbd to wait for the safe callback for all write
requests. For convenience, change the osd client to call
the safe callback for read requests as well (currently that
is done only for write requests).
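
The difference between the two completion rules can be sketched with a toy model. This is not the rbd or libceph code; `ToyWrite` and its methods are hypothetical names invented for illustration, and the model only assumes that an acknowledgement may arrive before the write is reported safe.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWrite:
    """Hypothetical stand-in for one in-flight write request."""
    acked: bool = False   # an osd has acknowledged the write
    safe: bool = False    # the write has been reported as durable
    events: list = field(default_factory=list)

    def on_ack(self):
        self.acked = True
        self.events.append("ack")

    def on_safe(self):
        self.safe = True
        self.events.append("safe")

    def done_old(self):
        # Rule described as current behaviour: complete on acknowledgement.
        return self.acked

    def done_new(self):
        # Proposed rule: a write completes only once it is safe.
        return self.safe


req = ToyWrite()
req.on_ack()
print(req.done_old(), req.done_new())   # acknowledged but not yet safe
req.on_safe()
print(req.done_old(), req.done_new())   # both rules now agree
```

In the window between the two events, the old rule already reports the write as complete while the new rule does not, which is the window in which an osd disappearing would matter.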


#1 Updated by Alex Elder about 7 years ago

I have this implemented and will post a patch for
review after I've tested. It was easier than

Note: despite what I originally suggested, I did not
have the osd client call the safe callback for reads.

#2 Updated by Alex Elder about 7 years ago

  • Status changed from In Progress to Fix Under Review

The following patch has been posted for review. It's one of three
new patches available in the "review/wip-rbd" branch of the
ceph-client git repository.

[PATCH] rbd: wait for safe callback for write requests

#3 Updated by Alex Elder about 7 years ago

Josh has reviewed this patch and the two others I posted
with it. I was testing the three of them together yesterday
and hit a strange error, so I've held off committing any of
these patches. Here are the two messages (I saw

libceph: get_reply front 130 > preallocated 117
libceph: read_partial_message skipping long message (36864 > 0)

I got that pair of messages, and then the connection to osd4
got closed 900 seconds later, and again 900 seconds after that,
and so on. At some point (over 6 hours later) the "skipping
long message" message appeared again, alone. That sort of
pattern (socket closing after a delay, repeatedly followed
by "skipping long message") repeated a few more times, with
the time between the "skipping" varying a lot.

The size of the front portion of the incoming message suggests
the message was the reply from an osd request to a format 1
image with a single op:

 4    object name length (= 31)
31    object name (something like "rb.0.1063.63028a35.000000000000")
17    pgid (version[1], pool[8], seed[4], preferred[4])
 8    flags
 4    result
 4    reassert epoch
 8    reassert version
 4    osdmap epoch 
 4    num_ops (= 1)
38    1 osd op
 4    retry attempt 
 4    1 op result
130 bytes
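
As a quick cross-check of the arithmetic, the field sizes listed above can be summed in a few lines. This is only a sketch: the `FIELDS` table and `front_len` helper are illustrative names, and the sizes are copied from the breakdown rather than taken from any header file.

```python
# Field sizes in bytes, copied from the breakdown above.
FIELDS = [
    ("object name length", 4),
    ("object name", 31),
    ("pgid", 1 + 8 + 4 + 4),   # version, pool, seed, preferred
    ("flags", 8),
    ("result", 4),
    ("reassert epoch", 4),
    ("reassert version", 8),
    ("osdmap epoch", 4),
    ("num_ops", 4),
    ("1 osd op", 38),
    ("retry attempt", 4),
    ("1 op result", 4),
]


def front_len(fields):
    """Total size of the front portion, in bytes."""
    return sum(size for _, size in fields)


name = "rb.0.1063.63028a35.000000000000"
print(len(name))           # 31
print(front_len(FIELDS))   # 130
```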

The indicated "preallocated" size is what was set aside for
the front portion of the reply message when the message
got created. 117 makes no sense, because all osd reply
messages are allocated with size 512 bytes.

So something seems to have gotten corrupted along the way
in the osd request's reply message.

Furthermore, the long message indicates that the reply
contained 36864 (= 36 * 1024) bytes of data, but the
reply message had no room (0 bytes) available to receive it.

So I conclude from this that the response message set aside
for a request to osd 4 got corrupted (I suppose reused)
while it was in flight.
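
The two size comparisons implied by the log lines can be modelled as follows. This is a toy reconstruction based only on the message text, not the libceph source; `check_reply` and its parameter names are hypothetical.

```python
def check_reply(front_len, front_prealloc, data_len, data_room):
    """Return warnings in the style of the two log lines quoted above."""
    warnings = []
    if front_len > front_prealloc:
        # Incoming front is larger than the buffer set aside for it.
        warnings.append(
            f"get_reply front {front_len} > preallocated {front_prealloc}")
    if data_len > data_room:
        # Incoming data payload does not fit in the reply message.
        warnings.append(
            f"read_partial_message skipping long message "
            f"({data_len} > {data_room})")
    return warnings


# Values seen in the log: a 117-byte front buffer and no room for data.
for line in check_reply(130, 117, 36 * 1024, 0):
    print("libceph:", line)

# A reply with a 512-byte front and enough data room produces no warnings.
print(check_reply(130, 512, 36 * 1024, 36 * 1024))
```

With the expected 512-byte front allocation, neither comparison trips, which is consistent with the reply message having been changed after it was set up.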

And now that I've analyzed this, my suspicion is that it has
to do with the patch
"libceph: add lingering request reference when registered".

I think perhaps there was another spot where something
like kick_requests() needs to take a reference to avoid
a data structure going away while we unregister it and
re-register it.
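
The hazard being guessed at here can be illustrated with a toy reference-counting model. Nothing below is libceph code; `ToyRequest`, `ToyRegistry` and `rekick` are hypothetical names, and the model assumes the registry holds the only remaining reference when the request is unregistered.

```python
class ToyRequest:
    def __init__(self):
        self.refs = 1        # reference held by the creator
        self.freed = False

    def get(self):
        assert not self.freed, "use after free"
        self.refs += 1

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.freed = True


class ToyRegistry:
    def __init__(self):
        self.items = set()

    def register(self, req):
        req.get()            # the registry holds its own reference
        self.items.add(req)

    def unregister(self, req):
        self.items.remove(req)
        req.put()            # may drop the last reference


def rekick(registry, req, hold_ref):
    """Unregister and re-register req; return True if it survived."""
    if hold_ref:
        req.get()            # temporary reference across the gap
    registry.unregister(req)
    survived = not req.freed
    if survived:
        registry.register(req)
    if hold_ref:
        req.put()
    return survived


for hold in (False, True):
    reg, req = ToyRegistry(), ToyRequest()
    reg.register(req)
    req.put()                # creator drops its reference
    print(hold, rekick(reg, req, hold))
```

Without the temporary reference the request is freed between the unregister and the re-register; with it, the request survives and ends up registered again with a single reference.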

#4 Updated by Alex Elder about 7 years ago

  • Status changed from Fix Under Review to Resolved

The following has been committed to the ceph-client
"testing" branch:

70c725f rbd: wait for safe callback for write requests
