krbd: wait for safe callback for writes
Right now rbd only waits for the acknowledgement callback
for all osd requests. This means that an rbd client may
have assumed written data is durable when it is not (in
the event the acknowledgement got sent just before an
Change rbd to wait for the safe callback for all write
requests. For convenience, change the osd client to call
the safe callback for read requests as well (currently it is
called only for write requests).
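To illustrate the distinction the description draws, here is a minimal toy model in Python. It is not the kernel code: the class `WriteRequest` and its methods are hypothetical names, and it assumes that a "safe" notification also implies the acknowledgement. It only shows why a caller that returns on the acknowledgement can believe a write is durable before it is.

```python
import threading


class WriteRequest:
    """Hypothetical stand-in for an in-flight osd write request."""

    def __init__(self):
        self.acked = threading.Event()  # osd has acknowledged the write
        self.safe = threading.Event()   # osd reports the write as durable

    def on_ack(self):
        # Acknowledgement callback: the write was received, not yet durable.
        self.acked.set()

    def on_safe(self):
        # Safe callback: assumed here to imply the acknowledgement too.
        self.acked.set()
        self.safe.set()

    def wait_acked(self, timeout=None):
        # What the description says rbd waits for today.
        return self.acked.wait(timeout)

    def wait_durable(self, timeout=None):
        # What the description proposes for write requests.
        return self.safe.wait(timeout)


req = WriteRequest()
req.on_ack()
acked_only = (req.wait_acked(timeout=0), req.wait_durable(timeout=0))
req.on_safe()
after_safe = (req.wait_acked(timeout=0), req.wait_durable(timeout=0))
print(acked_only, after_safe)
```

In this model a caller using `wait_acked()` returns in the window where `wait_durable()` would still block, which is the window the change is meant to close.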
#2 Updated by Alex Elder about 7 years ago
- Status changed from In Progress to Fix Under Review
The following patch has been posted for review. It's one of three
new patches available in the "review/wip-rbd" branch of the
ceph-client git repository.
[PATCH] rbd: wait for safe callback for write requests
#3 Updated by Alex Elder about 7 years ago
Josh has reviewed this patch and the two others I posted
with it. I was testing the three of them together yesterday
and hit a strange error, so I've held off committing any of
these patches. Here are the two messages (I saw
libceph: get_reply front 130 > preallocated 117
libceph: read_partial_message skipping long message (36864 > 0)
I got that pair of messages, and then the connection to osd4
got closed 900 seconds later, and again 900 seconds after that,
and so on. At some point (over 6 hours later) the "skipping
long message" message appeared again, alone. That sort of
pattern (socket closing after a delay, repeatedly followed
by "skipping long message") repeated a few more times, with
the time between the "skipping" messages varying a lot.
The size of the front portion of the incoming message suggests
the message was the reply from an osd request to a format 1
image with a single op:
      4  object name length (= 31)
     31  object name (something like "rb.0.1063.63028a35.000000000000")
     17  pgid (version, pool, seed, preferred)
      8  flags
      4  result
      4  reassert epoch
      8  reassert version
      4  osdmap epoch
      4  num_ops (= 1)
     38  1 osd op
      4  retry attempt
      4  1 op result
    ---
    130  bytes
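The arithmetic in that breakdown can be checked with a short Python sketch. The field names and sizes are copied from the list above; the names `REPLY_FRONT_FIELDS` and `front_len` are made up for this illustration, and the object name is the example one quoted above, not necessarily the one in the actual message.

```python
# Sizes in bytes of each piece of the reply front, as listed above.
REPLY_FRONT_FIELDS = [
    ("object name length", 4),
    ("object name", 31),
    ("pgid (version, pool, seed, preferred)", 17),
    ("flags", 8),
    ("result", 4),
    ("reassert epoch", 4),
    ("reassert version", 8),
    ("osdmap epoch", 4),
    ("num_ops", 4),
    ("1 osd op", 38),
    ("retry attempt", 4),
    ("1 op result", 4),
]


def front_len(fields):
    """Sum the per-field sizes to get the total front length in bytes."""
    return sum(size for _name, size in fields)


# Example format 1 object name from the breakdown above.
OBJECT_NAME = "rb.0.1063.63028a35.000000000000"

print(len(OBJECT_NAME))               # length of the example object name
print(front_len(REPLY_FRONT_FIELDS))  # total front size
```

The total matches the "front 130" reported in the first log message, and the example object name has the 31 characters its length field claims.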
The indicated "preallocated" size is what was set aside for
the front portion of the reply message when the message
got created. 117 makes no sense, because all osd reply
messages are allocated with size 512 bytes.
So something seems to have gotten corrupted along the way
in the osd request's reply message.
Furthermore, the long message indicates that the reply
contained 36864 (= 36 * 1024) bytes of data, but the
reply message had no room (0 bytes) available to receive it.
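As a rough model of what the two messages are comparing, the sketch below checks an incoming reply's front and data lengths against what was set aside for it. This is a simplified illustration, not the libceph code: `ReplyBuffer` and `check_incoming` are hypothetical names, and the message strings simply mirror the two log lines quoted earlier.

```python
from dataclasses import dataclass


@dataclass
class ReplyBuffer:
    """Hypothetical stand-in for a preallocated reply message."""
    front_alloc_len: int  # bytes set aside for the front portion
    data_alloc_len: int   # bytes set aside for the data portion


def check_incoming(buf, front_len, data_len):
    """Return warnings for an incoming reply that does not fit in buf."""
    warnings = []
    if front_len > buf.front_alloc_len:
        warnings.append(
            f"get_reply front {front_len} > preallocated {buf.front_alloc_len}"
        )
    if data_len > buf.data_alloc_len:
        warnings.append(
            f"read_partial_message skipping long message "
            f"({data_len} > {buf.data_alloc_len})"
        )
    return warnings


# What an intact reply for a 36 KB read would be expected to look like.
expected = ReplyBuffer(front_alloc_len=512, data_alloc_len=36 * 1024)
# What the log messages imply the reply actually looked like.
observed = ReplyBuffer(front_alloc_len=117, data_alloc_len=0)

print(check_incoming(expected, 130, 36864))
print(check_incoming(observed, 130, 36864))
```

With the expected sizes neither check fires; with the observed sizes (117 and 0) both do, which is consistent with the reply message having been altered after it was allocated.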
So I conclude from this that the response message set aside
for a request to osd 4 got corrupted (I suppose reused)
while it was in flight.
And now that I've analyzed this, my suspicion is it has to
do with the patch for http://tracker.ceph.com/issues/3859,
"libceph: add lingering request reference when registered".
I think perhaps there was another spot where something
like kick_requests() needs to take a reference to avoid
a data structure going away while we unregister it and