Josh has reviewed this patch and the two others I posted
with it. I was testing the three of them together yesterday
and hit a strange error, so I've held off committing any of
these patches. Here are the two messages I saw:

    libceph: get_reply front 130 > preallocated 117
    libceph: read_partial_message skipping long message (36864 > 0)

I got that pair of messages, and then the connection to osd4
got closed 900 seconds later, and again 900 seconds after that,
and so on. At some point (over 6 hours later) the "skipping
long message" message appeared again, alone. That pattern
(the socket closing repeatedly after a delay, followed by
"skipping long message") repeated a few more times, with
the time between the "skipping" messages varying a lot.
The size of the front portion of the incoming message suggests
the message was the reply from an osd request to a format 1
image with a single op:

      4  object name length (= 31)
     31  object name (something like "rb.0.1063.63028a35.000000000000")
     17  pgid (version[1], pool[8], seed[4], preferred[4])
      8  flags
      4  result
      4  reassert epoch
      8  reassert version
      4  osdmap epoch
      4  num_ops (= 1)
     38  1 osd op
      4  retry attempt
      4  1 op result
    ---
    130  bytes

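
As a sanity check on the arithmetic above, here is a minimal sketch that
adds up the listed field sizes. The labels are just descriptions copied
from the list, not identifiers from the kernel source, and the object
name is the example one given above.

```python
# Field sizes in bytes, copied from the breakdown above.
# The labels are descriptive only, not kernel identifiers.
FRONT_FIELDS = [
    ("object name length", 4),
    ("object name", 31),
    ("pgid", 1 + 8 + 4 + 4),  # version, pool, seed, preferred
    ("flags", 8),
    ("result", 4),
    ("reassert epoch", 4),
    ("reassert version", 8),
    ("osdmap epoch", 4),
    ("num_ops", 4),
    ("osd op", 38),           # one op
    ("retry attempt", 4),
    ("op result", 4),         # one result
]


def front_len(fields):
    """Sum the byte sizes of all listed fields."""
    return sum(size for _, size in fields)


# Example format 1 object name: prefix plus 12 hex digits.
OBJECT_NAME = "rb.0.1063.63028a35." + "0" * 12

print(front_len(FRONT_FIELDS), len(OBJECT_NAME))
```

The total matches the "front 130" value in the first log message.
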
The indicated "preallocated" size is what was set aside for
the front portion of the reply message when the message
got created. 117 makes no sense, because all osd reply
messages are allocated with size 512 bytes.
So something seems to have gotten corrupted along the way
in the osd request's reply message.
Furthermore, the long message indicates that the reply
contained 36864 (= 36 * 1024) bytes of data, but the
reply message had no room (0 bytes) available to receive
it.
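
To make the two comparisons explicit, here is a simplified model of what
the log messages report. This is not the libceph code; the function and
parameter names are made up for illustration, and it only mirrors the
"incoming size > room available" checks implied by the message text.

```python
def check_reply(front_len, front_room, data_len, data_room):
    """Return the complaints implied by the two log messages.

    Hypothetical helper: compares the incoming sizes against the
    space set aside in the preallocated reply message.
    """
    complaints = []
    if front_len > front_room:
        complaints.append(
            f"front {front_len} > preallocated {front_room}")
    if data_len > data_room:
        complaints.append(
            f"skipping long message ({data_len} > {data_room})")
    return complaints


# Values observed in the log: both checks fail.
observed = check_reply(130, 117, 36864, 0)

# What I would have expected: a 512-byte front and enough data room.
expected = check_reply(130, 512, 36864, 36 * 1024)

print(observed)
print(expected)
```

With the observed values both complaints fire; with a 512-byte front and
36 KB of data space neither does, which is why the numbers point at the
reply message itself having been changed underneath us.
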
So I conclude from this that the response message set aside
for a request to osd 4 got corrupted (I suppose reused)
while it was in flight.
And now that I've analyzed this, my suspicion is it has to
do with the patch for http://tracker.ceph.com/issues/3859,
"libceph: add lingering request reference when registered".
I think perhaps there is another spot where something
like kick_requests() needs to take a reference, to keep
a data structure from going away while we unregister it
and re-register it.
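
To illustrate the kind of problem I suspect, here is a toy reference
counting model. It is not libceph code; the class and function names are
hypothetical, and it assumes the registry holds the only reference, so
that unregistering drops the count to zero.

```python
class Request:
    """Toy refcounted object; the registry holds the initial reference."""

    def __init__(self):
        self.refs = 1
        self.freed = False

    def get(self):
        if self.freed:
            raise RuntimeError("use after free")
        self.refs += 1

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # stands in for freeing the memory


def unregister(req):
    req.put()  # registry drops its reference


def register(req):
    req.get()  # registry takes a reference; fails if already freed


def requeue_unsafe(req):
    unregister(req)  # last reference dropped, object is gone
    register(req)    # touches a freed object


def requeue_safe(req):
    req.get()        # hold our own reference across the pair
    unregister(req)
    register(req)
    req.put()
```

In this model requeue_unsafe() raises because the object is freed between
the two calls, while requeue_safe() leaves it registered with a single
reference.
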