Project

General

Profile

Actions

Bug #4284

closed

rbd: bio getting set more than once in a message?

Added by Alex Elder about 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have to reproduce this again because my target machine
that hit this got rebooted. But a recent proposed change
encapsulates setting the bio field (and a couple others)
in a message. It adds an assertion that verifies it never
gets done twice.

I got a crash that I believe hit one of those assertions.

It's not directly related to any checked-in code, but I'm
spending some time (hours) on it so I thought I should
give it its own bug.

Actions #1

Updated by Alex Elder about 11 years ago

I have reproduced the problem--just. I haven't had a chance
to look at it but I think I will soon be able to identify
exactly what's caused the crash, and can work back from there.

Actions #2

Updated by Alex Elder about 11 years ago

We are in the path where an osd response message has
been received, and the messenger has called its
alloc_message connection operation (method). There
is incoming data, and we have just set the pages
and are now setting the bio in the reply message
found in the request found via __lookup_request().

Actions #3

Updated by Alex Elder about 11 years ago

OK, back again after a meeting.

I was thinking about it, and I wonder whether this has to
do with a message that got re-sent (or re-sent by the
other end, therefore a duplicate receive or something).

Not sure at this point, I just thought I'd mention it.

Actions #4

Updated by Alex Elder about 11 years ago

I'm too tired to offer a guarantee, but I'm pretty sure
this is what's going on. A connection fault occurs while
receiving an incoming response to an in-flight osd request.
Once that response starts getting received, its bio pointer
will have been set in the response message.

The request gets re-sent, and when the response to that
re-sent request arrives, the same response message is used
to receive the the response and its data. The bio pointer
gets set there again, which is triggering the assertion
failure.

One way to fix this is to insert handling that clears
that bio pointer (and other data pointers) in the
response message whenever an in-flight request gets
scheduled to be resent. I would like to do this at
some point, but I think there's a quicker fix.

Namely, rather than assert the bio pointer (or, e.g.,
the pages pointer) is non-null when it gets assigned,
assert it is either non-null or it is equal to what
is getting assigned.

I don't like this because it's not nearly as strong
a check, but it should suffice and I think it can
allow me to proceed.

I'll reconsider this explanation in the morning and
will try out what I propose above.

Actions #5

Updated by Alex Elder about 11 years ago

Testing once more with key debug messages enabled that will I
think allow me to prove my explanation is correct.

Actions #6

Updated by Alex Elder about 11 years ago

After further consideration I don't think this is going to
be good enough. My goal here is to allow multiple blocks
of data of arbitrary type to be added to a message so it
can be used for sending (or receiving) from multiple ops.
So I plan to stay with that stronger assertion.

The problem is that the data fields are getting set when
the message is sent, which can happen more than once.
Instead, they should be set up by the builder of the message
once it hands it off to the osd client for processing.

Basically, for the case at hand bio pointer needs to be
set in the osd response message by the rbd code before
it calls ceph_osdc_start_request(). That way the message
is all ready for receiving the data when it arrives.

I don't expect this to be hard to change, but it'll probably
require a few patches to do cleanly.

Actions #7

Updated by Alex Elder about 11 years ago

My test to verify my explanation seems to have done so.

I had a message print when get_reply() was called. This
is called when the messenger has incoming data on an osd
connection to allocate a message into which to receive it.
get_reply() looks at the message header and looks up the
osd request corresponding to its transaction id, and
returns the reply message allocated for that message, and
just before doing so, it assigns the bio field.

I also had a message print when handle_reply() is called,
which is when the osd processes the reply and completes
the request.

The assertion failed after get_reply() was called--but
before handle_reply() had been called--for transaction
157261. Before the assertion hit, the osd connection
was reset and the request for transaction 157261 was
resent.

The assertion failed when the messenger was calling
get_reply() (again) for transaction 157261, and it
found the bio pointer had already been set.

Actions #8

Updated by Alex Elder about 11 years ago

I think I'm going to check in the functions that
isolate/abstract message data operations without
assertions enabled. When I do that, I'll close
this: http://tracker.ceph.com/issues/4263

I have a lot of follow-on work that's now being
tested, and I'm close to being able to re-enable
these assertions. That is, I'll be able to set
the information about therequest data before
calling ceph_osdc_start_request(), and at that
point I it should be possible to re-enable
the assertions.

I'll close this bug when I am able to do that.

Actions #9

Updated by Alex Elder about 11 years ago

I have posted this series of patches for review, and patch 4
resolves this issue. Patch 3 also tightens up the mds client
so it only sets message data when there are bytes to send, and
the last patch enables the assertions (including the one that
led to creating this issue).

[PATCH 0/5] ceph: abstract message data information setting
[PATCH 1/5] libceph: isolate message page field manipulation
[PATCH 2/5] libceph: isolate other message data fields
[PATCH 3/5] ceph: only set message data pointers if non-empty
[PATCH 4/5] libceph: set response data fields earlier
[PATCH 5/5] libceph: activate message data assignment checks

Actions #10

Updated by Alex Elder about 11 years ago

  • Status changed from In Progress to Resolved

This has been reviewed and is committed to the "testing" branch.
c61ffd1 libceph: set response data fields earlier

Actions

Also available in: Atom PDF