Project

General

Profile

Bug #4392

kclient: libceph: osd ops send out of order

Added by Sage Weil about 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I'm observing:

...
2013-03-08 09:20:43.532043 7f518e4e8700  1 -- 10.214.131.25:6803/9175 <== client.4121 10.214.131.21:0/1062688625 889 ==== osd_op(client.4121.1:2922 rb.0.1018.32fff902.000000000394 [write 1638400~4096] 2.894bd26d e6) v4 ==== 170+0+4096 (2511056042 0 84184609) 0x28796c0 con 0x29cc840
2013-03-08 09:20:43.532115 7f518e4e8700  1 -- 10.214.131.25:6803/9175 <== client.4121 10.214.131.21:0/1062688625 890 ==== osd_op(client.4121.1:2923 rb.0.1018.32fff902.000000000394 [write 1650688~4096] 2.894bd26d e6) v4 ==== 170+0+4096 (642626524 0 520042991) 0x28bf900 con 0x29cc840
2013-03-08 09:20:43.532262 7f518e4e8700  1 -- 10.214.131.25:6803/9175 <== client.4121 10.214.131.21:0/1062688625 891 ==== osd_op(client.4121.1:2924 rb.0.1018.32fff902.000000000394 [write 1675264~4096] 2.894bd26d e6) v4 ==== 170+0+4096 (2150460411 0 333208667) 0x3dd0000 con 0x29cc840
2013-03-08 09:20:43.532394 7f518e4e8700  1 -- 10.214.131.25:6803/9175 <== client.4121 10.214.131.21:0/1062688625 892 ==== osd_op(client.4121.1:2925 rb.0.1018.32fff902.000000000378 [write 3526656~4096] 2.cbde4900 e6) v4 ==== 170+0+4096 (2191776161 0 3788854758) 0x3dd0b40 con 0x29cc840
2013-03-08 09:20:43.532568 7f518e4e8700  1 -- 10.214.131.25:6803/9175 <== client.4121 10.214.131.21:0/1062688625 893 ==== osd_op(client.4121.1:2927 rb.0.1018.32fff902.000000000394 [write 2015232~4096] 2.894bd26d e6) v4 ==== 170+0+4096 (1701526991 0 3978754102) 0x3dd0900 con 0x29cc840
2013-03-08 09:20:43.532746 7f518e4e8700  1 -- 10.214.131.25:6803/9175 <== client.4121 10.214.131.21:0/1062688625 894 ==== osd_op(client.4121.1:2928 rb.0.1018.32fff902.000000000394 [write 2240512~4096] 2.894bd26d e6) v4 ==== 170+0+4096 (2560037934 0 2868297069) 0x4c5e6c0 con 0x29cc840
2013-03-08 09:20:43.532881 7f518e4e8700  1 -- 10.214.131.25:6803/9175 <== client.4121 10.214.131.21:0/1062688625 895 ==== osd_op(client.4121.1:2929 rb.0.1018.32fff902.000000000394 [write 2859008~8192] 2.894bd26d e6) v4 ==== 170+0+8192 (3944655821 0 3127406912) 0x4c5e480 con 0x29cc840
2013-03-08 09:20:43.533021 7f518e4e8700  1 -- 10.214.131.25:6803/9175 <== client.4121 10.214.131.21:0/1062688625 896 ==== osd_op(client.4121.1:2930 rb.0.1018.32fff902.000000000394 [write 3117056~4096] 2.894bd26d e6) v4 ==== 170+0+4096 (1220613452 0 2585486749) 0x4c5e000 con 0x29cc840
2013-03-08 09:20:43.533158 7f518e4e8700  1 -- 10.214.131.25:6803/9175 <== client.4121 10.214.131.21:0/1062688625 897 ==== osd_op(client.4121.1:2926 rb.0.1018.32fff902.000000000394 [write 1826816~4096] 2.894bd26d e6) v4 ==== 170+0+4096 (1708157385 0 45662994) 0x488fb40 con 0x29cc840
2013-03-08 09:20:43.533362 7f518e4e8700  1 -- 10.214.131.25:6803/9175 <== client.4121 10.214.131.21:0/1062688625 898 ==== osd_op(client.4121.1:2931 rb.0.1018.32fff902.000000000394 [write 3452928~4096] 2.894bd26d e6) v4 ==== 170+0+4096 (2409703780 0 3821444837) 0x4ab9d80 con 0x29cc840
2013-03-08 09:20:43.533503 7f518e4e8700  1 -- 10.214.131.25:6803/9175 <== client.4121 10.214.131.21:0/1062688625 899 ==== osd_op(client.4121.1:2932 rb.0.1018.32fff902.000000000394 [write 3956736~4096] 2.894bd26d e6) v4 ==== 170+0+4096 (2391419755 0 3592709662) 0x4ab9b40 con 0x29cc840
...

notice that 2926 is sent out of order.

History

#1 Updated by Alex Elder about 11 years ago

All osd requests are sent in the order they sit on the
osd client req_unsent list. The tid for an osd request
is set __register_request().

There are three places in the osd client where requests
are added to the unsent list.

One place is __map_request(). There, new requests for
which the target osd is recognized are added to the list.
The tid is expected to have already been set by that point.

The other two are in __kick_osd_requests(), where sent
but incomplete messages are moved there from the o_requests
list. In this case the tid is left unchanged (and I expect
if there are any unsent message at this point their tids
will be less than those added in this process).

Second, __kick_osd_requests() adds lingering requests back
to the unsent list, but in this case the requests are
re-registered so this should not cause this problem.

Three places call __map_request(). The first two are
in kick_requests(). The first re-maps any non-linger
request, and does nothing with the tid, so I expect
this can lead to out-of-order tids in sent requests.

The second re-maps lingering requests, before it
re-registers the request. (The order should be OK
though because it does these under protection of the
request mutex.)

The third place __map_request() is called is in
ceph_osdc_start_request(). That calls __map_request()
right after registering the request (although it
drops and re-acquires the osdc mutex between those
two calls).

So in summary, I think there are three problem areas:
- In ceph_osdc_start_request(), the request should
be registered just before sending, without dropping
the request mutex.
- The second is in kick_requests(), where sent requests
are appended to the unsent list without regard for
the tid.
- The third is in __kick_osd_requests(), which suffers
the same problem as kick_requests() (but just for one
osd's requests).

The first one is probably the least troublesome spot.
I'm not sure why the first case doesn't register the
request while holding the request already, it looks
entirely possible without any trouble. (This seems
to be a fairly old problem; this code hasn't changed
in quite a while.)

I think we may be able to address the last two by building
a list of the affected requests first, then pre-pending
that list to the unsent list. All of the sent requests
should have (by definition, or by induction at least) lower
tid than anything not yet sent. (I'm pretty sure this is
fairly old behavior as well; the new osd code is just
reporting seeing it now.)

Sage, what do you think?

#2 Updated by Sage Weil about 11 years ago

On the userland side of things, we put the requests to be resent into a map<tid,Request*> so that they get sorted before being requeued. I forget exacty what subtlety forced us to that point. If we can get away with manipulating the lists in the right order, that's great...

#3 Updated by Alex Elder about 11 years ago

  • Status changed from New to In Progress

I have a series of patches I'm now testing that I believe
will address this. They don't do exactly what I suggested,
but the underlying point is the same.

Basically, osd requests are added to the unsent list with
a little more attention paid to whether they should be
added at the front of the list (when in-flight or unsent
requests get re-queued) or the back (when adding the
request to the list follows registration of the request).

There were several spots things had to be changed because
they were being added to the wrong end of the list.

(It may be that these cases didn't matter--couldn't
happen--but that's not very easy to see.)

(Whoops, I thought I already submitted this a few hours ago).

#4 Updated by Alex Elder about 11 years ago

  • Status changed from In Progress to Fix Under Review

The following patches have been posted for review.

[PATCH 1/6] libceph: slightly defer registering osd request
[PATCH 2/6] libceph: no more kick_requests() race
[PATCH 3/6] libceph: prepend requests in order when kicking
[PATCH 4/6] libceph: keep request lists in tid order
[PATCH 5/6] libceph: send queued requests when starting new one
[PATCH 6/6] libceph: verify requests queued in order

#5 Updated by Alex Elder about 11 years ago

I tested (am testing) these patches overnight.

I ran my xfstests over rbd task with this in it:

- ceph:
branch: next
conf:
global:
ms inject socket failures: 5000
mon:
debug mon: 10
debug paxos:
osd:
debug ms: 1
debug journal: 20
osd debug op order: true

The testing branch failed at around test 70, after I
think under an hour of running.

I ran the same yaml file using a kernel with my patches
applied. One complete pass completed on a cluster of
3 mira machines, and 2 (the third pass is nearly complete)
were successful on a cluster with plana machines.

Sage mentioned to me a possible problem in my last patch,
the one that warns if it sees out-of-order requests put
on the unsent list, but otherwise this all bodes well.

#6 Updated by Alex Elder about 11 years ago

Sage explained something to me yesterday after taking a
quick look at my posted patches:
- the requirement for requests arriving at the osd in
tid order is imposed for requests targeted to a given
object from a given client (not all objects for a
given client)
So implied by that is the fact that request tids may
be out of order, but it's OK as long as they're for
different objects.

After thinking about that last night I have revised my
patches a little bit.

First, only previously-sent requests for an object will
be re-queued on the unsent list. Previously all requests
for the object--whether they'd been sent or not--were being
removed from whichever list they were on (lru or unsent)
and placed back onto the unsent list. Now if a request is
unsent it's not removed, and this keeps such requests in
the correct (ordered) position on that list. Already-sent
requests are prepended to the unsent list, in their original
order. This will also keep the unsent list sorted, because
any sent request will have a lower tid than any existing
unsent one.

There are two improvements implied by this:
- all requests on the unsent list will remain in order, always,
even across requests targeted to different object
- unsent requests targetd to a reset osd will not be
rescheduled with respect to other pending requests.
(Previously they were all being moved to the front
of the unsent queue, even if they'd never been sent
before.)

I'm about to test my updated code and unless something
goes terribly wrong I'll post the new patches for review
shortly.

#7 Updated by Alex Elder about 11 years ago

The following updated patches have been posted for review:

[PATCH 3/6, v2] libceph: requeue only sent requests when kicking
[PATCH 6/6, v2] libceph: verify requests queued in order

#8 Updated by Alex Elder about 11 years ago

  • Status changed from Fix Under Review to Resolved

The following changes have been committed to the ceph-client
"testing" branch:

0518f43 libceph: send queued requests when starting new one
ed3f1e8 libceph: keep request lists in tid order
c58d8e4 libceph: requeue only sent requests when kicking
3742149 libceph: no more kick_requests() race
39a151f libceph: slightly defer registering osd request

Also available in: Atom PDF