Subtask #4125 (closed)

Feature #3761: kernel messenger: need to support multiple ops per request

kernel messenger: support multiple sources of data

Added by Alex Elder about 11 years ago. Updated about 11 years ago.

Status: Resolved
Priority: Normal
% Done: 0%

Description

We need to allow the osd client to have an array of osd ops
such that each can supply its own buffer for sending data
to and/or receiving data from the osd.

There was support added for supplying small amounts of data
to be sent, along with an array of pages into which response
data can be received at one time. This took the form of the
"trail" of a message, which was put in place to allow a class
method call to define its class, method, and input data as
well as the buffer to hold the result. But this is a little
bit kludgy and in any case is restrictive in ways that won't
work for what we need.

I have come up with an approach I'd like to pursue to
implement this, though. I can and will break it down into
smaller tasks related to this one.

The basic idea is to generalize the way the trail of a message
is defined, so that instead of a pagelist, it becomes a list
of descriptors, each of which defines a place to hold data to
send. More than one type of data holder will be allowed:
not just a pagelist, but also an array of pages or a bio
list. Interface routines will allow adding a new data
holder of a given type to the end of the trail.

When the messenger is sending the trail portion of the message
it will cycle through each of these, providing pages from each
data holder in calls to sendmsg() to send them over the wire.

My hope is that this capability can then be moved in to replace
the pages, pagelist, bio, and trail pieces in a ceph message
structure with a single unified chain of data descriptors.
Or more likely, two chains, one describing data to send and
one describing places to put received response data.
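
To make the descriptor idea concrete, here's a rough sketch
of the sort of structure I have in mind (all names here are
illustrative only, not a committed interface):

struct ceph_msg_data_desc {
    struct list_head links;        /* entry in the message's trail list */
    enum {
        CEPH_MSG_DATA_PAGELIST,    /* data held in a ceph_pagelist */
        CEPH_MSG_DATA_PAGES,       /* data held in an array of pages */
        CEPH_MSG_DATA_BIO,         /* data held in a bio list */
    } type;
    union {
        struct ceph_pagelist *pagelist;
        struct {                   /* page array */
            struct page **pages;
            size_t length;
            unsigned int alignment;    /* data offset in first page */
        };
        struct bio *bio;
    };
};

The interface routines would then just allocate a descriptor
of the requested type, fill it in, and link it onto the tail
of the trail's descriptor list.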

OK, that's the general idea, I'll offer more with some updates.


Subtasks (2 total: 0 open, 2 closed)

Subtask #4401: libceph: record bytes not page count for requests (Resolved, Alex Elder, 03/09/2013)

Subtask #4589: libceph: consolidate maintenance of message data length (Resolved, Alex Elder, 03/29/2013)

#1

Updated by Alex Elder about 11 years ago

Here's a short list of smaller tasks I can identify right
now in order to implement this. I may create a new issue
for each, but for now I'm just going to document my list.

- Modify the trail portion of a message so instead of
being a ceph pagelist pointer, it's a "message data"
structure containing one of those pointers.
- Define accessor routines to ensure the trail data is
treated everywhere as an abstraction fully contained
within the messenger code.
- Add a type field to the message data structure, to
indicate how the data is formatted (initially just
a pagelist).
- Abstract the way write_partial_msg_pages() gets the
next page, page offset, and length to send while
processing the trail portion of the message (a rough
sketch of that kind of helper appears at the end of
this note).

At that point I need to do a little more to know the
right way to proceed, but generally:
- Make the shift from the trail being a message data
structure to being a list containing message data
structures.
- Maybe abstract how all data pages (not just the
trail) get selected in write_partial_msg_pages().
- Maybe add to the types supported by the message
data structure, e.g. page array.
- Maybe work on the "out_msg_pos" field in a ceph
connection to reflect these changes.
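
For the write_partial_msg_pages() items above, the kind of
abstraction I'm thinking of looks roughly like this (a sketch
only--neither this helper nor the per-type functions it calls
exist yet):

/* Return the page holding the next data to send, along with the
 * offset into that page and the number of bytes in it still to
 * be sent.  Which per-type helper gets used depends on how the
 * message data is held.
 */
static struct page *msg_data_next_page(struct ceph_msg_data *data,
                                       size_t *page_offset, size_t *length)
{
    switch (data->type) {
    case CEPH_MSG_DATA_PAGELIST:
        return pagelist_next_page(data, page_offset, length);
    case CEPH_MSG_DATA_PAGES:
        return page_array_next_page(data, page_offset, length);
    case CEPH_MSG_DATA_BIO:
        return bio_next_page(data, page_offset, length);
    default:
        BUG();
        return NULL;
    }
}

With something like that in place, write_partial_msg_pages()
no longer needs to know anything about pagelists, page arrays,
or bios; it just sends whatever page it is handed.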

#2

Updated by Alex Elder about 11 years ago

I've been winding my way through code this morning and
I've discovered there's something in the ceph message
protocol that may interfere with me doing what I had
intended to do.

A ceph message header contains a single "data_off" field
which is an offset into a page into which the received
data is to land. Because this is a per-message field, it
cannot represent a data offset for more than one hunk of
data; it assumes a single block of data arrives at the
given offset.

I think it may be OK if we can just ensure both sides
of an osd client connection always use 0 for this value
in the message header, and handle the data defined by
individual ops within the osd message in a way that
takes care of whatever page offsets are required for
the type of data being sent/received.
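
On the kernel client side that would amount to nothing more
than something like this (sketch only):

    req->r_request->hdr.data_off = cpu_to_le16(0);

with any per-op page offsets carried in (and recovered from)
the data descriptors themselves rather than the header.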

#3

Updated by Alex Elder about 11 years ago

Another thing about that field. In ceph_osdc_build_request(),
the 64-bit "off" value is subject to this:

req->r_request->hdr.data_off = cpu_to_le16(off);

I have no idea what the result (on the osd) is if the offset
is > 65536 bytes (which I am pretty sure it can be). Maybe
the hdr.data_off value is ignored on the osd side?

#4

Updated by Sage Weil about 11 years ago

That field has no semantic meaning. All it does is control the alignment of the memory the OSD reads into, which will let it avoid copying the data (again) to realign if it needs to do direct-io at some point.

Ideally, it would be set so that the largest chunk of data in the request is aligned properly.

#5

Updated by Alex Elder about 11 years ago

> That field has no semantic meaning. All it does is
> control the alignment of the memory the OSD reads
> into, which will let it avoid copying the data (again)
> to realign if it needs to do direct-io at some point.

Interesting. Now I understand its purpose, although to
actually be of benefit the client has to know the page size
(or at least the direct I/O alignment requirements) of the
osd (server). (64KB should be very safe; 4KB--and even
512B--might be sufficient.)
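
Concretely, my reading of how the hint works (illustrative
numbers, assuming 4KB pages):

    u64 off = 0x12345;            /* object offset of the data */
    u16 data_off = off & 0xffff;  /* goes in the header: 0x2345 */
    /*
     * The receiver can allocate its buffer so that the data starts
     * (data_off & ~PAGE_MASK) == 0x345 bytes into the first page,
     * giving it the same in-page alignment as the data's eventual
     * destination, so direct I/O needs no realignment copy.
     */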

It seems to me the ideal case has an offset value of 0,
otherwise read/modify/write is required somewhere.

On the kernel side I think I'll be able to avoid all such
extra copies (on reads anyway). And I think I could get rbd
to avoid them on writes as well.

#6

Updated by Alex Elder about 11 years ago

I should have updated this before I guess.

It turns out I implemented something fairly similar to what
was originally described, but I did not use the trail of the
message. After some analysis of how the message was actually
used I learned that it was possible to kill off the trail
entirely and just use a single data item for all types of
data.

So there is a single "message data" structure in a message;
it's just not the trail.

There is a type field, and there are accessor functions
that allow the data transfer functions to treat message
data generically.

The out_msg_pos (and its incoming counterpart) are now gone
in favor of cursors that match the generic data types.
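
In rough outline, a cursor is something like this (simplified;
field names illustrative). It tracks how far through a single
data item the transfer code has gotten and hands back the next
piece to operate on:

struct ceph_msg_data_cursor {
    size_t resid;      /* bytes in the data item not yet transferred */
    bool   last_piece; /* the current piece is the final one */
    /* ... plus per-type state: current page and offset for a page
     * array, position within a pagelist, or a bio iterator ...
     */
};

The send and receive paths ask the cursor for the next page,
page offset, and length, transfer up to that many bytes, and
then advance the cursor by however many bytes actually moved.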

Overall I've arranged things pretty much along the lines
of what I envisioned a month ago.

#7

Updated by Alex Elder about 11 years ago

Another status update, based on the history in this issue.

Sage straightened me out about the "data_off" field. It
is an optimization: the buffer for received data can be
allocated such that the data lands in a naturally aligned
position, so that large quantities of incoming data can be
submitted to the backing storage using direct I/O, avoiding
an extra copy. I haven't changed any of how that works.

The fact that "only" 16 bits worth of this alignment data
is sent over the wire is just fine.

I still don't like this line of code in ceph_osdc_build_request():
req->r_request->hdr.data_off = cpu_to_le16(off);
because "off" is a 64-bit value.
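
What I have in mind is just making the truncation explicit,
something along these lines (not necessarily the exact form
that ends up committed):

    req->r_request->hdr.data_off = cpu_to_le16(off & 0xffff);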

For the most part though, I would say that I'm satisfied that the
content of this issue has been completed. It's all easily
subsumed by the broader http://tracker.ceph.com/issues/3761.

My plan now is to leave the fix of that silly cpu_to_le16(off)
issue as the final thing to do on this issue, and when that's
done I'll mark it resolved.

#8

Updated by Alex Elder about 11 years ago

  • Status changed from In Progress to Resolved

As mentioned, I fixed a simple problem (which was
more a problem for the reader than the computer),
and am therefore marking this issue complete.

The following has been committed to the ceph-client
"testing" branch:
bc7bdaa libceph: be explicit in masking bottom 16 bits

All other work related to this subtask will be handled
under:
http://tracker.ceph.com/issues/3761
