kernel messenger: need to support multiple ops per request
The kernel messenger currently gets message data from either
a bio list or a page vector. That is one or the other, and
exactly one of them.
An osd client request (which travels in a single ceph messeng)
currently includes a single osd "op", and as such, a single
source of data for that op is sufficient.
But in order to support layering, we need to be able to support
at least two ops in a single osd request. And each op will
need to supply its own block of data.
Specifically, a write request doing a "copy up" operation
needs to supply an op that contains the full set of data
for an underlying osd object (fetched from the parent image)
and another op that contains the new data to be written
on top of that data.
I have some general ideas on how to go about this but it's
going to involve a bit of work--more like a week than a day.
#1 Updated by Alex Elder about 8 years ago
- Status changed from New to In Progress
In order to support multiple ops for osd requests we're
going to have to do some work on the messenger as well.
I'm doing a little prototyping this morning in the messenger
code to see if an idea will pan out, so I'm marking this
in progress to indicate that.
#2 Updated by Alex Elder about 8 years ago
Here's what I'm prototyping.
I'm going to try to expand the definition of the trail portion of
a message so instead of simply a page list, it's a list of data
sources which have more than one possible type (the first of which
will be pagelist, but in addition it'll allow a page array).
Ideally this generalization can involve a bio list as well, and
then I'd like to replace the (pagelist, pages, bio and trail)
fields in a message (along with their other associated fields)
with something this one list of data sources.
Then, the data portion of a message will be represented as a series
of blobs of data, which can be presented in any of those ways
(which we already support now). And adding an op to an osd message
will involve adding the op along with whatever data it requires.
This will eliminate the restrictive ordering imposed by the
use of the trail field, and will hopefully simplify the messenger
code by handling things generically.
#6 Updated by Alex Elder almost 8 years ago
I haven't updated this in a long time.
I have now implemented the majority of what I described
in my "prototyping" comment from several weeks ago.
- There is a new generic "ceph message data item" type
that represents one of several possible sources (or
destinations) of data. A page array, a page list,
and a bio are all currently representable as a data
- The trail of a message is no longer used. A long
series of changes made it possible in steps to both
prove it was not needed and rearrange things so it
was no longer used.
- The pages, pagelist, bio (and trail) of a message
are no longer maintained separately; the message
simply has a single "data" field that represents
any/all types of data that might be required for
a message. (This isn't fully checked in but will
be in the next day or so.)
- An osd request now computes the length of its
outgoing data based on the array of ops that are
enclosed in the request.
- Separately, Sage added support for a "new" message
protocol (apparently around for some time in the
user space code) that allows individual blocks of
data for each op in a request's op array to be
described separately in the response message.
So this work is finally nearing completion, but
there are a few other things to get finalized first.
#10 Updated by Alex Elder almost 8 years ago
What I've been trying to do today is make it so a message's
header length is defined in a centralized location, based
on the actual data content stored in the message. That way
I can append data to the message without concern for errant
code overwriting what's already been added.
I've been creating abstract interfaces routines in order to
find and isolate code that accesses it directly, then update
it to use the interface. ceph_osdc_build_request() is a
nice routine that takes an array of osd ops and encodes them
into the message, and that is where I plan to add things up.
But in ceph_writepages_start() (which indirectly uses
ceph_osdc_build_request() via ceph_osdc_new_request()),
it sort of surgically changes the message content, both
the op and the data length in the header, after it
has been formatted by ceph_osdc_build_request().
This wouldn't be such an issue but I'm trying to hurry up
and finish some stuff before going on vacation, and I'm
just not sure I'll be able to at this point because that
function is a bit too complicated to mess with at this
point in the day...
Fortunately, the other users of ceph_osdc_new_request()
don't do that sort of heresy: start_read(), ceph_sync_write(),
ceph_osdc_readpages(), and ceph_osdc_writepages() all follow
the general form:
sets up data portion of the message
And that works fine for what I'm trying to do.
I'll see if I can figure out how to make ceph_writepages_start()
avoid twiddling bits in the message. I could just leave that
as a weird anomaly but I really hate doing that.
#12 Updated by Alex Elder almost 8 years ago
I forked off another bug one day for some work I was
doing related to this--basically layout a specific
plan once it became clear what needed to be done.
We agreed that only one or the other was needed, so
I'm going to retain this one and close the other.
I'm not going to transfer over information from that
other bug; it should be stil available here if necessary:
I have implemented and committed most of what was
- we no longer have separate pagelist, pages, bio,
and trail portions of a message, we just have a
single "data" structure that describes any of
those equally well
- that is currently a structure, but the change to
make it a pointer is pending, just not committed
yet (it sits behind getting bug 4450 fixed, which
I hope should be quick)
The last part of this will be to use a list of
these rather than a single one, and then use that
in the osd client.
#14 Updated by Alex Elder almost 8 years ago
Here are some patches that I just posted for review which
were part of implementing this, but were more along the
lines of cleanup and didn't warrant any other sub-issues:
[PATCH 02/20] libceph: compute incoming bytes once
[PATCH 03/20] libceph: define osd data initialization helpers
[PATCH 04/20] libceph: define a few more helpers
[PATCH 05/20] libceph: define ceph_osd_data_length()
[PATCH 06/20] libceph: a few more osd data cleanups
[PATCH 09/20] libceph: rename data out field in osd request op
[PATCH 15/20] libceph: format class info at init time
(These two probably belong together, and the last one
especially probably deserves its own issue. Oh well.)
#16 Updated by Alex Elder almost 8 years ago
- Status changed from In Progress to Fix Under Review
The following patches have been posted for review:
[PATCH 1/6] libceph: record bio length
[PATCH 2/6] libceph: move cursor into message
[PATCH 3/6] libceph: have cursor point to data
[PATCH 4/6] libceph: replace message data pointer with list
[PATCH 5/6] libceph: implement multiple data items in a message
#17 Updated by Alex Elder almost 8 years ago
Hopefully this will be it.
I have one more patch, which adds a second outgoing data item
for a osd class method CALL op, and which will resolve this:
It works in my UML environment but not on "real" Linux. It's
possible the problem is with the messenger code, so I mention
it here. But it not, I think that's it!
#18 Updated by Alex Elder almost 8 years ago
I updated one of the patches posted:
[PATCH 5/6, v2] libceph: implement multiple data items in a message
Here's my explanation:
I found a problem in the first version of this.
ceph_msg_data_cursor_init() was mistakenly initializing
total_resid a second time, to the wrong value. We want
to use the passed-in length, not the length of the data
available. For write requests they're the same, but
read requests can be short, so this was wrong.
#19 Updated by Alex Elder almost 8 years ago
- Status changed from Fix Under Review to Resolved
The following have been committed to the "testing" branch
of the ceph-client git repository:
b0ae840 libceph: compute incoming bytes once
4046404 libceph: define osd data initialization helpers
35a8106 libceph: define a few more helpers
9b61fa2 libceph: define ceph_osd_data_length()
e860930 libceph: a few more osd data cleanups
ef6fba1 libceph: rename data out field in osd request op
5a735d2 libceph: format class info at init time
81fe247 libceph: record bio length
8267c68 libceph: move cursor into message
c4121fa libceph: have cursor point to data
076ae53 libceph: replace message data pointer with list
ccba6d9 libceph: implement multiple data items in a message