Feature #2770
krbd: define tasks to add osd_client compound class op support
History
#1 Updated by Anonymous over 11 years ago
- Story points set to 5
- Position deleted (26)
- Position set to 26
#2 Updated by Josh Durgin about 11 years ago
- Position deleted (61)
- Position set to 1
#3 Updated by Ian Colle about 11 years ago
- Target version set to v0.58
- Position deleted (1)
- Position set to 1
#4 Updated by Ian Colle about 11 years ago
- Assignee set to Alex Elder
#5 Updated by Alex Elder about 11 years ago
- Subject changed from krbd: refactor osd_client to improve compound class ops to krbd: define tasks to add osd_client compound class op support
- Status changed from New to In Progress
At our sprint planning meeting we discussed this. The task
was too large and too poorly understood to provide a meaningful
estimate of the time required, so I was going to spend some time
looking at the code and creating smaller subsidiary tasks
based on what I found.
Those specifics of the plan were not captured very well here,
so this is an attempt to clarify them.
I began working on this Friday, and am finally marking it
"in progress" now.
#6 Updated by Alex Elder about 11 years ago
I thought I'd start with a statement of the problem to be solved. In going through that exercise I came up with a way to do things that's easy but not beautiful, and I'll describe that separately. Unfortunately, I also learned in the process that while I thought copyup was an osd op, it is in fact implemented as an osd method call. So I need to re-think this a bit more. In any case, I'm posting the following just to preserve what I've been looking at today.

---

In order to support a write to a layered rbd image we need to be able to supply a single osd request which contains two osd operations ("ops"), each of which carries a data payload. Such a request will complete both (or all) ops within it transactionally, meaning they either all complete successfully or none complete successfully, and no external activity (i.e., other concurrent requests) will affect the outcome.

This is needed to support combining a copy-up operation with a write operation for an "rbd object" (which holds the data backing an rbd image) of a layered rbd image (a "clone"). This is done the first time an rbd object is written to in a clone image. The rbd client must only write to rbd objects in a clone if they already exist. If one does not exist, the data covering that entire object must be copied from the clone's parent image; after this, the original write can proceed.

Reading this data from the parent and then writing it is the rbd client's responsibility. It finally supplies to the target osd the data from the parent along with the original data in a single request. These two writes are the ones that need to be done transactionally.

The data from the parent image is supplied first, as the data for a copyup operation. The original write request is supplied as the second operation in the request. The copyup operation is a conditional write; the osd (server) will write the data if it (still) does not exist, otherwise it will keep the existing data as-is. Then the write operation will be applied on top of that.

---

The kernel rbd client uses the following interface routines presented by the osd client:

ceph_osdc_alloc_request()
    Allocates and initializes a ceph_osd_request structure capable of holding a specified number of ops (1). The request will have preallocated ceph_msg structures for messages to hold the request and its response. The result is a reference counted object.

ceph_osdc_build_request()
    Fills in the osd request's request message using information supplied by arguments, as well as some information in the osd request. An array of osd operations is supplied, and those are encoded into an array in the request message. Each osd op includes certain information, and for some of them, additional data is appended to the "trail" of the message, which is a ceph_pagelist abstraction.

ceph_osdc_start_request()
    This transfers some final information from the osd request into its request message, "maps" the request (which determines exactly where to send it), and supplies it to the messenger to get it sent.

osd_req->r_callback
    This is a callback function supplied with the osd request. When a response from a submitted request is received, the osd client calls this function, supplying the original osd request and the response message (which I believe will always be the reply message allocated when the osd request was created).

ceph_osdc_put_request()
    This drops a reference to an osd request, which is used to free it when it's complete.
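To make the shape of that interface concrete, here is a minimal stand-alone sketch in C of the request lifecycle (allocate, build, start, completion callback, put). All of the types and signatures below are invented stand-ins for illustration only; the real kernel routines take many more arguments and operate on ceph_osd_client and ceph_msg structures.

    #include <stdio.h>
    #include <stdlib.h>

    /* Invented stand-ins -- not the real libceph structures. */
    struct model_osd_op {
            int opcode;
            size_t payload_len;
    };

    struct model_osd_request {
            int refcount;
            unsigned int num_ops;
            struct model_osd_op *ops;
            void (*r_callback)(struct model_osd_request *req, int result);
    };

    /* Like ceph_osdc_alloc_request(): a refcounted request sized for num_ops ops. */
    static struct model_osd_request *model_alloc_request(unsigned int num_ops)
    {
            struct model_osd_request *req = calloc(1, sizeof(*req));

            if (!req)
                    return NULL;
            req->ops = calloc(num_ops, sizeof(*req->ops));
            if (!req->ops) {
                    free(req);
                    return NULL;
            }
            req->num_ops = num_ops;
            req->refcount = 1;
            return req;
    }

    /* Like ceph_osdc_build_request(): encode the supplied op array into the request. */
    static void model_build_request(struct model_osd_request *req,
                                    const struct model_osd_op *ops)
    {
            unsigned int i;

            for (i = 0; i < req->num_ops; i++)
                    req->ops[i] = ops[i];
    }

    /* Like ceph_osdc_start_request(): map and send; here we just "complete" it. */
    static void model_start_request(struct model_osd_request *req)
    {
            if (req->r_callback)
                    req->r_callback(req, 0);
    }

    /* Like ceph_osdc_put_request(): drop a reference, freeing on the last one. */
    static void model_put_request(struct model_osd_request *req)
    {
            if (--req->refcount == 0) {
                    free(req->ops);
                    free(req);
            }
    }

    static void done(struct model_osd_request *req, int result)
    {
            printf("request with %u op(s) completed: %d\n", req->num_ops, result);
    }

    int main(void)
    {
            /* A layered write would need two ops in one request: copyup, then write. */
            struct model_osd_op ops[2] = {
                    { .opcode = 1, .payload_len = 4 * 1024 * 1024 }, /* copyup method call */
                    { .opcode = 2, .payload_len = 512 },             /* the original write */
            };
            struct model_osd_request *req = model_alloc_request(2);

            if (!req)
                    return 1;
            req->r_callback = done;
            model_build_request(req, ops);
            model_start_request(req);
            model_put_request(req);
            return 0;
    }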
There are a few other routines related to events and watch requests and notifications, but they're not important for the current discussion.

---

Currently, all requests issued by the kernel rbd client to an osd contain a single osd op. That is, the array of ops supplied to ceph_osdc_build_request() contains a single entry. Except for the watch/notify ops, the kernel rbd client issues three op types to osds:

CEPH_OSD_OP_CALL
    This is a class method call, basically an RPC. The rbd client supplies the name of a class and a method, along with other input data of arbitrary length to be sent to the osd. The osd client encodes all of these in the request message's "trail" portion (a ceph_pagelist). The rbd client also supplies a page vector (an array of page pointers), whose pages are used to receive data returned by the method.

CEPH_OSD_OP_WRITE
    The only objects written by the kernel osd client are rbd image data objects. These are always written using osd requests containing bio lists as the source of the data to be sent.

CEPH_OSD_OP_READ
    For version 1 rbd images, the kernel osd client reads the image's header object using a READ op, supplying an array of pages into which the received data should be placed. Otherwise, the only (and majority of) reads are performed on rbd image data objects, using bio lists to indicate where received data should be placed.

---

What the rbd client needs to be able to do for layered writes is issue a single osd request--which is encoded in a single ceph message--containing both a "copyup" operation (and its associated full rbd object data) and a write operation (with anywhere from 1 byte up to a full rbd object of data). The interface to the kernel osd client used by the kernel rbd client can be extended to support these two ops. However, the data for the two ops comes from two distinct places, and this doesn't necessarily match the way the messenger handles data.

The kernel messenger allows data for a message to be sent in one of three ways: using a single array of pages (all "full" except possibly the first and last); using a single list of pages (all "full" except possibly the last); or using a single list of Linux bio structures.

The natural way for the existing messenger code to handle multiple sources of data to be written would be to concatenate those sources using whichever of the three ways is used for a given message. I.e., have all ops use the same array of pages with their data concatenated; or concatenated into the same page list; or concatenated into the same bio list.

As mentioned above, the kernel rbd client currently only uses bios for writes. And while concatenating them for the messenger is possible--and maybe even efficient--it is unnatural and I think dangerous to abuse the bio structures this way. The page list interface is designed to do this sort of concatenation easily, but it allocates new pages and copies data into them for everything sent this way. This is not acceptable for the rbd I/O path. The page array interface could similarly be used to hold the data for both ops, but that too would require the original write request (described in bio structures) to be copied into new pages before providing it to the messenger.

---

The ceph messenger manages sending messages for its clients, and, when data arrives on a connection, determining which client it is destined for and allocating a message to receive the data and pass it along.
A ceph message is made up of several distinct components, which are sent over the wire (little endian byte order) in this order:

message tag (CEPH_MSGR_TAG_MSG = 7)
    This one-byte value concisely indicates that a ceph message follows.

ceph message header
    This fixed-size structure describes the message, including the lengths of its front, middle, and data portions. The header ends with a 32-bit crc computed over everything in this structure that precedes it. A sequence number is included to ensure in-order delivery, allowing duplicate messages to be ignored. (This is needed because a ceph connection can span multiple TCP connections.)

front
    This is effectively the messenger client's header field. Its length is arbitrary (defined in the ceph message header). For messages sent to the osd client, this will be comprised of a ceph_osd_request_head structure. The messenger allocates this when a new ceph_msg structure is created using ceph_msg_new().

middle
    This is an optional field (i.e., possibly zero-length). It is a single block of arbitrary-length data, represented as a ceph_buffer structure before it hits the wire. The rbd client currently does not use this field. It is used by the file system to send extended attributes to the MDS.

data
    Also optional, but if present it can be represented in one of several ways before it hits the wire. The amount of data sent is recorded in the ceph message header's data_len field (reduced by the length of the trail, described next). Whichever of the following is a non-null pointer (checked in this order) is used as the source of data to send:
    - An array of ceph_msg->nr_pages page pointers, represented by ceph_msg->pages.
    - A list of pages, represented by the ceph_pagelist referred to by ceph_msg->pagelist.
    - A Linux bio list, referred to by msg->bio.
    - A zero page otherwise (sent as many times as needed).

trail
    Logically, the trail is part of the message data. It is only present if the data field is non-empty, and its length is accounted for in data_len. It is treated separately from the rest of the data, however, and is always represented as a ceph_pagelist before it gets sent over the wire. It seems to be present to allow a pagelist to be used for sending data, while allowing a page array to be used for receiving it.

footer
    This is a small fixed-size structure containing CRC-32 calculations over the front, middle, and data portions of the preceding message. It also contains a flags field, which indicates whether the message data CRC is valid (the message metadata CRCs are always used), and a non-zero bit indicating the message was complete. (The latter is to allow for messages to be aborted before they hit the wire--guaranteeing that zeroes received on the other end will cause the message to be rejected.)

It is the data portion of the message that could be problematic for supporting multiple osd ops containing data.

A received ceph message will of course have the same structure described above. It goes like this:

message tag (CEPH_MSGR_TAG_MSG = 7)
    If the first byte indicates a message follows, then we'll continue receiving more data as described below.

ceph message header
    A ceph connection data structure includes an "in_hdr" field used to hold this incoming data. If its crc is bad, the message is dropped and the connection is reset. The lengths of the front, middle, and data sections of the message that follows are extracted (and if their values are out of supported range, the connection is reset).
    The sequence number is checked; duplicate messages are dropped and missing messages lead to connection reset. Finally a message is allocated (for the osd client, this is done by looking up the reply message allocated when the osd request was first created), and the already-received header is copied into it. This message is used to receive the remaining message data.

front
    The front portion is read into the message's allocated front buffer, for as many bytes as is indicated in the received message header.

middle
    If the message has a middle portion allocated, it is filled with the number of bytes defined in the received message header.

data
    Data is read into pages set aside to receive it. For receiving, only two mechanisms are available--a page array, or a bio list.

footer
    Finally the footer is read into the allocated message. If any CRCs aren't what was expected, the message is dropped and the connection is reset.

A message received successfully is passed to the intended connection owner's dispatch() method, which for the osd client will be the function net/ceph/osd_client.c:dispatch().

For now the receive path is fine; we have no current plans to require multiple ops receiving data. (It may be that solving the send side generically could be used for receives, however.)
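To summarize the send-side constraint in code: the messenger consults exactly one data-source pointer per message, in the order described above. A simplified stand-in (not the real ceph_msg structure) might look like this:

    #include <stddef.h>

    /* Invented stand-in for illustration; the real ceph_msg holds a page
     * vector, a ceph_pagelist, a bio list and a separate "trail" pagelist. */
    struct model_msg {
            void *pages;            /* page array, or NULL */
            void *pagelist;         /* list of pages, or NULL */
            void *bio;              /* Linux bio list, or NULL */
            void *trail;            /* pagelist, always sent last */
            size_t data_len;        /* total data length, trail included */
    };

    /* Pick the single source of outgoing (non-trail) data, in the order the
     * messenger checks: page array, then pagelist, then bio, then zero pages. */
    static const char *model_pick_data_source(const struct model_msg *msg)
    {
            if (msg->pages)
                    return "page array";
            if (msg->pagelist)
                    return "pagelist";
            if (msg->bio)
                    return "bio list";
            return "zero page";
    }

Because only one of these pointers is used per message, two ops whose data lives in different places (a bio list for the write, pages for the copyup payload) have no natural home; hence the unattractive concatenation options discussed above.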
#7 Updated by Alex Elder about 11 years ago
Taking all of the above into account, I came up with a possible solution. I don't like its aesthetics, but I think it could work for this specific purpose.

The kernel rbd client doesn't use the "middle" portion of a message. That portion is allowed to be any length, as long as it's a contiguous block of data. It is sent before the data portion, and the copyup operation might be handled as a special case that looks there for its data. So if we had a single virtually contiguous buffer containing the data read from the parent image, we could supply that to the ceph messenger as a ceph buffer representing the middle portion of the message.

But as I mentioned in my last post, I now know that the "copyup" operation is implemented as a method call, so I need to look a little more closely at how those are handled and whether that can be expanded to handle either bios or buffers. So I'll do that next.
#8 Updated by Alex Elder about 11 years ago
The way the osd client handles an object class method right now
assumes that outbound data (headed from the client to the osd)
is added to the trail portion of the message. This involves
copying the data into pages in the trail pagelist, adding new
pages as necessary.
For the huge data transfers we're doing here, that won't do,
and I've created http://tracker.ceph.com/issues/4104 to suggest
allowing a page array to be supplied and used instead.
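For context, the reason the trail is unsuitable is that appending to a pagelist always allocates fresh pages and copies the caller's bytes into them, roughly as in the invented stand-in below (the kernel's ceph_pagelist_append() does the real work; this model only illustrates the copy cost, not the kernel implementation):

    #include <stdlib.h>
    #include <string.h>

    #define MODEL_PAGE_SIZE 4096

    /* Copy "len" bytes into newly allocated pages. Fine for a small CALL
     * payload; prohibitive for a full multi-megabyte rbd object sitting
     * on the write path. */
    static int model_pagelist_append(char *pages[], size_t *npages,
                                     const void *buf, size_t len)
    {
            const char *src = buf;

            while (len) {
                    size_t chunk = len < MODEL_PAGE_SIZE ? len : MODEL_PAGE_SIZE;
                    char *pg = malloc(MODEL_PAGE_SIZE);

                    if (!pg)
                            return -1;
                    memcpy(pg, src, chunk);         /* the copy we want to avoid */
                    pages[(*npages)++] = pg;
                    src += chunk;
                    len -= chunk;
            }
            return 0;
    }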
There is another problem with this, but it's not new. The trail
by definition is the end of the data portion of the message. This
imposes ordering constraints on the ops that are contained in
the op array in an osd request. Specifically, a WRITE op cannot
follow a CALL, because the data for the WRITE will get sent
at the beginning of the data portion of the message, while the
data for the CALL will be in the trail, necessarily following
that. There are other restrictions, e.g. you couldn't put an
XATTR or NOTIFY operation after a WRITE in the same osd request
(if one ever wanted to do that).
As I said, this is an existing problem, a consequence of creating
this notion of "trailing" data.
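Schematically, for an osd request whose op array is [CALL, WRITE], the data portion of the message goes out in the opposite order from the ops:

    op array in the request:   [ op0 = CALL (copyup)      ][ op1 = WRITE             ]
    data portion on the wire:  [ WRITE payload (bio list) ][ trail = CALL payload    ]

The osd would see the CALL's input data after the WRITE's, even though the CALL is the first op.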
This CALL-followed-by-WRITE ordering that doesn't work is
precisely what needs to occur for a layered write. Fixing it
may require changing how more than just the CALL operation is
represented in the osd client.
#9 Updated by Alex Elder about 11 years ago
After the walk through described above, I spent a little more
time thinking about the messenger aspect of allowing multiple
osd ops to provide (or receive) data in a single message.
The need to do this (or perhaps the need to both send data
and receive data for the single class method op) seems to
have been the motivation for defining the "trail" portion of
a message, but that mechanism won't work for what we need
here.
So as expected, to support the changes needed here, the
messenger needs to have a way to provide the data for a
compound class op, which is the subject of issue 3761
and I believe a separate task as well.
It wasn't clear how to proceed with that though, so I spent
some time yesterday prototyping some changes to the messenger.
I updated http://tracker.ceph.com/issues/3761 with some
information about what I was trying to do.
I have some new, smaller tasks to define that are more
closely associated with the messenger than with the osd client.
I'm going to create them, but I'm not sure I'll get them
connected correctly within Redmine.
#10 Updated by Sage Weil about 11 years ago
- Status changed from In Progress to Resolved
#11 Updated by Sage Weil about 11 years ago
- Status changed from Resolved to In Progress
- Target version changed from v0.58 to v0.59
#12 Updated by Ian Colle about 11 years ago
- Target version changed from v0.59 to v0.60
#13 Updated by Ian Colle about 11 years ago
- Target version changed from v0.60 to v0.61 - Cuttlefish
#14 Updated by Alex Elder almost 11 years ago
OK, here are my plans for finishing this up.
First, http://tracker.ceph.com/issues/3861 defines work that
consolidates the code that builds osd operations. That work
is out for review now. This issue depends on it, so that the
code that operates on ops is contained in one place.
Next, once http://tracker.ceph.com/issues/3761 is complete,
a message will be able to have multiple distinct sources of
data. That goes a long way toward what's needed here, so
finishing it is also a prerequisite.
Once that's in place it'll be a matter of "adding" data to a
request rather than "setting" it when building an osd op.
That may be all there is to completing this issue--mostly
completing other prerequisite work. I'll see how things look
when I get this far, but I think that will be the point at
which I close this issue.
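As a rough illustration of that direction (invented stand-in types; the actual messenger changes are tracked in the issues referenced here), "adding" data means a message carries a list of data items, one per op with a payload, rather than a single pages/pagelist/bio pointer:

    #include <stddef.h>

    enum model_data_type { MODEL_DATA_PAGES, MODEL_DATA_PAGELIST, MODEL_DATA_BIO };

    struct model_msg_data {
            struct model_msg_data *next;
            enum model_data_type type;
            void *source;                   /* pages, pagelist, or bio list */
            size_t length;
    };

    struct model_msg {
            struct model_msg_data *data_items;      /* sent in list order */
            size_t data_len;
    };

    /* "Add" rather than "set": each call appends one more source of data,
     * preserving the order of the ops that own the payloads. */
    static void model_msg_add_data(struct model_msg *msg, struct model_msg_data *item)
    {
            struct model_msg_data **p = &msg->data_items;

            while (*p)
                    p = &(*p)->next;
            item->next = NULL;
            *p = item;
            msg->data_len += item->length;
    }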
Finally, http://tracker.ceph.com/issues/4104 defines
how a page array needs to be used as the data for
an osd class operation. That doesn't strictly need
to be completed before this issue is considered complete,
but it is required for our final use of this feature--to
supply multiple blobs of data for a single request.
#15 Updated by Sage Weil almost 11 years ago
- Target version changed from v0.61 - Cuttlefish to v0.62a
#16 Updated by Alex Elder almost 11 years ago
- Status changed from In Progress to Fix Under Review
The following patch has been posted for review:
[PATCH 6/6] libceph: add, don't set data for a message
#17 Updated by Alex Elder almost 11 years ago
- Status changed from Fix Under Review to Resolved
The following has been committed to the "testing" branch
of the ceph-client git repository:
436b0c0 libceph: add, don't set data for a message