Feature #2770


krbd: define tasks to add osd_client compound class op support

Added by Sage Weil almost 12 years ago. Updated about 11 years ago.

Target version:
% Done:


Affected Versions:
Pull request ID:

Subtasks 3 (0 open, 3 closed)

Subtask #4126: kernel osd client: kill off some dead code (Resolved, Alex Elder, 02/14/2013)

Subtask #4127: kernel osd client: clearly separate read and write data buffers (Resolved, Alex Elder, 02/25/2013)

Subtask #4263: libceph: clearly abstract message data operations (Resolved, Alex Elder, 02/25/2013)


Related issues 2 (0 open, 2 closed)

Related to rbd - Feature #4104: osd_client: support passing page array as data for CALL op (Resolved, Alex Elder, 02/12/2013)

Has duplicate rbd - Feature #2850: libceph: support multi-operation transactions (Duplicate, Alex Elder, 07/26/2012)

Actions #1

Updated by Anonymous over 11 years ago

  • Story points set to 5
  • Position deleted (26)
  • Position set to 26
Actions #2

Updated by Josh Durgin about 11 years ago

  • Position deleted (61)
  • Position set to 1
Actions #3

Updated by Ian Colle about 11 years ago

  • Target version set to v0.58
  • Position deleted (1)
  • Position set to 1
Actions #4

Updated by Ian Colle about 11 years ago

  • Assignee set to Alex Elder
Actions #5

Updated by Alex Elder about 11 years ago

  • Subject changed from krbd: refactor osd_client to improve compound class ops to krbd: define tasks to add osd_client compound class op support
  • Status changed from New to In Progress

At our sprint planning meeting we discussed this. The task
was too large and unknown to provide a meaningful estimate
of the time required. So I was going to spend some time
looking at the code and creating smaller subsidiary tasks
based on what I found.

These specifics about the plan were not really captured
very well here. So this is an attempt to clarify that.

I began working on this Friday, and am finally marking it
"in progress" now.

Actions #6

Updated by Alex Elder about 11 years ago

I thought I'd start with a statement of the problem to be solved.

In going through that exercise I came up with a way to do
things that's easy but not beautiful, and I'll describe it below.

Unfortunately, I also learned in the process that while I thought
copyup was an osd op, it is in fact implemented as an osd method
call.  So I need to re-think this a bit more.

In any case, I'm posting the following just to preserve what
I've been looking at today.


In order to support a write to a layered rbd image we need to be
able to supply a single osd request which contains two osd
operations ("ops"), each of which carries a data payload.  Such a
request will complete both (or all) ops within it transactionally,
meaning they either all complete successfully or none complete
successfully, and no external activity (i.e., other concurrent
requests) will affect the outcome.

This is needed to support combining a copy-up operation along with a
write operation for an "rbd object" (which holds the data backing an
rbd image) of a layered rbd image (a "clone").  This is done the
first time an rbd object is written to in a clone image.  The rbd
client must only write to rbd objects in a clone if they already
exist.  If one does not exist, the data covering that entire object
must be copied from the image's parent image; after this, the
original write can proceed.  Reading this data from the parent and
then writing it is the rbd client's responsibility.  It finally
supplies to the target osd the data from the parent along with the
original data in a single request.

These two writes are the ones that need to be done transactionally.
The data from the parent image is supplied first as the data for a
copyup operation.  The original write request is supplied as the
second operation in the request.  The copyup operation is a
conditional write; the osd (server) will write the data if it
(still) does not exist, otherwise it will keep the existing data
as-is.  Then the write operation will be applied on top of that.
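The conditional-write semantics just described can be modeled in a few lines. This is a conceptual sketch in Python, used purely as illustration; the function name, the dict-based object store, and all parameter names are invented here, not the osd client API:

```python
# Model of a two-op (copyup, write) compound request applied atomically.
# Op 1 (copyup) writes the parent data only if the object does not yet
# exist; op 2 (write) is then applied on top of whatever the object holds.

def apply_compound_request(store, obj_name, copyup_data, write_offset, write_data):
    # Op 1: copyup -- a conditional full-object write; it only takes
    # effect if the object does not (still) exist.
    if obj_name not in store:
        store[obj_name] = bytearray(copyup_data)
    # Op 2: write -- applied on top of the object's current contents.
    obj = store[obj_name]
    end = write_offset + len(write_data)
    if len(obj) < end:
        obj.extend(b"\0" * (end - len(obj)))
    obj[write_offset:end] = write_data
    return bytes(obj)
```

The point of the model is the ordering: the write always lands on top of the copyup's result, whether or not the copyup actually wrote anything.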


The kernel rbd client uses the following interface routines
presented by the osd client:

        Allocates and initializes a ceph_osd_request structure
        capable of holding a specified number of ops (1).  The
        request will have preallocated ceph_msg structures for
        messages to hold the request and its response.  The
        result is a reference counted object.

        Fills in the osd request's request message using information
        supplied by arguments, as well as some information in the
        osd request.  An array of osd operations is supplied, and
        those are encoded into an array in the request message.
        Each osd op includes certain information, and for some
        of them, additional data is appended to the "trail" of the
        message, which is a ceph_pagelist abstraction.

        This transfers some final information from the osd request
        into its request message, "maps" the request (which
        determines exactly where to send it), and supplies it to the
        messenger to get it sent.

        This is a callback function supplied with the osd request.
        When a response from a submitted request is received, the
        osd client calls this function, supplying the original
        osd request and the response message (which I believe will
        always be the reply message allocated when the osd request
        was created).

        This drops a reference to an osd request, which is used
        to free it when it's complete.
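The routine names did not survive in the list above, so here is a rough Python model of the lifecycle being described (allocate, then build, then start, then callback on reply, then put). All class and method names in this sketch are illustrative inventions; of the real interface, only ceph_osdc_build_request() is confirmed by name later in this issue:

```python
# Illustrative model of the osd request lifecycle: a reference-counted
# request object that is built from an array of ops, started (handed to
# a fake "osd"), completed via a callback, and released with put().

class OsdRequest:
    def __init__(self, num_ops, callback):
        self.ops = [None] * num_ops   # preallocated op slots
        self.callback = callback      # invoked when the reply arrives
        self.refcount = 1             # the object is reference counted
        self.message = None

    def build(self, ops):
        # Encode the supplied ops into the request message.
        assert len(ops) == len(self.ops)
        self.ops = list(ops)
        self.message = {"ops": self.ops}

    def start(self, osd):
        # "Map" the request to an osd and hand it off; here the fake
        # osd replies synchronously and we invoke the callback.
        reply = osd.handle(self.message)
        self.callback(self, reply)

    def put(self):
        # Drop a reference; the request is freed when it reaches zero.
        self.refcount -= 1
        return self.refcount == 0
```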

There are a few other routines related to events, watch requests,
and notifications, but they're not important for the current
discussion.

Currently, all requests issued by the kernel rbd client to an osd
contain a single osd op.  That is, the array of ops supplied to
ceph_osdc_build_request() contains a single entry.

Except for the watch/notify ops, the kernel rbd client issues
three op types to osds:

    CALL
        This is a class method call, basically an RPC.  The rbd
        client supplies the name of a class and a method, along with
        other input data of arbitrary length to be sent to the osd.
        The osd client encodes all of these in the request message's
        "trail" portion (a ceph_pagelist).  The rbd client also
        supplies a page vector (an array of page pointers), whose
        pages are used to receive data returned by the method.

    WRITE
        The only objects written by the kernel osd client are rbd
        image data objects.  These are always written using osd
        requests containing bio lists as a source of the data to be
        written.

    READ
        For version 1 rbd images, the kernel osd client reads the
        image's header object using a READ op, supplying an array of
        pages into which the received data should be placed.
        All other reads (the vast majority) are performed on
        rbd image data objects, using bio lists to indicate where
        received data should be placed.

What the rbd client needs to be able to do for layered writes is
issue a single osd request--which is encoded in a single ceph
message--containing both a "copyup" operation (and its associated
full rbd object data) and a write operation (with 1 byte up to a
full rbd object of data).

The interface to the kernel osd client used by the kernel rbd
client can be extended to support these two ops.  However, the data
from the two ops comes from two distinct places, and this doesn't
necessarily match the way the messenger handles data.

The kernel messenger allows data for a message to be sent in one
of three ways:  using a single array of pages (all "full" except
possibly the first and last); using a single list of pages (all
"full" except possibly the last); or using a single list of Linux
bio structures.  

The natural way for the existing messenger code to handle having
multiple sources of data to be written would be to concatenate
those sources using whichever of the three ways is used
for a given message.  I.e., have all ops use the same array of
pages with their data concatenated; or concatenated into the
same page list; or concatenated into the same bio list.

As mentioned above, the kernel rbd client now only uses bios for
writes.  And while concatenating them for the messenger is
possible--and maybe even efficient--it is unnatural and I think
dangerous to abuse the bio structures this way.

The page list interface is designed to do this sort of concatenation
easily.  But it allocates new pages and copies data into them for
everything sent this way.  This is not acceptable for the rbd I/O
path.

The page array interface could similarly be used to hold both the
original data and the data to be written.  But that too would
require the original write request (described in bio structures) to
be copied into new pages before providing it to the messenger.


The ceph messenger manages sending messages for its clients, and
when data arrives on a connection, determining which client it is
destined for and allocating a message to receive the data and pass
it along.

A ceph message is made up of several distinct components, which are
sent over the wire (little endian byte order) in this order:

    message tag (CEPH_MSGR_TAG_MSG = 7)
        This one-byte value concisely indicates that a ceph message
        follows.

    ceph message header
        This fixed-size structure describes the message, including
        the lengths of its front, middle, and data portions.  The
        header ends with a 32-bit crc computed over everything in
        this structure that precedes it.  A sequence number is
        included to ensure in-order delivery and allow duplicate
        messages to be ignored.  (This is needed because a ceph
        connection can span multiple TCP connections.)

    message front
        This is effectively the messenger client's header field.
        Its length is arbitrary (defined in the ceph message
        header).  For messages sent to the osd client, this will
        consist of a ceph_osd_request_head structure.  The
        messenger allocates this when a new ceph_msg structure
        is created using ceph_msg_new().

    message middle
        This is an optional field (i.e., possibly zero-length).
        It is a single block of arbitrary-length data, represented
        as a ceph_buffer structure before it hits the wire.  The rbd
        client currently does not use this field.  It is used by
        the file system to send extended attributes to the MDS.

    message data
        Also optional, but if present it can be represented in one
        of several ways before it hits the wire.  The amount of data
        sent is recorded in the ceph message header's data_len field
        (reduced by the length of the trail, described next).
        Whichever of the following is a non-null pointer (checked in
        this order) is used as the source of data to send:
          - An array of ceph_msg->nr_pages page pointers,
            represented by ceph_msg->pages.
          - A list of pages, represented by the ceph_pagelist
            referred to by ceph_msg->pagelist.
          - A Linux bio list, referred to by msg->bio.
          - A zero page otherwise (as many times as needed).

    message trail
        Logically, the trail is part of the message data.  It is
        only present if the data field is non-empty, and its length
        is accounted for in data_len.  It is treated separately from
        the rest of the data, however, and is always represented as
        a ceph_pagelist before it gets sent over the wire.  It seems
        to be present to allow a pagelist to be used for sending
        data, while allowing a page array to be used for receiving


    message footer
        This is a small fixed-size structure containing CRC-32
        calculations over the front, middle, and data portions of
        the preceding message.  It also contains a flags field,
        which indicates whether the message data CRC is valid (the
        message metadata CRCs are always used), and a non-zero bit
        indicating the message was complete.  (The latter is to
        allow for messages to be aborted before they hit the
        wire--guaranteeing that zeroes received on the other end
        will cause the message to be rejected.)

It is the data portion of the message that could be problematic for
supporting multiple osd ops containing data.
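The pages / pagelist / bio precedence described for the data section can be sketched as a simple selection function. This is a conceptual Python model only, with dictionary keys standing in for the ceph_msg fields named in the text:

```python
def pick_data_source(msg):
    """Model of the send-side data source selection: whichever of
    pages / pagelist / bio is non-null (checked in this order)
    supplies the data; otherwise a zero page is sent as many times
    as needed.  Not kernel code; field names follow the description."""
    if msg.get("pages") is not None:
        return "pages"
    if msg.get("pagelist") is not None:
        return "pagelist"
    if msg.get("bio") is not None:
        return "bio"
    return "zero-page"
```

This single-winner check is exactly why two ops with data from distinct places don't fit: only one source can be attached to a message at a time.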

A received ceph message will of course have the same structure
described above.  It goes like this:

    message tag (CEPH_MSGR_TAG_MSG = 7)
        If the first byte indicates a message follows, then we'll
        continue receiving more data as described below.

    ceph message header
        A ceph connection data structure includes an "in_hdr" field
        used to hold this incoming data.  If its crc is bad, the
        message is dropped and the connection is reset.  The lengths
        of the front, middle, and data sections of the message that
        follows are extracted (and if their values are out of
        supported range, the connection is reset).  The sequence
        number is checked; duplicate messages are dropped and
        missing messages lead to connection reset.

        Finally a message is allocated (for the osd client, this is
        done by looking up the reply message allocated when the osd
        request was first created), and the already received header
        is copied into it.  This message is used to receive the
        remaining message data.

    message front
        The front portion is read into the message's allocated
        front buffer, for as many bytes as is indicated in the
        received message header.

    message middle
        If the message has a middle portion allocated, it is filled
        with the number of bytes defined in the received message
        header.

    message data
        Data is read into pages set aside to receive them.  For
        receiving, only two mechanisms are available--a page array,
        or a bio list.

    message footer
        Finally the footer is read into the allocated message.
        If any CRCs aren't what was expected the message is dropped
        and the connection is reset.

A message received successfully is passed to the intended
connection owner's dispatch() method, which for the osd client will
be the function: net/ceph/osd_client.c:dispatch()
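The framing and CRC placement described in the send and receive walk-throughs can be modeled compactly. The sketch below is a toy Python model, not the real wire format: the field layout is deliberately simplified (no sequence number, flags, or tag dispatch beyond the message tag), and only the CRC placement mirrors the text:

```python
import struct
import zlib

# Toy model of the framing: a one-byte tag (CEPH_MSGR_TAG_MSG = 7),
# a little-endian header whose final field is a CRC-32 over the header
# bytes that precede it, then the front/middle/data payloads, then a
# footer carrying CRC-32s of the front, middle, and data portions.

def frame_message(front, middle, data):
    tag = bytes([7])
    body = struct.pack("<III", len(front), len(middle), len(data))
    header = body + struct.pack("<I", zlib.crc32(body))
    footer = struct.pack("<III",
                         zlib.crc32(front),
                         zlib.crc32(middle),
                         zlib.crc32(data))
    return tag + header + front + middle + data + footer

def check_frame(buf):
    # Receive side: a bad header CRC means drop the message and reset
    # the connection; likewise for the payload CRCs in the footer.
    if buf[0] != 7:
        return False
    body, (hcrc,) = buf[1:13], struct.unpack("<I", buf[13:17])
    if zlib.crc32(body) != hcrc:
        return False
    flen, mlen, dlen = struct.unpack("<III", body)
    off = 17
    front = buf[off:off + flen]; off += flen
    middle = buf[off:off + mlen]; off += mlen
    data = buf[off:off + dlen]; off += dlen
    fcrc, mcrc, dcrc = struct.unpack("<III", buf[off:off + 12])
    return (zlib.crc32(front), zlib.crc32(middle), zlib.crc32(data)) == (fcrc, mcrc, dcrc)
```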

For now the receive path is fine; we have no current plans to
require multiple ops receiving data.  (It may be that solving the
send side generically could be used for receives, however.)
Actions #7

Updated by Alex Elder about 11 years ago

Taking all of the above into account, I came up with a possible
solution.  I don't like its aesthetics, but I think it could work
for this specific purpose.

The kernel rbd client doesn't use the "middle" portion of a message.
That portion is allowed to be any length, as long as it's a
contiguous block of data.  It is sent before the data portion,
and the copyup operation might be a special case that looks there
for its data.

So if we had a single virtually contiguous buffer containing the
data read from the parent image we could supply that to the ceph
messenger as a ceph buffer representing the middle portion of the
message.

But as I mentioned in my last post, I now know that the "copyup" 
operation is implemented as a method call, so I need to focus a
little more into how those are handled and whether that can be
expanded to handle either bios or buffers.

So I'll do that next.
Actions #8

Updated by Alex Elder about 11 years ago

The way the osd client handles an object class method right now
assumes that outbound data (headed from the client to the osd)
is added to the trail portion of the message. This involves
copying the data into pages in the trail pagelist, adding new
pages as necessary.

For the huge data transfers we're doing here that won't do,
and I've created a separate issue to suggest
allowing a page array to be supplied and used instead.

There is another problem with this, but it's not new. The trail
by definition is the end of the data portion of the message. This
imposes ordering constraints on the ops that are contained in
the op array in an osd request. Specifically, a WRITE op cannot
follow a CALL, because the data for the WRITE will get sent
at the beginning of the data portion of the message, while the
data for the CALL will be in the trail, necessarily following
that. There are other restrictions, e.g. you couldn't put an
XATTR or NOTIFY operation after a WRITE in the same osd request
(if one ever wanted to do that).
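The ordering constraint above amounts to a small predicate: once any trail-data op appears (CALL, and per the text also XATTR or NOTIFY), no main-data op (WRITE) may follow it, because the WRITE's data would be encoded before the trail on the wire while the op array says otherwise. A conceptual sketch, with op names as plain strings:

```python
# Ops whose outbound data is encoded into the message "trail".
TRAIL_DATA_OPS = {"CALL", "XATTR", "NOTIFY"}

def op_order_is_encodable(ops):
    """Return True if the on-wire data order can match the op order.
    A WRITE after a trail-data op cannot: its data would go at the
    start of the data portion, before the trail."""
    seen_trail_op = False
    for op in ops:
        if op in TRAIL_DATA_OPS:
            seen_trail_op = True
        elif op == "WRITE" and seen_trail_op:
            return False
    return True
```

Under this model a (copyup-as-CALL, WRITE) pair is exactly the rejected case, which is why the trail mechanism can't carry a layered write.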

As I said, this is an existing problem, a consequence of creating
this notion of "trailing" data.

A CALL followed by a WRITE (the case that doesn't work) is
precisely what a layered write requires.  Fixing this may
affect the way that more than just a CALL operation is
represented in the osd client.

Actions #9

Updated by Alex Elder about 11 years ago

After the walk through described above, I spent a little more
time thinking about the messenger aspect of allowing multiple
osd ops to provide (or receive) data in a single message.

The need to do this (or perhaps the need to both send data
and receive data for the single class method op) seems to
have been the motivation for defining the "trail" portion of
a message, but that mechanism won't work for what we need
here.

So as expected, to support the changes needed here, the
messenger needs to have a way to provide the data for a
compound class op, which is the subject of issue 3761
and I believe a separate task as well.

It wasn't clear how to proceed with that though, so I spent
some time yesterday prototyping some changes to the messenger.
I updated that issue with some
information about what I was trying to do.

I have some new smaller tasks to define that are more
closely associated with the messenger than the osd client
though. I'm going to create them, but I'm not sure I'll
get them connected correctly within redmine.

Actions #10

Updated by Sage Weil about 11 years ago

  • Status changed from In Progress to Resolved
Actions #11

Updated by Sage Weil about 11 years ago

  • Status changed from Resolved to In Progress
  • Target version changed from v0.58 to v0.59
Actions #12

Updated by Ian Colle about 11 years ago

  • Target version changed from v0.59 to v0.60
Actions #13

Updated by Ian Colle about 11 years ago

  • Target version changed from v0.60 to v0.61 - Cuttlefish
Actions #14

Updated by Alex Elder about 11 years ago

OK, here are my plans for finishing this up.

First, a subsidiary issue defines work that
consolidates the code that builds osd operations.  That work
is out for review now.  This issue depends on it so that
the code that operates on ops is contained in one place.

Next, once another issue is complete,
a message will be able to have multiple distinct sources of
data.  This goes a long way toward supporting this issue, so
finishing it is also a prerequisite.

Once that's in place it'll be a matter of "adding" data to a
request rather than "setting" it when building an osd op.
That may be all there is to completing this issue--mostly
completing other prerequisite work.  I'll see how things look
when I get this far, but I think that will be the point at
which I close this issue.
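The "adding" versus "setting" distinction amounts to a message accumulating a list of data items rather than holding a single data pointer per mechanism. A minimal sketch (all names invented for illustration, not the eventual kernel interface):

```python
# "Add, don't set": the message collects one data item per op that
# carries data, instead of exposing a single pages/pagelist/bio slot.

class Message:
    def __init__(self):
        self.data_items = []

    def add_data(self, kind, payload):
        # Each call appends another data source (e.g. "bio" for the
        # write payload, "pages" for the copyup payload).
        self.data_items.append((kind, payload))

def total_data_len(msg):
    # Corresponds to the data_len accounting in the message header.
    return sum(len(payload) for _kind, payload in msg.data_items)
```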

Finally, a third issue defines
how a page array needs to be used as the data for
an osd class operation. That doesn't strictly need
to be completed before this issue is considered complete,
but it is required for our final use of this feature--to
supply multiple blobs of data for a single request.

Actions #15

Updated by Sage Weil about 11 years ago

  • Target version changed from v0.61 - Cuttlefish to v0.62a
Actions #16

Updated by Alex Elder about 11 years ago

  • Status changed from In Progress to Fix Under Review

The following patch has been posted for review:

[PATCH 6/6] libceph: add, don't set data for a message

Actions #17

Updated by Alex Elder about 11 years ago

  • Status changed from Fix Under Review to Resolved

The following has been committed to the "testing" branch
of the ceph-client git repository:

436b0c0 libceph: add, don't set data for a message

