Osd - update Transaction encoding


Restructure ObjectStore::Transaction for better performance and flexibility. Enable some things that KeyFileStore would like to do to avoid journaling data in some cases.


  • Sage Weil (Red Hat)

Interested Parties

  • Guang Yang (Yahoo!)

Current Status

ObjectStore::Transaction encapsulates a set of changes made to the local ObjectStore (usually FileStore) as an atomic unit.
The current encoding looks something like this:
  1. write data X to object A in collection D
  2. set attribute Y on object A in collection D
  3. set other attribute Z on object A in collection D
  4. set attribute W on collection D
  5. set key/value U on object B in collection C
The main point is that object names (A, B) and collection names (C, D) are encoded again for each individual operation. When the transaction is applied, each step has to repeat a lookup (into the FDCache, these days) of the object and collection name (arbitrary strings), which is expensive.
The other problem is that sequences like
  1. clone object A in collection D to object B in collection D
  2. write data X to object A in collection D
  3. ...

are complicated events for the backend to replay when the objects are in an unknown state (due to a failure). We do a lot of ugly tricks setting xattrs and calling fsync() to ensure ordering and prevent, say, a replayed transaction just prior to the above sequence from re-cloning A to B and polluting it with data X (that was perhaps written just prior to the crash).

Detailed Description

Two basic proposals, followed by two smaller encoding ideas.
First, introduce a handle-based interface and encoding for ObjectStore::Transaction. Instead of
  1. t.write(coll, obj, offset, length, data)
  2. t.setxattr(coll, obj, name, value)
  3. t.setxattr(othercoll, otherobj, name2, value2)
we would instead do something like
  1. int h = t.get_object(coll, obj) // returns 0
  2. t.write(h, offset, length, data)
  3. t.setxattr(h, name, value)
  4. int h2 = t.get_object(othercoll, otherobj) // returns 1
  5. t.setxattr(h2, name2, value2)
The encoding would change accordingly. This means that on the backend the code can do a single lookup on object/collection and all operations will reference it directly (using nice, small integers that index into a short vector).
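The handle idea can be sketched roughly as follows. This is a minimal illustration, not the actual ObjectStore::Transaction code: TransactionSketch, its Op record, the opcodes, and count_lookups are hypothetical names, strings stand in for collection and object ids, and the length argument is dropped because it is implied by the data size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for ObjectStore::Transaction (illustrative only).
struct TransactionSketch {
  enum OpCode { OP_WRITE, OP_SETXATTR };  // hypothetical opcodes

  struct Op {
    OpCode code;
    int handle;        // index into `objects`
    uint64_t offset;   // used by writes only
    std::string name;  // attr name (setxattr only)
    std::string data;  // write payload or attr value
  };

  // Each (collection, object) pair is stored once; ops refer to it by index.
  std::vector<std::pair<std::string, std::string>> objects;
  std::map<std::pair<std::string, std::string>, int> index;
  std::vector<Op> ops;

  // Returns a small integer handle; repeated calls for the same pair
  // return the same handle.
  int get_object(const std::string& coll, const std::string& obj) {
    std::pair<std::string, std::string> key(coll, obj);
    auto it = index.find(key);
    if (it != index.end()) return it->second;
    int h = static_cast<int>(objects.size());
    objects.push_back(key);
    index.emplace(key, h);
    return h;
  }

  void write(int h, uint64_t offset, const std::string& data) {
    ops.push_back(Op{OP_WRITE, h, offset, "", data});
  }

  void setxattr(int h, const std::string& name, const std::string& value) {
    ops.push_back(Op{OP_SETXATTR, h, 0, name, value});
  }
};

// Backend side: count how many name lookups are needed when each handle
// is resolved once and then reused by every op that references it.
inline std::size_t count_lookups(const TransactionSketch& t) {
  std::size_t lookups = 0;
  std::vector<bool> resolved(t.objects.size(), false);
  for (const auto& op : t.ops) {
    if (!resolved[op.handle]) {
      resolved[op.handle] = true;
      ++lookups;
    }
  }
  return lookups;
}
```

With the five-step example from Current Status (three ops on object A, one on object B), this sketch needs two lookups instead of one per op.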
This will be an encoding change that is not backwards compatible. We will need to encode the old format when necessary for a mixed-version cluster.
For the code transition, we have two options:
  1. preserve the old API and implicitly do the opens in the Transaction. users can be switched to the new API over time. there will be no forcing function and it may take a while.
  2. update all callers to use the new API immediately. more work up front, but we get the full benefit immediately. may be prone to merge conflicts while this work is done.
Second, separate buffers out explicitly from operations that use them. For example, instead of
  1. t.write(coll, obj, off, len, data)
we would do
  1. int h = t.get_object(coll, obj); // returns 0
  2. int b = t.add_buffer(data); // returns 0
  3. t.write(h, off, len, b);
The advantage of doing this is on the backend. Without interpreting the various operations, there will be an explicit view of what data blobs are present and how big they are. It (KeyFileStore, in particular) can then do things like:
  • look, the buffer is big (say > 2M). let me write it to a separate fresh file and fsync that. i'll also annotate the transaction to indicate where I wrote it. when I journal it, I will skip writing the data portion twice.

That is, we can do some metadata-only journaling. The way things are currently structured, we would have to interpret each event in the transaction, and rewrite events with a new special op to indicate what we did. With this change, the apply code can interpret the annotation and act accordingly, with basically two behaviors: either the write has the data buffer explicitly, or it has an annotation indicating where the data is already on disk.
Note that in order to take advantage of this effectively on filesystems like XFS, we may need to indicate in the buffer metadata whether this is a fresh object (complete overwrite), in which case KeyFileStore can simply point the object metadata at a new backing file. Alternatively, we may make the metadata representation rich enough that it can reference different backing files for different ranges, in which case the heuristic could be as simple as whether the data buffer is big or not (and block aligned, perhaps). If not, we would just fall back to data journaling (as we would in general for small writes).
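A rough sketch of the buffer table and the large-buffer heuristic might look like this. Everything here is an assumption made for illustration: BufferSketch, BufferTableSketch, plan_journal, the "fresh-file-N" placeholder names, and the 2M threshold (taken from the "say > 2M" remark above). A real backend would actually write and fsync the data where the comment indicates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical buffer table entry (illustrative only).
struct BufferSketch {
  std::string data;      // payload as submitted by the caller
  bool external;         // true once the backend stored it outside the journal
  std::string location;  // annotation: where the backend put the data
};

struct BufferTableSketch {
  std::vector<BufferSketch> buffers;

  // Returns a small integer handle that ops can reference.
  int add_buffer(const std::string& data) {
    BufferSketch b;
    b.data = data;
    b.external = false;
    buffers.push_back(b);
    return static_cast<int>(buffers.size()) - 1;
  }
};

// Assumed threshold, following the "say > 2M" example in the text.
const std::size_t kLargeBuffer = 2u * 1024 * 1024;

// Walks the buffer table without looking at any op. Large buffers are
// annotated as stored externally; the return value is the number of data
// bytes that still have to be written to the journal.
inline std::size_t plan_journal(BufferTableSketch& t) {
  std::size_t journaled = 0;
  for (std::size_t i = 0; i < t.buffers.size(); ++i) {
    BufferSketch& b = t.buffers[i];
    if (b.data.size() > kLargeBuffer) {
      // A real backend would write b.data to a fresh file and fsync it here.
      b.external = true;
      b.location = "fresh-file-" + std::to_string(i);  // placeholder name
    } else {
      journaled += b.data.size();  // small write: fall back to data journaling
    }
  }
  return journaled;
}
```

The apply code then only has to distinguish two cases per buffer: data carried inline, or an annotation saying where the data already lives.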
Third, make the transaction encoding encapsulate each op with a length so that we can skip ops we don't understand.
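As an illustration of length-encapsulated ops, the sketch below frames each op as a 4-byte opcode, a 4-byte payload length, and the payload. The framing, the function names encode_op and decode_known, and the "opcode <= max_known" rule for deciding what is understood are all assumptions for the example, not the actual Ceph encoding. Host byte order is used for brevity; a real encoding would pin the endianness.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Append one op as: 4-byte opcode, 4-byte payload length, payload bytes.
inline void encode_op(std::vector<uint8_t>& out, uint32_t opcode,
                      const std::string& payload) {
  uint32_t len = static_cast<uint32_t>(payload.size());
  const uint8_t* p = reinterpret_cast<const uint8_t*>(&opcode);
  out.insert(out.end(), p, p + sizeof(opcode));
  p = reinterpret_cast<const uint8_t*>(&len);
  out.insert(out.end(), p, p + sizeof(len));
  out.insert(out.end(), payload.begin(), payload.end());
}

// Decode a stream of ops. Payloads of understood ops (opcode <= max_known)
// are collected; unknown ops are skipped using their length field.
// Returns false if the stream is truncated.
inline bool decode_known(const std::vector<uint8_t>& in, uint32_t max_known,
                         std::vector<std::string>& known,
                         std::size_t& skipped) {
  std::size_t pos = 0;
  skipped = 0;
  while (pos < in.size()) {
    if (in.size() - pos < 8) return false;  // incomplete header
    uint32_t opcode = 0;
    uint32_t len = 0;
    std::memcpy(&opcode, &in[pos], 4);
    std::memcpy(&len, &in[pos + 4], 4);
    pos += 8;
    if (in.size() - pos < len) return false;  // incomplete payload
    if (opcode <= max_known) {
      known.emplace_back(in.begin() + pos, in.begin() + pos + len);
    } else {
      ++skipped;  // op we do not understand: step over it
    }
    pos += len;
  }
  return true;
}
```

An older decoder can then step over an op added by a newer encoder instead of failing on the whole transaction.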
Fourth, possibly use a fixed-length struct for each op, since the variable-length bits (object, data) are mostly called out. If we use buffers for attr names and values (or just names, and keep them short) that might speed things up?
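One possible shape for such a fixed-length record is sketched below. The field layout, the names FixedOpSketch, kNone, append_op and read_op, and the 32-byte size are assumptions chosen for illustration; the proposal itself does not specify a layout. Object names and data are referenced by handle or buffer index, so no field is variable length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical fixed-length op record (illustrative only).
struct FixedOpSketch {
  uint32_t opcode;
  uint32_t object;  // object handle
  uint32_t name;    // buffer index of the attr name, or kNone
  uint32_t value;   // buffer index of the data or attr value, or kNone
  uint64_t offset;
  uint64_t length;
};
static_assert(sizeof(FixedOpSketch) == 32,
              "sketch assumes a 32-byte record with no padding");

const uint32_t kNone = 0xffffffffu;  // "no buffer" marker (assumed)

// Append one record to the encoded byte stream.
inline void append_op(std::vector<uint8_t>& encoded, const FixedOpSketch& op) {
  const uint8_t* p = reinterpret_cast<const uint8_t*>(&op);
  encoded.insert(encoded.end(), p, p + sizeof(op));
}

// Read the i-th record directly; its position is i * sizeof(FixedOpSketch),
// so no earlier op has to be parsed first.
inline FixedOpSketch read_op(const std::vector<uint8_t>& encoded,
                             std::size_t i) {
  FixedOpSketch op;
  std::memcpy(&op, encoded.data() + i * sizeof(FixedOpSketch),
              sizeof(FixedOpSketch));
  return op;
}
```

Whether this is actually faster than a variable-length encoding would need to be measured.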

Work items

Coding tasks

  1. update Transaction API for object and collection handles
  2. include glue to support old + new interfaces.
  3. write alternate new encoding methods (based on a new feature bit)
  4. write glue decoding helpers that handle the new encoding
    1. this will let FileStore, MemStore, KeyValueStore work unmodified
  5. update FileStore to handle the new encoding explicitly
    1. this will let it avoid the dup FDCache stuff
  6. update MemStore to handle the new encoding (it will go a bit faster)
  7. update Transaction API for new buffer handles
  8. glue to support old + new interfaces
  9. update encoding, decoding helpers
  10. start KeyFileStore prototype!