Osd - update Transaction encoding
Restructure ObjectStore::Transaction for better performance and flexibility. Enable some things that KeyFileStore would like to do to avoid journaling data in some cases.
- Sage Weil (Red Hat)
- Guang Yang (Yahoo!)
Current Status
ObjectStore::Transaction encapsulates a set of changes made to the local ObjectStore (usually FileStore) as an atomic unit.
The current encoding looks something like this:
- write data X to object A in collection D
- set attribute Y on object A in collection D
- set other attribute Z on object A in collection D
- set attribute W on collection D
- set key/value U on object B in collection C
The other problem is that sequences like
- clone object A in collection D to object B in collection D
- write data X to object A in collection D
are complicated events for the backend to replay when the objects are in an unknown state (due to a failure). We do a lot of ugly tricks setting xattrs and calling fsync() to ensure ordering and prevent, say, a replayed transaction just prior to the above sequence from re-cloning A to B and polluting it with data X (that was perhaps written just prior to the crash).
Detailed Description
Two basic proposals.
First, introduce a handle-based interface and encoding for ObjectStore::Transaction. Instead of
- t.write(coll, obj, offset, length, data)
- t.setxattr(coll, obj, name, value)
- t.setxattr(othercoll, otherobj, name2, value2)
the caller would first obtain a handle for each object and then issue operations against that handle:
- int h = t.get_object(coll, obj) // returns 0
- t.write(h, offset, length, data)
- t.setxattr(h, name, value)
- int h2 = t.get_object(othercoll, otherobj) // returns 1
- t.setxattr(h2, name2, value2)
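The handle-based calls above can be sketched as a small stand-alone type. This is only an illustration, not the real ObjectStore::Transaction: the names SketchTransaction, Op and OpType are hypothetical, and reusing the same handle when get_object() is called twice for the same object is an assumption that the proposal does not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of a handle-based transaction (not the real Ceph class).
struct SketchTransaction {
  using ObjectRef = std::pair<std::string, std::string>;  // (collection, object)

  enum class OpType { Write, SetXattr };

  struct Op {
    OpType type;
    int handle;         // index into `objects`
    uint64_t offset;    // used by Write only
    uint64_t length;    // used by Write only
    std::string name;   // used by SetXattr only
    std::string value;  // data for Write, attribute value for SetXattr
  };

  std::vector<ObjectRef> objects;  // each (collection, object) is stored once
  std::vector<Op> ops;             // ops refer to objects by handle

  // Returns a small integer handle. Assumption: a repeated lookup of the
  // same object returns the handle that was handed out the first time.
  int get_object(const std::string& coll, const std::string& obj) {
    for (std::size_t i = 0; i < objects.size(); ++i) {
      if (objects[i].first == coll && objects[i].second == obj) {
        return static_cast<int>(i);
      }
    }
    objects.emplace_back(coll, obj);
    return static_cast<int>(objects.size() - 1);
  }

  void write(int h, uint64_t offset, uint64_t length, const std::string& data) {
    assert(h >= 0 && static_cast<std::size_t>(h) < objects.size());
    ops.push_back(Op{OpType::Write, h, offset, length, std::string(), data});
  }

  void setxattr(int h, const std::string& name, const std::string& value) {
    assert(h >= 0 && static_cast<std::size_t>(h) < objects.size());
    ops.push_back(Op{OpType::SetXattr, h, 0, 0, name, value});
  }
};
```

With this shape the collection and object names appear once in the object table, and every op carries only an integer, so a backend could resolve each object a single time per transaction.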
This will be an encoding change that is not backwards compatible. We will need to encode the old format when necessary for a mixed-version cluster.
For the code transition, we have two options:
- preserve the old API and implicitly do opens in the Transaction. users can be switched to use the new API over time. there will be no forcing function and it may take a while.
- update all callers to use the new API immediately. more work up front, but get full benefit immediately. may be prone to merge conflicts as this work is done.
Second, introduce buffer handles in the same way. Instead of
- t.write(coll, obj, off, len, data)
the caller would do
- int h = t.get_object(coll, obj); // returns 0
- int b = t.add_buffer(data); // returns 0
- t.write(h, off, len, b);
This lets a backend such as KeyFileStore reason as follows:
- look, the buffer is big (say > 2M). let me write it to a separate fresh file and fsync that. i'll also annotate the transaction to indicate where I wrote it. when I journal it, I will skip writing the data portion twice.
That is, we can do some metadata-only journaling. The way things are currently structured, we would have to interpret each event in the transaction, and rewrite events with a new special op to indicate what we did. With this change, the apply code can interpret the annotation and act accordingly, with basically two behaviors: either the write has the data buffer explicitly, or it has an annotation indicating where the data is already on disk.
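A minimal sketch of the buffer table and its annotation, under stated assumptions: SketchBuffer, SketchBufferTable, annotate_large_buffers() and the backing_path field are hypothetical names, and the sketch only records the annotation. A real backend would write the data to the fresh file and fsync it before setting the flag.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical buffer entry: either carries its data inline for the journal,
// or is annotated with the place where the backend already wrote it.
struct SketchBuffer {
  std::string data;          // payload supplied by the caller
  bool on_disk;              // true once annotated as already written
  std::string backing_path;  // hypothetical annotation: where the data lives
};

struct SketchBufferTable {
  std::vector<SketchBuffer> buffers;

  // Registers a buffer and returns its handle (index into the table).
  int add_buffer(const std::string& data) {
    buffers.push_back(SketchBuffer{data, false, std::string()});
    return static_cast<int>(buffers.size() - 1);
  }

  const SketchBuffer& get(int b) const {
    assert(b >= 0 && static_cast<std::size_t>(b) < buffers.size());
    return buffers[static_cast<std::size_t>(b)];
  }

  // Marks every buffer larger than `threshold` bytes as stored in a side
  // file. No I/O happens here; this only records the annotation.
  void annotate_large_buffers(std::size_t threshold, const std::string& prefix) {
    for (std::size_t i = 0; i < buffers.size(); ++i) {
      if (buffers[i].data.size() > threshold) {
        buffers[i].on_disk = true;
        buffers[i].backing_path = prefix + std::to_string(i);
      }
    }
  }

  // Data bytes the journal would still have to carry: inline buffers only.
  std::size_t journaled_data_bytes() const {
    std::size_t total = 0;
    for (const SketchBuffer& b : buffers) {
      if (!b.on_disk) total += b.data.size();
    }
    return total;
  }
};
```

Because ops would refer to buffers by handle, the annotation changes only the buffer table; the ops themselves need no rewriting, which is the point made in the paragraph above.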
Note that in order to take advantage of this effectively on filesystems like XFS, we may need to indicate in the buffer metadata whether this is a fresh object (complete overwrite), in which case KeyFileStore can simply point the object metadata at a new backing file. Alternatively, we may make the metadata representation rich enough that it can reference different backing files for different ranges, in which case the heuristic could be as simple as whether the data buffer is big or not (and block aligned, perhaps). If not, we would just fall back to data journaling (as we would in general for small writes).
Third, make the transaction encoding encapsulate each op with a length so that we can skip ops we don't understand.
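One way to picture the length-prefixed encoding is the sketch below. The wire layout (a 32-bit op code, a 32-bit payload length, then the payload, all little-endian) and the op code values are assumptions made for illustration; the proposal does not fix a format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical op codes; anything else is treated as "not understood".
enum : uint32_t { OP_WRITE = 1, OP_SETXATTR = 2 };

static void put_u32(std::vector<uint8_t>& out, uint32_t v) {
  for (int i = 0; i < 4; ++i) out.push_back(static_cast<uint8_t>((v >> (8 * i)) & 0xff));
}

// Reads a little-endian u32 at `pos`; returns false if fewer than 4 bytes remain.
static bool get_u32(const std::vector<uint8_t>& in, std::size_t& pos, uint32_t& v) {
  if (in.size() - pos < 4) return false;
  v = 0;
  for (int i = 0; i < 4; ++i) v |= static_cast<uint32_t>(in[pos + i]) << (8 * i);
  pos += 4;
  return true;
}

// Appends one op as [op code][payload length][payload].
void encode_op(std::vector<uint8_t>& out, uint32_t op, const std::string& payload) {
  assert(payload.size() <= 0xffffffffu);
  put_u32(out, op);
  put_u32(out, static_cast<uint32_t>(payload.size()));
  out.insert(out.end(), payload.begin(), payload.end());
}

struct DecodedOp {
  uint32_t op;
  std::string payload;
};

// Collects the ops this decoder understands and steps over the rest using
// the length prefix. Returns false if the input is truncated.
bool decode_known_ops(const std::vector<uint8_t>& in,
                      std::vector<DecodedOp>& known, std::size_t& skipped) {
  std::size_t pos = 0;
  skipped = 0;
  while (pos < in.size()) {
    uint32_t op = 0, len = 0;
    if (!get_u32(in, pos, op) || !get_u32(in, pos, len)) return false;
    if (in.size() - pos < len) return false;
    if (op == OP_WRITE || op == OP_SETXATTR) {
      known.push_back(DecodedOp{op, std::string(in.begin() + pos, in.begin() + pos + len)});
    } else {
      ++skipped;  // unknown op: the length prefix tells us how far to jump
    }
    pos += len;
  }
  return true;
}
```

The decoder never needs to parse an unknown payload, so an older reader can step over ops added by a newer writer. Whether skipping an op is actually safe for a given op is a separate question that this sketch does not address.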
Fourth, possibly use a fixed-length struct for each op, since the variable-length bits (object, data) are mostly called out. If we use buffers for attr names and values (or just names, and keep them short) that might speed things up?
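If every variable-length piece is referenced by handle, a fixed-length op record could look roughly like the following. The field set, the 32-byte size and the use of host byte order are all assumptions for illustration; SketchFixedOp is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical fixed-length op record: objects, names and values are all
// referenced by handle, so no field has a variable length.
struct SketchFixedOp {
  uint32_t op;             // op code
  uint32_t object_handle;  // index into the transaction's object table
  uint32_t name_buffer;    // buffer handle for an attr name (if any)
  uint32_t value_buffer;   // buffer handle for data or an attr value
  uint64_t offset;         // byte offset for writes
  uint64_t length;         // byte length for writes
};

static_assert(sizeof(SketchFixedOp) == 32, "record is expected to have no padding");
static_assert(std::is_trivially_copyable<SketchFixedOp>::value,
              "record must be safe to copy with memcpy");

// Appends one record to a flat byte array (host byte order, for brevity).
void append_op(std::vector<uint8_t>& out, const SketchFixedOp& op) {
  const std::size_t old_size = out.size();
  out.resize(old_size + sizeof(SketchFixedOp));
  std::memcpy(out.data() + old_size, &op, sizeof(SketchFixedOp));
}

// Reads the record at `index` directly, without parsing the records before it.
SketchFixedOp read_op(const std::vector<uint8_t>& in, std::size_t index) {
  assert((index + 1) * sizeof(SketchFixedOp) <= in.size());
  SketchFixedOp op;
  std::memcpy(&op, in.data() + index * sizeof(SketchFixedOp), sizeof(SketchFixedOp));
  return op;
}
```

Fixed-size records allow direct indexing, as read_op() shows. A real on-wire format would also have to pin down byte order, which this sketch leaves out.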
- update Transaction API for object and collection handles
- include glue to support old + new interfaces.
- write alternate new encoding methods (based on a new feature bit)
- write glue decoding helpers that handle the new encoding
- this will let FileStore, MemStore, KeyValueStore work unmodified
- update FileStore to handle the new encoding explicitly
- this will let it avoid the dup FDCache stuff
- MemStore to handle new encoding (it will go a bit faster)
- update Transaction API for new buffer handles
- glue to support old + new interfaces
- update encoding, decoding helpers
- start KeyFileStore prototype!