h1. Osd - update Transaction encoding

h3. Summary

Restructure ObjectStore::Transaction for better performance and flexibility, and enable some optimizations that KeyFileStore would like to use to avoid journaling data in some cases.

h3. Owners

* Sage Weil (Red Hat)
* Name

h3. Interested Parties

* Guang Yang (Yahoo!)
* Name (Affiliation)
* Name

h3. Current Status

ObjectStore::Transaction encapsulates a set of changes made to the local ObjectStore (usually FileStore) as an atomic unit.
The current encoding looks something like this:
# write data X to object A in collection D
# set attribute Y on object A in collection D
# set other attribute Z on object A in collection D
# set attribute W on collection D
# set key/value U on object B in collection C

The main point is that object names (A, B) and collection names (C, D) are re-encoded for each individual operation.  When the transaction is applied, each step has to repeat a lookup (into the FDCache, these days) of the object and collection name (arbitrary strings).  These repeated lookups are expensive.
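
To make the cost concrete, here is a rough sketch of the shape of each encoded op today (illustrative only; field names and types are not the actual ones):

<pre>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative sketch, not the actual Transaction encoding: the key
// point is that every op carries its own copy of the collection and
// object names, which the backend must re-resolve on every op.
struct encoded_op {
  uint32_t op;               // e.g. OP_WRITE, OP_SETATTR, ...
  std::string coll;          // collection name, re-encoded per op
  std::string obj;           // object name, re-encoded per op
  uint64_t off = 0, len = 0; // for data ops
  std::string attr_name;     // for attr ops
  std::vector<uint8_t> data; // inline payload (a bufferlist in real code)
};
// A transaction is, conceptually, a concatenated stream of encoded_op.
</pre>
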
The other problem is that sequences like
# clone object A in collection D to object B in collection D
# write data X to object A in collection D
# ...

are complicated events for the backend to replay when the objects are in an unknown state (due to a failure).  We do a lot of ugly tricks, setting xattrs and calling fsync(), to ensure ordering and to prevent, say, a replayed transaction just prior to the above sequence from re-cloning A to B and polluting it with data X (that was perhaps written just before the crash).

h3. Detailed Description

There are four proposals.
*First*, introduce a handle-based interface and encoding for ObjectStore::Transaction.  Instead of
# t.write(coll, obj, offset, length, data)
# t.setxattr(coll, obj, name, value)
# t.setxattr(othercoll, otherobj, name2, value2)

we would instead do something like
# int h = t.get_object(coll, obj)  // returns 0
# t.write(h, offset, length, data)
# t.setxattr(h, name, value)
# int h2 = t.get_object(othercoll, otherobj) // returns 1
# t.setxattr(h2, name2, value2)

The encoding would change accordingly.  This means that on the backend the code can do a *single* lookup on object/collection and all operations will reference it directly (using nice, small integers that index into a short vector).
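
A minimal sketch of what the handle-based Transaction could look like (hypothetical names and signatures, not a final API):

<pre>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of a handle-based Transaction.  Handles are small
// integers indexing a per-transaction table, so the backend resolves
// each (collection, object) pair exactly once at apply time.
class Transaction {
  std::vector<std::pair<std::string, std::string>> objects;  // (coll, obj)
public:
  // Register an object once; later ops reference it by index.
  int get_object(const std::string& coll, const std::string& obj) {
    objects.emplace_back(coll, obj);
    return static_cast<int>(objects.size()) - 1;
  }
  void write(int h, uint64_t off, uint64_t len,
             const std::vector<uint8_t>& data);
  void setxattr(int h, const std::string& name,
                const std::vector<uint8_t>& value);
};
</pre>
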
This encoding change will not be backward compatible; we will need to encode the old format when necessary for a mixed-version cluster.
For the code transition, we have two options:
# preserve the old API and implicitly do the opens inside the Transaction (see the glue sketch below).  Users can be switched to the new API over time; there will be no forcing function, and it may take a while.
# update all callers to use the new API immediately.  More work up front, but we get the full benefit immediately.  May be prone to merge conflicts while this work is done.
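
For option 1, the glue could be as simple as the following (hypothetical sketch, assuming the old-style write() is also declared on Transaction; a real version would probably memoize the handle so repeated ops on the same object share one table entry):

<pre>
// Hypothetical compatibility glue for option 1: the old signature stays
// and performs the implicit open, forwarding to the handle-based op.
void Transaction::write(const std::string& coll, const std::string& obj,
                        uint64_t off, uint64_t len,
                        const std::vector<uint8_t>& data) {
  int h = get_object(coll, obj);  // implicit open (could be memoized)
  write(h, off, len, data);       // forward to the handle-based op
}
</pre>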

*Second*, separate buffers out explicitly from operations that use them.  For example, instead of
# t.write(coll, obj, off, len, data)

we would do
# int h = t.get_object(coll, obj);  // returns 0
# int b = t.add_buffer(data);  // returns 0
# t.write(h, off, len, b);

The advantage of doing this is on the backend.  Without interpreting the various operations, there will be an explicit view of what data blobs are present and how big they are.  The backend (KeyFileStore, in particular) can then do things like:
* Look, the buffer is big (say > 2M).  Let me write it to a separate fresh file and fsync that.  I'll also annotate the transaction to indicate where I wrote it.  When I journal it, I will skip writing the data portion a second time.

That is, we can do some metadata-only journaling.  The way things are currently structured, we would have to interpret each event in the transaction and rewrite events with a new special op to indicate what we did.  With this change, the apply code can interpret the annotation and act accordingly, with basically two behaviors: either the write has the data buffer explicitly, or it has an annotation indicating where the data already lives on disk.
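
As a sketch, the KeyFileStore journaling decision might then look like this (the threshold and all helper names here are hypothetical):

<pre>
#include <string>

// Hypothetical sketch: because buffers are explicit in the transaction,
// the store can decide per buffer whether to journal the data itself or
// only a reference to where it was already written and fsynced.
void journal_transaction(Transaction& t) {
  const uint64_t big = 2u << 20;  // say, 2M
  for (int b = 0; b < t.num_buffers(); ++b) {
    if (t.buffer_size(b) >= big) {
      // Write the data to a fresh file and fsync it, then annotate the
      // transaction so the journal carries only the file reference.
      std::string path = write_and_fsync_fresh_file(t.buffer_data(b));
      t.annotate_buffer(b, path);  // metadata-only from here on
    }
  }
  append_to_journal(t);  // annotated buffers are not written twice
}
</pre>
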
Note that in order to take advantage of this effectively on filesystems like XFS, we may need to indicate in the buffer metadata whether this is a fresh object (complete overwrite), in which case KeyFileStore can simply point the object metadata at a new backing file.  Alternatively, we may make the metadata representation rich enough that it can reference different backing files for different ranges, in which case the heuristic could be as simple as whether the data buffer is big (and perhaps block aligned).  If not, we would just fall back to data journaling (as we would in general for small writes).

*Third*, make the transaction encoding encapsulate each op with a length, so that we can skip ops we don't understand.
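
Sketch of the decode side (illustrative framing; known_op and handle_op are stand-ins):

<pre>
#include <cstdint>
#include <cstring>

bool known_op(uint32_t op);                                   // stand-in
void handle_op(uint32_t op, const uint8_t* p, uint32_t len);  // stand-in

// Illustrative decode loop: each op is prefixed with the length of its
// payload, so an unrecognized op type can be skipped rather than making
// the whole transaction undecodable.
struct op_header {
  uint32_t op;   // opcode
  uint32_t len;  // payload bytes following this header
};

void decode_ops(const uint8_t* p, const uint8_t* end) {
  while (p + sizeof(op_header) <= end) {
    op_header h;
    std::memcpy(&h, p, sizeof(h));
    p += sizeof(h);
    if (known_op(h.op))
      handle_op(h.op, p, h.len);  // interpret the payload
    p += h.len;                   // unknown ops are simply skipped
  }
}
</pre>
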

*Fourth*, possibly use a fixed-length struct for each op, since the variable-length bits (object names, data) are mostly called out separately.  If we use buffers for attr names and values (or just names, and keep them short), that might speed things up?
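
For the fourth idea, each op could reduce to a fixed-size record along these lines (illustrative only):

<pre>
#include <cstdint>

// Illustrative fixed-length op record: all variable-length pieces
// (object names, data, attr names/values) live in side tables and are
// referenced by small integer handles, so every op is one size and
// decode is just pointer arithmetic over an array of op_rec.
struct op_rec {
  uint32_t op;      // opcode
  uint32_t object;  // index into the object table
  uint32_t name;    // buffer index for an attr/key name, if any
  uint32_t data;    // buffer index for a data payload, if any
  uint64_t off;     // offset, for data ops
  uint64_t len;     // length, for data ops
};
</pre>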

h3. Work items

h4. Coding tasks

# update Transaction API for object and collection handles
# include glue to support old + new interfaces
# write alternate new encoding methods (based on a new feature bit)
# write glue decoding helpers that handle the new encoding
## this will let FileStore, MemStore, KeyValueStore work unmodified
# update FileStore to handle the new encoding explicitly
## this will let it avoid the duplicate FDCache lookups
# update MemStore to handle the new encoding (it will go a bit faster)
# update Transaction API for new buffer handles
# glue to support old + new interfaces
# update encoding/decoding helpers
# start KeyFileStore prototype!