RGW Object Versioning » History » Version 1

Jessica Mack, 06/01/2015 08:34 PM

h1. RGW Object Versioning

One of the next features that we're working on is the long-overdue object versioning. This allows keeping old versions of objects inside buckets, even if the user has removed or overwritten them. Every object instance is immutable, and an object can be fetched by the version (instance) id of that instance.
 
When an object is removed without specifying a version, a new deletion marker is created. It is, however, possible to remove a specific object version, in which case that version is no longer accessible. What complicates things is that if the current version of the object (the one that is accessed when accessing the object without specifying a version) is removed, the object then points at its previous version. Permissions are set at the object version level.
 
Reading an object that was removed without specifying a version id returns a 404. In that case the underlying object instance is not removed; instead a new deletion marker is created, and the object logical head moves to point at the marker. If, however, you remove the object version itself (by specifying the version id), the object logical head moves to the previous version.

Another requirement is the ability to list all objects and versions of the objects. This means that when listing objects we need to list either only the current objects, or both the current objects and their respective versions.
  
One thing to note is that object versioning needs to be switched on for the bucket for the feature to be activated, and once it's switched on it can only be suspended. While versioning is suspended, newly created objects will not be versioned, but old versions will still be accessible.
 
Let's sum up the functionality:

* ability to list objects and versions
* ability to read specific object version
* ability to remove a specific object version (*)
* object creation / overwrite creates a new object version, object points at new instance
* object removal does not remove object instance, creates a deletion marker
* (*) removal of the current object version rolls back object to point at previous object version
* permissions affect the object version and can be set on the versions
* a GET can still be serviced by going directly to librados objects, without consulting an index (which would break read-side bucket scalability)
* a bucket listing is still reasonably efficient (normally performed by consulting the index object only).
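The user-visible semantics summarized above can be illustrated with a small in-memory model. This is only a sketch of the behaviour, not of the rgw design: the class and method names and the @v1@, @v2@ style version ids are hypothetical, and a deletion marker is modelled as an instance with no data.

```python
import itertools
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    version_id: str
    data: Optional[bytes]  # None means this instance is a deletion marker


class VersionedObject:
    """Toy model of one object name in a versioning-enabled bucket."""

    def __init__(self) -> None:
        self.instances: List[Instance] = []  # oldest first; the last one is current
        self._ids = itertools.count(1)       # hypothetical version id generator

    def _new_id(self) -> str:
        return f"v{next(self._ids)}"

    def put(self, data: bytes) -> str:
        # Creation / overwrite adds a new instance and makes it current.
        vid = self._new_id()
        self.instances.append(Instance(vid, data))
        return vid

    def delete(self) -> str:
        # Removal without a version id keeps all instances and adds a marker.
        vid = self._new_id()
        self.instances.append(Instance(vid, None))
        return vid

    def delete_version(self, version_id: str) -> None:
        # Removing a specific version; if it was current, the previous
        # instance becomes current again.
        self.instances = [i for i in self.instances if i.version_id != version_id]

    def get(self, version_id: Optional[str] = None) -> bytes:
        if version_id is None:
            candidates = self.instances[-1:]
        else:
            candidates = [i for i in self.instances if i.version_id == version_id]
        if not candidates or candidates[0].data is None:
            raise KeyError("404")
        return candidates[0].data

    def list_versions(self) -> List[str]:
        # Newest first.
        return [i.version_id for i in reversed(self.instances)]
```

In this model, a @put@, @put@, @delete@ sequence leaves both data versions readable by id while a plain @get@ fails; removing the marker by its id makes the second version current again.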
 
Now, considering this functionality, it seems that we need to deal with 3 different entities:

* bucket index
* object instances (versions)
* object logical head (olh)
 
The first two can be mapped nicely onto the already existing structures. The existing bucket index will be extended to keep the list of versions, and our current rgw objects will be used to handle the object instances, as they serve the same function. There is one difference, though: previously the head was addressable by the object name, whereas here it is addressed by object name + tag/version, so that the heads don't collide with other object versions.
 
One option that we can consider for the object logical head is to use a regular object that just holds a copy of the appropriate instance manifest. It doesn't seem that this would work, as it doesn't satisfy the requirement that permissions are set at the version level. What we need is some sort of soft link that points at the appropriate object instance.
 
We had internal discussions on how to make everything work together. There are a few things that we need to be careful about. We need to make sure that the bucket index listing reflects the status of the actual objects: when the olh points at a specific version, listing the objects shouldn't show a different view. This gets even more complicated when removing an object version requires an olh change, as we have 3 different entities that we need to keep in sync (the index, the object version, and the olh pointer). Note that rados does not have multi-object transactions (for now), and we have traditionally avoided locking for rgw object operations.
 
The current scheme is that we update the bucket index using a 2-phase commit, and it follows the objects' state. So when adding / removing an object, we first tell the bucket index to 'prepare' for the operation, then do the operation, and finally let the bucket index know about the completion. For ordering we rely on the pg versioning system, which gives us insight into the timeline, so that when two concurrent operations happen on the same object the bucket index can figure out which one won and which one lost. This system as it is doesn't really work with versioning, as we have both the olh and the object instances. This is one of the solutions that we came up with:

* The bucket index will be the source of truth
* The bucket index will serve as an operational log for olh operations
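For reference, the existing prepare / complete scheme described above can be sketched roughly as follows. This is a toy in-memory stand-in, not the actual bucket index code: the @BucketIndex@ class, its method names and the integer @pg_version@ argument are all assumptions made for illustration.

```python
from typing import Dict, Tuple


class BucketIndex:
    """Toy two-phase bucket index: prepare, do the object op, then complete."""

    def __init__(self) -> None:
        self.pending: Dict[str, Tuple[str, str]] = {}   # tag -> (object name, op)
        self.entries: Dict[str, Tuple[int, bool]] = {}  # name -> (pg version, exists)

    def prepare(self, tag: str, name: str, op: str) -> None:
        # Phase 1: record the intent before touching the object itself.
        self.pending[tag] = (name, op)

    def complete(self, tag: str, pg_version: int) -> bool:
        # Phase 2: the object op is done; pg_version orders it against
        # concurrent ops on the same object. Older completions lose.
        name, op = self.pending.pop(tag)
        current = self.entries.get(name)
        if current is not None and current[0] >= pg_version:
            return False
        self.entries[name] = (pg_version, op == "put")
        return True
```

If a put and a delete race on the same object and the delete completes with the higher version, a late completion of the put is ignored and the index keeps showing the object as removed.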
 
The bucket index will index every object instance in reverse order (from new to old). The bucket index will keep entries for deletion markers.

The bucket index will also keep an operations journal for olh modifications. Each operation in this journal will have an id that increases monotonically and is tied to the current olh version. The olh will be modified using idempotent operations that apply only if the olh's current version is smaller than the operation id.

The journal will be used for keeping order, and the entries in the journal will serve as a blueprint that the gateways need to follow when applying changes. To ensure that operations that need to complete are completed, we'll mark the olh before going to the bucket index, so that if the gateway dies before completing the operation, the next time we try to access the object we'll know that we need to go to the bucket index and complete the operation.
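The idempotency rule for journal operations might look something like the sketch below. The @Olh@ and @JournalOp@ types and the function names are hypothetical; the only thing taken from the text is the guard that an op applies only when the olh's current version is smaller than the op id.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Olh:
    olh_version: int = 0
    head_version: Optional[str] = None  # instance the olh currently points at


@dataclass(frozen=True)
class JournalOp:
    op_id: int                   # monotonically increasing journal id
    head_version: Optional[str]  # where the olh should point afterwards


def apply_op(olh: Olh, op: JournalOp) -> bool:
    """Apply op only if the olh is older than it; otherwise do nothing."""
    if olh.olh_version >= op.op_id:
        return False
    olh.olh_version = op.op_id
    olh.head_version = op.head_version
    return True


def replay(olh: Olh, journal: Iterable[JournalOp]) -> int:
    """Replay journal entries in id order; returns how many were applied."""
    return sum(apply_op(olh, op) for op in sorted(journal, key=lambda o: o.op_id))
```

Because of the version guard, several gateways can replay the same journal entries, or one gateway can replay them twice after a crash, without moving the olh backwards.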
Things will then work like this:

h4. object read

# look at olh
# if marked as pending-modify,
## check index for current head version, and use that value
## if pending-modify is super old and no matching index entry exists, remove marker
## if index entry does exist, send async op to roll-forward the olh
# read referenced object version
 ...and the 'roll-forward' on the olh would be something like
 cmpxattr pending-modify-$tag == 1
 cmpxattr olh_version < new v
 setxattr olh_version = new v
 setxattr head_version = whatever
 rmxattr pending-modify-$tag
 This has the side-effect that a hot object will briefly pummel the index.  That is probably fine...

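The read path above could be sketched as follows. Everything here is an assumption for illustration: the @Olh@ type, the @resolve_head@ helper, the @PENDING_MAX_AGE@ threshold, and the way the index state is passed in as plain arguments. The roll-forward is done inline, whereas the text proposes sending it as an async op.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

PENDING_MAX_AGE = 3600.0  # hypothetical: seconds before a pending mark is "super old"


@dataclass
class Olh:
    head_version: Optional[str] = None
    pending: Dict[str, float] = field(default_factory=dict)  # tag -> mark timestamp


def resolve_head(olh: Olh, index_head: Optional[str], index_tags: Set[str],
                 now: Optional[float] = None) -> Optional[str]:
    """Return the version a read should use.

    index_head: version the bucket index considers current.
    index_tags: tags that have a matching entry in the index.
    """
    now = time.time() if now is None else now
    if not olh.pending:
        return olh.head_version  # fast path: the index is not consulted
    for tag, ts in list(olh.pending.items()):
        if tag in index_tags:
            # Index entry exists: roll the olh forward and clear the mark.
            olh.head_version = index_head
            del olh.pending[tag]
        elif now - ts > PENDING_MAX_AGE:
            # Old mark with no index entry: the op never really started.
            del olh.pending[tag]
    return index_head  # while marked, the index value is used
```

Only reads that find a pending mark go to the index, which is why a hot object hits the index briefly rather than on every GET.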

We also need to rmxattr pending-modify-$tag for all prior modifications that are in the index/journal at the time.

h4. object creation

# Create object instance
# Mark olh that it's about to be modified:
 setxattr pending-modify-$tag=timestamp
 If we fail before 2 then the (partial) object version should get garbage collected.
# Update bucket index about new object instance
 omap_setkeys journal_$object_$olhversion_$tag = pending
# Read bucket index object op journal
 Note that the journal should have at this point an entry that says 'point olh to specific object version, subject to olh is at version X'.
# Apply journal ops:
 cmpxattr pending-modify-$tag == timestamp
 cmpxattr olh_version == $olh_version_old
 setxattr olh_version = $olh_version_new
 setxattr head_version = whatever
 rmxattr pending-modify-$tag
# Trim journal, unmark olh:
 call rgw.trim_journal($object, $olh_version_new)

h4. object removal (olh)

# Mark olh that it's about to be modified
 setxattr pending-modify-$tagthing
# Update bucket index about the new deletion marker
 omap_setkeys ...
# Read bucket index object op journal
 call rgw.describe_olh_op $bucket $object
 The journal entry should say something like 'mark olh as removed, subject to olh is at version X'
# Apply ops
 cmpxattr olh_version == $olh_version_old
 setxattr olh_version = $olh_version_new
 setxattr head_version = whiteout
 rmxattr pending-modify-$tag (for all pending tags)
# Trim journal, unmark olh
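Both flows above apply their journal ops as one guarded update on the olh: compare first, then set. A minimal sketch of the creation variant is shown below, with a plain dict standing in for the olh xattrs. The @apply_olh_update@ helper is hypothetical, and in rados the compare and set steps would be part of a single compound object operation rather than separate Python statements.

```python
from typing import Dict, Union

XattrValue = Union[int, str]


def apply_olh_update(xattrs: Dict[str, XattrValue], tag: str, timestamp: int,
                     old_version: int, new_version: int, head_version: str) -> bool:
    """Guarded olh update; changes nothing unless every comparison passes."""
    key = f"pending-modify-{tag}"
    if xattrs.get(key) != timestamp:                # cmpxattr pending-modify-$tag
        return False
    if xattrs.get("olh_version", 0) != old_version:  # cmpxattr olh_version
        return False
    xattrs["olh_version"] = new_version              # setxattr olh_version
    xattrs["head_version"] = head_version            # setxattr head_version
    del xattrs[key]                                  # rmxattr pending-modify-$tag
    return True
```

A second attempt with the same arguments fails the first comparison, since the pending mark is already gone, so a retried or duplicated apply leaves the olh untouched.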
 
Another option is to actually remove the olh, but in this case we'll lose the olh versioning. In that case we could use the object's non-existent state as a check, but that will not be enough, as there are some corner cases where we could end up with the olh pointing at the wrong object.

h4. object version removal

# Mark olh as it will potentially be modified
 setxattr pending-modify-$tag = timestamp
# Update bucket index about object instance removal
 omap_setkeys ...
# Read bucket index op journal
 call rgw.describe_olh_op $bucket $object $tag
# Apply ops journal
 Now the journal might just say something like 'remove object instance', which means that the olh was pointing at a different object version. The more interesting case is when the olh is pointing at this specific object version. In this case the journal will say something like 'first point the olh at version V2, subject to olh is at version X. Now, remove object instance V1'.
 cmpxattr olh_version == $olh_version_old
 setxattr olh_version = $olh_version_new
 rmxattr pending-modify-$tag (for all pending tags)
# Trim journal, unmark olh

Note about olh marking: The olh mark will create an attr on the olh that will have an id and a timestamp. There could be multiple marks on the olh, and the marks should have some expiration, so that operations that did not really start would be removed after a while.

There is another case here, when all versions get removed. In that case, the final op would just remove the olh entirely. Later, when we recreate the object, the object create would be:
 
# write object version
# write to journal
# describe olh op
# create/update olh
# trim journal
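The journal entries for the version-removal cases discussed above (olh pointing elsewhere, olh pointing at the removed version, and the last version being removed) could be derived from the index's version list along these lines. The @describe_removal@ helper and the op names are hypothetical, and placing the olh removal before the instance removal is an assumption modelled on the 'first point the olh at V2, then remove V1' ordering.

```python
from typing import List, Optional, Tuple

Op = Tuple[str, Optional[str]]


def describe_removal(versions: List[str], head: str, target: str) -> List[Op]:
    """Ops needed to remove `target`.

    versions: version ids as kept by the index, newest first.
    head: the version the olh currently points at.
    """
    idx = versions.index(target)  # raises ValueError if the version is unknown
    ops: List[Op] = []
    if target == head:
        older = versions[idx + 1:]
        if older:
            # Roll the olh back to the previous version first.
            ops.append(("point_olh", older[0]))
        else:
            # Nothing left to point at: remove the olh entirely.
            ops.append(("remove_olh", None))
    ops.append(("remove_instance", target))
    return ops
```

Removing a non-current version yields only the instance removal, which matches the simple 'remove object instance' journal entry.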