h1. Librbd - shared flag object map

h3. Summary

We need to consider the tradeoff between multi-client and single-client support in librbd. In practice, most volumes/images are used by VMs, where only one client will access or modify the image. We should not make shared images possible at the cost of degrading the most common use cases. So we can add a new "shared" flag when creating an image. If "shared" is false, librbd will maintain an object map for each image.
 
The advantages of this feature are easy to see:
# Avoid the clone performance problem
# Make snapshot statistics possible
# Improve librbd operation performance, including reads and copy-on-write operations.

h3. Owners

* Haomai Wang (UnitedStack)
* Josh Durgin (Red Hat)
* Jason Dillaman (Red Hat)

h3. Interested Parties

* Name (Affiliation)
* Name (Affiliation)
* Name

h3. Detailed Description

For non-shared images (such as VMs), an object map will be constructed and maintained to track the current in-use state of each RADOS object within an image. For each object within an image, its state in the object map will be either NON-EXISTENT, PENDING DELETE, or MAY EXIST. Images can be flagged as shared at creation time (create, import, clone, copy) to disable the use of the new object map optimizations.
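 
As a rough illustration of the object map just described, the per-object state could be held in a simple in-memory array indexed by object number. This is a sketch only; the names, the on-disk encoding, and the durability handling are not specified by this blueprint.

<pre><code class="cpp">
#include <cstdint>
#include <vector>

// Illustrative only: the three per-object states tracked by the object map.
enum class ObjectState : uint8_t {
  NON_EXISTENT   = 0,  // object was never written (or was fully deleted)
  PENDING_DELETE = 1,  // a delete was issued but not yet confirmed
  MAY_EXIST      = 2   // object may hold data; RADOS ops are still required
};

// One entry per RADOS object backing the image:
//   object_count = ceil(image_size / object_size)
class ObjectMap {
public:
  explicit ObjectMap(uint64_t object_count)
    : states_(object_count, ObjectState::NON_EXISTENT) {}

  ObjectState state(uint64_t object_no) const { return states_.at(object_no); }
  void set_state(uint64_t object_no, ObjectState s) { states_.at(object_no) = s; }

  // A read can be skipped entirely when the object is known not to exist.
  bool read_required(uint64_t object_no) const {
    return states_.at(object_no) == ObjectState::MAY_EXIST;
  }

private:
  std::vector<ObjectState> states_;
};
</code></pre>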
 
IO write operations will update the object map state to MAY EXIST prior to submitting the write request to RADOS. Since this update is only issued once per object on a state change, the latency cost of the extra operation should be negligible. IO read operations will check the object map for MAY EXIST objects to determine whether a RADOS read op is required. IO delete operations (trims, discards, etc.) will bulk-update all objects flagged as PENDING DELETE or MAY EXIST to PENDING DELETE prior to submitting the delete requests to RADOS, followed by updating the object map to NON-EXISTENT afterwards.
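 
A minimal sketch of that ordering, reusing the illustrative ObjectMap from the previous sketch; submit_write, submit_read, and submit_remove are placeholders standing in for the real RADOS operations, not existing librbd functions.

<pre><code class="cpp">
// Sketch only: ordering of object map updates relative to the RADOS I/O.
// These three functions are placeholders for the real asynchronous RADOS ops.
void submit_write(uint64_t object_no);
void submit_read(uint64_t object_no);
void submit_remove(uint64_t object_no);

void write_object(ObjectMap &map, uint64_t object_no) {
  // Mark the object *before* the data reaches RADOS so the map never
  // claims NON_EXISTENT for an object that actually holds data.
  if (map.state(object_no) != ObjectState::MAY_EXIST) {
    map.set_state(object_no, ObjectState::MAY_EXIST);  // happens once per object
  }
  submit_write(object_no);
}

bool read_object(ObjectMap &map, uint64_t object_no) {
  if (!map.read_required(object_no)) {
    return false;  // known to be absent: zero-fill locally, skip the round trip
  }
  submit_read(object_no);
  return true;
}

void delete_objects(ObjectMap &map, uint64_t begin, uint64_t end) {
  // First pass: everything that may exist is bulk-flagged PENDING_DELETE...
  for (uint64_t i = begin; i < end; ++i) {
    if (map.state(i) != ObjectState::NON_EXISTENT) {
      map.set_state(i, ObjectState::PENDING_DELETE);
    }
  }
  // ...then the RADOS deletes are issued, and only afterwards is the state
  // dropped back to NON_EXISTENT.
  for (uint64_t i = begin; i < end; ++i) {
    if (map.state(i) == ObjectState::PENDING_DELETE) {
      submit_remove(i);
      map.set_state(i, ObjectState::NON_EXISTENT);
    }
  }
}
</code></pre>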
 
The use of the object map will require an exclusive lock on the image to prevent two or more clients from manipulating the same image.  This exclusive lock will be handled as a new RBD feature bit to prevent older, incompatible clients from attempting to access an image using the new exclusive lock functionality.  The new lock will be associated with the rbd_header.<id> object for the image so that it is compatible with / subsumes the current cooperative RBD locking functionality.  The new locking functionality will also be utilized by the future RBD mirroring feature.
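 
For context, the cooperative locking that this would subsume is exposed today through librados advisory locks on the header object. Below is a minimal sketch of a client taking such a lock, assuming a header object named rbd_header.<id>; the lock name, cookie, and description strings are illustrative, and the new feature-bit-gated lock may use a different mechanism internally.

<pre><code class="cpp">
#include <rados/librados.hpp>
#include <string>

// Sketch only: taking a cooperative exclusive lock on an image's header
// object via the existing librados advisory-lock API.
int lock_image_header(librados::IoCtx &io_ctx, const std::string &image_id) {
  const std::string header_oid = "rbd_header." + image_id;
  // The lock name and cookie identify the lock and this particular locker.
  return io_ctx.lock_exclusive(header_oid, "rbd_lock", "cookie-123",
                               "librbd exclusive lock",
                               nullptr /* no expiration */, 0 /* flags */);
}
</code></pre>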
 
Clients attempting to perform image maintenance operations (i.e. resize, snapshot, flatten) will proxy their requests to the client currently holding the exclusive lock on the image.  This will be accomplished through the use of watch/notify events against the rbd_header.<id> object.  RBD currently uses this object to notify other clients of RBD header updates.  This functionality will be expanded to allow clients to send requests to the current exclusive lock holder.  The notification types are summarized in the table below.
 
|+Operation+|+Direction+|+Notes+|
|Exclusive Lock Acquired|Lock Owner -> Peers|When a new client acquires the exclusive lock for an image, it will broadcast this notification to all other clients with the same image open.  This will allow other clients to gracefully retry pending requests.|
|Exclusive Lock Request (IO write/discard ops)|Peer -> Lock Owner|When a client needs to modify the image and another client already holds the lock to the image, the new client can send a request to the current owner to gracefully transfer the lock.  Live migration of a VM is one possible use case.|
|Exclusive Lock Release|Lock Owner -> Peers|When the current lock owner releases the lock, it broadcasts a notification to all peers so that they can attempt to acquire the lock (if needed).|
|Header Update|Peer -> Peer|Support for the legacy header update notification|
|Flatten|Peer -> Lock Owner|When a client needs to flatten an image, it will send a notification to the current lock owner requesting the flattening operation.  The lock owner will asynchronously start the flatten operation by throttling X copy-up requests -- sending new requests as the old requests complete.  Periodic progress updates and the final status will be sent to the requesting client.|
|Resize|Peer -> Lock Owner|When a client needs to resize an image, it will send a notification to the current lock owner requesting the resize operation.  The lock owner will asynchronously start to discard objects (if shrinking) by throttling X discard requests -- sending new requests as the old requests complete.  Periodic progress updates and the final status will be sent to the requesting client.|
|Snap Create|Peer -> Lock Owner|When a client needs to create a snapshot, it will send a notification to the current lock owner requesting the snapshot.  The lock owner will flush its cache and create the snapshot upon request.|
|Snap Rollback| |Support not currently planned|
|Async Progress Update|Lock Owner -> Peer|For long-running operations, the lock owner will send periodic progress updates to the requesting client.|
|Async Result|Lock Owner -> Peer|For long-running operations, the lock owner will send the final result to the requesting client.|
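 
Below is a minimal sketch of how a peer could proxy one of these requests over watch/notify on the rbd_header.<id> object, assuming librados' notify2/WatchCtx2 interfaces; the request encoding is a placeholder, since the actual librbd message format is not defined by this blueprint.

<pre><code class="cpp">
#include <rados/librados.hpp>
#include <string>

// Sketch only: a peer asks whichever client holds the exclusive lock (and is
// watching rbd_header.<id>) to create a snapshot.  The payload encoding is a
// placeholder, not the real librbd wire format.
int request_snap_create(librados::IoCtx &io_ctx, const std::string &header_oid,
                        const std::string &snap_name) {
  librados::bufferlist request, response;
  request.append("snap_create ");
  request.append(snap_name);
  // Delivered to every watcher; the lock owner performs the snapshot and the
  // call returns once all watchers have acknowledged (or the timeout expires).
  return io_ctx.notify2(header_oid, request, 30 * 1000 /* ms */, &response);
}

// On the lock owner's side, proxied requests arrive through a registered watch.
struct HeaderWatcher : public librados::WatchCtx2 {
  librados::IoCtx &io_ctx;
  std::string header_oid;

  HeaderWatcher(librados::IoCtx &io, const std::string &oid)
    : io_ctx(io), header_oid(oid) {}

  void handle_notify(uint64_t notify_id, uint64_t cookie,
                     uint64_t notifier_id, librados::bufferlist &bl) override {
    // Decode the request, perform (or queue) the operation, then acknowledge
    // so the requesting peer's notify2() completes.
    librados::bufferlist ack;
    io_ctx.notify_ack(header_oid, notify_id, cookie, ack);
  }

  void handle_error(uint64_t cookie, int err) override {
    // A missed notification or watch error; the watch should be re-established.
  }
};
</code></pre>
 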
h3. Work items

h4. Coding tasks

# http://tracker.ceph.com/issues/8900
# https://github.com/ceph/ceph/compare/wip-8900 (WIP Exclusive Locking)
# http://tracker.ceph.com/issues/8901
# http://tracker.ceph.com/issues/8902
# https://github.com/ceph/ceph/compare/wip-8902 (WIP Flatten/Resize/Snapshot Proxying)
# http://tracker.ceph.com/issues/8903
# http://tracker.ceph.com/issues/4087
# https://github.com/ceph/ceph/compare/wip-4087 (WIP Object Map)
# http://tracker.ceph.com/issues/7746
# https://github.com/dillaman/ceph/compare/wip-7746 (WIP RBD diff Object Map optimizations)
 
h3. Historical Notes

There are two important things to do:
# Make the implementation of the ObjectMap (or Index) as durable as possible.
# Handle the effects of snapshots and live migration.
 
By Josh:
I think it's a great idea! We discussed this a little at the last CDS [1]. I like the idea of the shared flag on an image. Since the vastly more common case is single-client, I'd go further and suggest that we treat images as if shared is false by default if the flag is not present (perhaps with a config option to change this default behavior).

That way existing images can benefit from the feature without extra configuration. There can be an rbd command to toggle the shared flag as well, so users of ocfs2 or gfs2 or other multi-client-writing systems can upgrade and set shared to true before restarting their clients.

Another thing to consider is the granularity of the object map. The coarse granularity of a bitmap of object existence would be simplest, and most useful for in-memory comparison for clones. For statistics it might be desirable in the future to have a finer-grained index of data existence in the image. To make that easy to handle, the on-disk format could be a list of extents (byte ranges).

Another potential use case would be a mode in which the index is treated as authoritative. This could make discard very fast, for example. I'm not sure it could be done safely with only binary 'exists/does not exist' information though - a third 'unknown' state might be needed for some cases. If this kind of index is actually useful (I'm not sure there are cases where the performance penalty would be worth it), we could add a new index format if we need it.

Back to the currently proposed design, to be safe with live migration we'd need to make sure the index is consistent in the destination process. Using rados_notify() after we set the clean flag on the index can make the destination vm re-read the index before any I/O happens. This might be a good time to introduce a data payload to the notify as well, so we can only re-read the index, instead of all the header metadata. Rereading the index after cache invalidation and wiring that up through qemu's bdrv_invalidate() would be even better.
 
h4. Build / release tasks

# Task 1
# Task 2
# Task 3

h4. Documentation tasks

# Task 1
# Task 2
# Task 3

h4. Deprecation tasks

# Task 1
# Task 2
# Task 3