h1. Librbd - shared flag object map

h3. Summary

We need to consider a tradeoff between multi-client and single-client support in librbd. In practice, most volumes/images are used by VMs, and only a single client will access and modify a given image. We should not make shared images possible at the cost of degrading the most common use case. So we can add a new flag called "shared" that is specified when an image is created. If "shared" is false, librbd will maintain an object map for the image.
 
The advantages of this feature are easy to see:
# Avoid the clone performance problem
# Make snapshot statistics possible
# Improve librbd operation performance, including read and copy-on-write operations
 
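To make the proposal concrete, here is a minimal sketch of how the flag could surface through the existing librbd C++ API, assuming the non-shared behaviour is expressed as new feature bits at image creation time. The RBD_FEATURE_EXCLUSIVE_LOCK and RBD_FEATURE_OBJECT_MAP names are placeholders for the bits this blueprint proposes and are not part of the current API.

<pre><code class="cpp">
#include <rados/librados.hpp>
#include <rbd/librbd.hpp>

int main() {
  librados::Rados cluster;
  cluster.init("admin");            // connect as client.admin
  cluster.conf_read_file(nullptr);  // read the default ceph.conf
  cluster.connect();

  librados::IoCtx io_ctx;
  cluster.ioctx_create("rbd", io_ctx);

  librbd::RBD rbd;
  int order = 22;                   // 4 MiB objects

  // Non-shared image (the common VM case): request the proposed
  // exclusive-lock + object-map feature bits (names illustrative only).
  uint64_t features = RBD_FEATURE_LAYERING |
                      RBD_FEATURE_EXCLUSIVE_LOCK |
                      RBD_FEATURE_OBJECT_MAP;
  rbd.create2(io_ctx, "vm-volume", 10ull << 30, features, &order);

  // Shared image: omit the new bits so multiple clients (e.g. ocfs2/gfs2
  // on top of RBD) can keep writing concurrently.
  rbd.create2(io_ctx, "shared-volume", 10ull << 30,
              RBD_FEATURE_LAYERING, &order);

  cluster.shutdown();
  return 0;
}
</code></pre>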
 
h3. Owners
 
* Haomai Wang (UnitedStack)
* Josh Durgin (Red Hat)
* Jason Dillaman (Red Hat)
 
h3. Interested Parties
 
* Name (Affiliation)
* Name (Affiliation)
* Name
 
h3. Detailed Description
 
For non-shared images (such as those used by a single VM), an object map will be constructed and maintained to track the current in-use state of each RADOS object within the image.  For each object within an image, the object map will record one of three states: NON-EXISTENT, PENDING DELETE, or MAY EXIST.  Images can be flagged as shared at creation time (create, import, clone, copy) to disable the use of the new object map optimizations.
 
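To make the three states concrete, a minimal in-memory sketch of the per-object map might look like the following; the type and member names are hypothetical and do not reflect the final on-disk or on-wire format:

<pre><code class="cpp">
#include <cstdint>
#include <vector>

// Hypothetical sketch of the object map described above.
enum class ObjectState : uint8_t {
  NON_EXISTENT,    // object has never been written (or has been deleted)
  PENDING_DELETE,  // a delete/trim has been issued but not yet confirmed
  MAY_EXIST        // object has been written at least once
};

struct ObjectMap {
  // One entry per RADOS object backing the image, indexed by object number.
  std::vector<ObjectState> states;

  explicit ObjectMap(uint64_t object_count)
    : states(object_count, ObjectState::NON_EXISTENT) {}
};
</code></pre>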
 
IO write operations will update the object map state to MAY EXIST prior to submitting the write request to RADOS.  Since this update is only needed once per object, when its state changes, the latency cost of the extra operation should be negligible. IO read operations will check the object map for MAY EXIST objects to determine whether a RADOS read op is required. IO delete operations (trims, discards, etc.) will bulk-update all objects flagged as PENDING DELETE or MAY EXIST to PENDING DELETE prior to submitting the delete requests to RADOS, and will update the object map to NON-EXISTENT afterwards.
 
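Building on the ObjectMap sketch above, the write, read, and trim paths could consult and update the map roughly as follows. The persist_object_map() and rados_*() helpers are placeholders for the real librbd/RADOS calls:

<pre><code class="cpp">
// Placeholder helpers standing in for the real librbd/RADOS machinery.
void persist_object_map(const ObjectMap&) { /* flush map update to RADOS */ }
void rados_write(uint64_t object_no)      { /* submit object write op   */ }
void rados_read(uint64_t object_no)       { /* submit object read op    */ }
void rados_delete(uint64_t object_no)     { /* submit object delete op  */ }

// Write path: flip the object to MAY_EXIST before the first write touches
// it; the extra round-trip is paid at most once per object.
void do_write(ObjectMap& map, uint64_t object_no) {
  if (map.states[object_no] != ObjectState::MAY_EXIST) {
    map.states[object_no] = ObjectState::MAY_EXIST;
    persist_object_map(map);
  }
  rados_write(object_no);
}

// Read path: objects still marked NON_EXISTENT can be served as zeroes
// without a RADOS read op.
void do_read(const ObjectMap& map, uint64_t object_no) {
  if (map.states[object_no] == ObjectState::NON_EXISTENT)
    return;  // sparse region, no I/O required
  rados_read(object_no);
}

// Delete/trim path: bulk-mark affected objects PENDING_DELETE, issue the
// deletes, then record them as NON_EXISTENT.
void do_trim(ObjectMap& map, uint64_t begin, uint64_t end) {
  for (uint64_t no = begin; no < end; ++no)
    if (map.states[no] != ObjectState::NON_EXISTENT)
      map.states[no] = ObjectState::PENDING_DELETE;
  persist_object_map(map);

  for (uint64_t no = begin; no < end; ++no)
    if (map.states[no] == ObjectState::PENDING_DELETE)
      rados_delete(no);

  for (uint64_t no = begin; no < end; ++no)
    map.states[no] = ObjectState::NON_EXISTENT;
  persist_object_map(map);
}
</code></pre>

Ordering matters here: the map is flushed to PENDING DELETE before any deletes are issued, so a crash mid-trim can never leave an object present while the map claims it does not exist.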
 
The use of the object map will require an exclusive lock on the image to prevent two or more clients from manipulating the same image.  This exclusive lock will be handled as a new RBD feature bit to prevent older, incompatible clients from attempting to access an image using the new exclusive lock functionality.  The new lock will be associated with the rbd_header.<id> object for the image so that it is compatible with / subsumes the current cooperative RBD locking functionality.  The new locking functionality will also be utilized by the future RBD mirroring feature.
 
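For illustration only, the acquire-and-announce flow could be expressed with the public librados advisory-lock and watch/notify primitives as below; the actual implementation would live inside librbd with its own lock tag, cookie, and payload conventions:

<pre><code class="cpp">
#include <rados/librados.hpp>
#include <string>

// Illustrative sketch: take an exclusive advisory lock on the image's
// rbd_header.<id> object, then notify peers watching that object so they
// can re-queue any pending requests against the new lock owner.
int acquire_image_lock(librados::IoCtx& io_ctx, const std::string& image_id) {
  const std::string header_oid = "rbd_header." + image_id;

  int r = io_ctx.lock_exclusive(header_oid, "rbd_lock", "internal-cookie",
                                "", nullptr, 0);
  if (r < 0)
    return r;  // another client owns the lock; ask it to release instead

  librados::bufferlist payload;  // would encode an "acquired lock" message
  librados::bufferlist reply;
  io_ctx.notify2(header_oid, payload, 5000 /* ms */, &reply);
  return 0;
}
</code></pre>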
 
Clients attempting to perform image maintenance operations (e.g. resize, snapshot, flatten) will proxy their requests to the client currently holding the exclusive lock on the image.  This will be accomplished through the use of watch/notify events against the rbd_header.<id> object.  RBD currently uses this object to notify other clients of RBD header updates.  This functionality will be expanded to allow clients to send requests to the current exclusive lock holder.
 
|+Operation+ |+Direction+ |+Notes+|
|Exclusive Lock Acquired |Lock Owner -> Peers |When a new client acquires the exclusive lock for an image, it will broadcast this notification to all other clients with the same image open. This will allow other clients to gracefully retry pending requests.|
|Exclusive Lock Request (IO write/discard ops) |Peer -> Lock Owner |When a client needs to modify the image and another client already holds the lock to the image, the new client can send a request to the current owner to gracefully transfer the lock. Live migration of a VM is one possible use case.|
|Exclusive Lock Release |Lock Owner -> Peers |When the current lock owner releases the lock, it broadcasts a notification to all peers so that they can attempt to acquire the lock (if needed).|
|Header Update |Peer -> Peer |Support for the legacy header update notification.|
|Flatten |Peer -> Lock Owner |When a client needs to flatten an image, it will send a notification to the current lock owner requesting the flatten operation. The lock owner will asynchronously start the flatten operation by throttling X copy-up requests -- sending new requests as the old requests complete. Periodic progress updates and the final status will be sent to the requesting client.|
|Resize |Peer -> Lock Owner |When a client needs to resize an image, it will send a notification to the current lock owner requesting the resize operation. The lock owner will asynchronously start to discard objects (if shrinking) by throttling X discard requests -- sending new requests as the old requests complete. Periodic progress updates and the final status will be sent to the requesting client.|
|Snap Create |Peer -> Lock Owner |When a client needs to create a snapshot, it will send a notification to the current lock owner requesting the snapshot. The lock owner will flush its cache and create the snapshot upon request.|
|Snap Rollback | |Support not currently planned.|
|Async Progress Update |Lock Owner -> Peer |For long-running operations, the lock owner will send periodic progress updates to the requesting client.|
|Async Result |Lock Owner -> Peer |For long-running operations, the lock owner will send the final result to the requesting client.|
 
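The table above amounts to a small RPC-style protocol carried in watch/notify payloads on the rbd_header.<id> object. A hypothetical sketch of the message types and of how async requests could be correlated follows; names and values are illustrative, not the final wire format:

<pre><code class="cpp">
#include <cstdint>

// Hypothetical message types carried in the rbd_header.<id> notify payload,
// mirroring the operations in the table above.
enum class NotifyOp : uint32_t {
  ACQUIRED_LOCK  = 0,  // Lock Owner -> Peers
  REQUEST_LOCK   = 1,  // Peer -> Lock Owner
  RELEASED_LOCK  = 2,  // Lock Owner -> Peers
  HEADER_UPDATE  = 3,  // legacy header-changed broadcast
  FLATTEN        = 4,  // Peer -> Lock Owner (proxied maintenance op)
  RESIZE         = 5,  // Peer -> Lock Owner
  SNAP_CREATE    = 6,  // Peer -> Lock Owner
  ASYNC_PROGRESS = 7,  // Lock Owner -> requesting Peer
  ASYNC_COMPLETE = 8   // Lock Owner -> requesting Peer
};

// Long-running proxied requests carry an id so that progress and completion
// notifications can be matched back to the originating client.
struct AsyncRequestId {
  uint64_t client_gid;   // requesting client's global id
  uint64_t request_seq;  // per-client sequence number
};
</code></pre>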
 
h3. Work items
 
h4. Coding tasks
 
# http://tracker.ceph.com/issues/8900
** https://github.com/ceph/ceph/compare/wip-8900 (WIP Exclusive Locking)
# http://tracker.ceph.com/issues/8901
# http://tracker.ceph.com/issues/8902
** https://github.com/ceph/ceph/compare/wip-8902 (WIP Flatten/Resize/Snapshot Proxying)
# http://tracker.ceph.com/issues/8903
# http://tracker.ceph.com/issues/4087
** https://github.com/ceph/ceph/compare/wip-4087 (WIP Object Map)
# http://tracker.ceph.com/issues/7746
** https://github.com/dillaman/ceph/compare/wip-7746 (WIP RBD diff Object Map optimizations)
 
h3. Historical Notes 
 
There are two important things to do:
# Make the implementation of the ObjectMap (or Index) as durable as possible.
# Handle the effects of snapshots and live migration.
 
By Josh:
I think it's a great idea! We discussed this a little at the last cds [1]. I like the idea of the shared flag on an image. Since the vastly more common case is single-client, I'd go further and suggest that we treat images as if shared is false by default if the flag is not present (perhaps with a config option to change this default behavior).
 
That way existing images can benefit from the feature without extra configuration. There can be an rbd command to toggle the shared flag as well, so users of ocfs2 or gfs2 or other multi-client-writing systems can upgrade and set shared to true before restarting their clients.
 
Another thing to consider is the granularity of the object map. The coarse granularity of a bitmap of object existence would be simplest, and most useful for in-memory comparison for clones. For statistics it might be desirable in the future to have a finer-grained index of data existence in the image. To make that easy to handle, the on-disk format could be a list of extents (byte ranges).
 
Another potential use case would be a mode in which the index is treated as authoritative. This could make discard very fast, for example. I'm not sure it could be done safely with only binary 'exists/does not exist' information though - a third 'unknown' state might be needed for some cases. If this kind of index is actually useful (I'm not sure there are cases where the performance penalty would be worth it), we could add a new index format if we need it.
 
Back to the currently proposed design, to be safe with live migration we'd need to make sure the index is consistent in the destination process. Using rados_notify() after we set the clean flag on the index can make the destination vm re-read the index before any I/O happens. This might be a good time to introduce a data payload to the notify as well, so we can only re-read the index, instead of all the header metadata. Rereading the index after cache invalidation and wiring that up through qemu's bdrv_invalidate() would be even better.
 
h4. Build / release tasks
 
# Task 1
# Task 2
# Task 3
 
h4. Documentation tasks
 
# Task 1
# Task 2
# Task 3
 
h4. Deprecation tasks
 
# Task 1
# Task 2
# Task 3