
h1. Librbd - shared flag object map 

 h3. Summary 

We need to consider the tradeoff between multi-client and single-client support in librbd. In practice, most volumes/images are used by a VM, so only one client will access or modify a given image at a time. We should not enable shared images at the cost of degrading the most common use cases. To that end, we can add a new flag called "shared" when creating an image. If "shared" is false, librbd will maintain an object map for the image.
 
The advantages of this feature are easy to see:
# Avoids the clone performance problem
# Makes snapshot statistics possible
# Improves the performance of librbd operations, including reads and copy-on-write
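
A rough sketch of the idea follows, assuming a hypothetical creation option (none of the names below are existing librbd API): a non-shared image allocates one object-map entry per RADOS object at creation time, while a shared image skips the map entirely.

<pre><code class="cpp">
// Illustrative sketch only -- not the librbd API.
#include <cstdint>
#include <iostream>
#include <vector>

struct CreateOptions {
  uint64_t size_bytes;
  uint8_t  order;    // object size is 2^order bytes (22 -> 4 MiB)
  bool     shared;   // proposed flag; false for the common single-client case
};

struct Image {
  CreateOptions opts;
  std::vector<uint8_t> object_map;  // empty when the image is shared
};

Image create_image(const CreateOptions& opts) {
  Image img{opts, {}};
  if (!opts.shared) {
    uint64_t object_size = 1ull << opts.order;
    uint64_t num_objects = (opts.size_bytes + object_size - 1) / object_size;
    img.object_map.assign(num_objects, 0);  // all objects start NON-EXISTENT
  }
  return img;
}

int main() {
  Image vm_disk = create_image({10ull << 30, 22, /*shared=*/false});
  std::cout << "tracked objects: " << vm_disk.object_map.size() << "\n";
}
</code></pre>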
 
 h3. Owners 

 * Haomai Wang (UnitedStack) 
 * Josh Durgin (Red Hat) 
 * Jason Dillaman (Red Hat) 

 h3. Interested Parties 

 * Name (Affiliation) 
 * Name (Affiliation) 
 * Name 

 h3. Detailed Description 

For non-shared images (such as VM disks), an object map will be constructed and maintained to track the current in-use state of each RADOS object within an image. For each object within an image, the state in the object map will be either NON-EXISTENT, PENDING DELETE, or MAY EXIST. Images can be flagged as shared at creation time (create, import, clone, copy) to disable the use of the new object map optimizations.
 
IO write operations will update the object map state to MAY EXIST prior to submitting the write request to RADOS. Since this operation will only be invoked once for a given object upon state change, the latency cost of the extra operation should be negligible. IO read operations will check the object map for MAY EXIST objects to determine whether a RADOS read op is required. IO delete operations (trims, discards, etc.) will bulk-update all objects flagged as PENDING DELETE or MAY EXIST to PENDING DELETE prior to submitting the delete request to RADOS, followed by updating the object map to NON-EXISTENT afterwards.
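
A minimal sketch of this per-object state machine, assuming a simple in-memory representation (the real map would be persisted and kept crash-safe):

<pre><code class="cpp">
#include <cstdint>
#include <iostream>
#include <vector>

enum class ObjectState : uint8_t {
  NON_EXISTENT,    // never written, or deleted and confirmed
  PENDING_DELETE,  // delete issued to RADOS but not yet confirmed
  MAY_EXIST        // written at some point; may hold data
};

struct ObjectMap {
  std::vector<ObjectState> states;

  // Write path: flag the object *before* the RADOS write is submitted, so a
  // crash can never leave data in an object the map claims does not exist.
  void pre_write(uint64_t objno) { states[objno] = ObjectState::MAY_EXIST; }

  // Read path: a RADOS read op is only needed if the object may exist.
  bool read_required(uint64_t objno) const {
    return states[objno] == ObjectState::MAY_EXIST;
  }

  // Delete/trim path: bulk-mark the range PENDING_DELETE before issuing the
  // RADOS deletes, then NON_EXISTENT once they have completed.
  void pre_delete(uint64_t begin, uint64_t end) {
    for (uint64_t i = begin; i < end; ++i)
      if (states[i] != ObjectState::NON_EXISTENT)
        states[i] = ObjectState::PENDING_DELETE;
  }
  void post_delete(uint64_t begin, uint64_t end) {
    for (uint64_t i = begin; i < end; ++i)
      states[i] = ObjectState::NON_EXISTENT;
  }
};

int main() {
  ObjectMap map{std::vector<ObjectState>(4, ObjectState::NON_EXISTENT)};
  map.pre_write(1);
  std::cout << "read object 0? " << map.read_required(0)
            << ", object 1? " << map.read_required(1) << "\n";
}
</code></pre>

The important ordering property is that an object is flagged MAY EXIST before its first write and only returns to NON-EXISTENT after its delete completes, so the map may over-report but never under-report object existence.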
 
The use of the object map will require an exclusive lock on the image to prevent two or more clients from manipulating the same image. This exclusive lock will be handled as a new RBD feature bit to prevent older, incompatible clients from attempting to access an image using the new exclusive lock functionality. The new lock will be associated with the rbd_header.<id> object for the image so that it is compatible with / subsumes the current cooperative RBD locking functionality. The new locking functionality will also be utilized by the future RBD mirroring feature.
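
For illustration, a hedged sketch of the feature-bit gating; the bit values below are placeholders, not the real assignments:

<pre><code class="cpp">
#include <cstdint>
#include <iostream>
#include <stdexcept>

// Placeholder feature bits for illustration only.
constexpr uint64_t FEATURE_LAYERING       = 1ull << 0;
constexpr uint64_t FEATURE_EXCLUSIVE_LOCK = 1ull << 2;  // proposed new bit

// Features this (new) client build understands.
constexpr uint64_t SUPPORTED_FEATURES = FEATURE_LAYERING | FEATURE_EXCLUSIVE_LOCK;

void open_image(uint64_t image_features) {
  uint64_t unsupported = image_features & ~SUPPORTED_FEATURES;
  if (unsupported != 0) {
    // An older client that lacks exclusive-lock support refuses to open the
    // image instead of racing with the lock owner and corrupting the map.
    throw std::runtime_error("image uses unsupported features");
  }
  std::cout << "image opened\n";
}

int main() {
  open_image(FEATURE_LAYERING | FEATURE_EXCLUSIVE_LOCK);
}
</code></pre>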
 
Clients attempting to perform image maintenance operations (e.g. resize, snapshot, flatten) will proxy their requests to the client currently holding the exclusive lock on the image. This will be accomplished through the use of watch/notify events against the rbd_header.<id> object. RBD currently uses this object to notify other clients of RBD header updates. This functionality will be expanded to allow clients to send requests to the current exclusive lock holder.
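
A rough sketch of the proxy decision, with a plain callback standing in for the watch/notify transport on the rbd_header.<id> object (the request encoding and names are hypothetical):

<pre><code class="cpp">
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

struct MaintenanceRequest {
  std::string op;       // e.g. "resize", "snap_create", "flatten"
  uint64_t    arg = 0;  // e.g. the new size in bytes
};

class ImageCtx {
public:
  explicit ImageCtx(bool lock_owner) : lock_owner_(lock_owner) {}

  // Stand-in for a notify sent against the rbd_header.<id> object.
  std::function<void(const MaintenanceRequest&)> notify_lock_owner;

  void request(const MaintenanceRequest& req) {
    if (lock_owner_) {
      execute_locally(req);   // we hold the exclusive lock
    } else {
      notify_lock_owner(req); // proxy the request to the current lock owner
    }
  }

private:
  void execute_locally(const MaintenanceRequest& req) {
    std::cout << "executing " << req.op << " locally\n";
  }
  bool lock_owner_;
};

int main() {
  ImageCtx owner(true), peer(false);
  peer.notify_lock_owner = [&](const MaintenanceRequest& r) { owner.request(r); };
  peer.request({"resize", 20ull << 30});  // routed to the lock owner
}
</code></pre>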
 
|Operation |Direction |Notes|
|Exclusive Lock Acquired |Lock Owner -> Peers |When a new client acquires the exclusive lock for an image, it will broadcast this notification to all other clients with the same image open. This will allow other clients to gracefully retry pending requests.|
|Exclusive Lock Request (IO write/discard ops) |Peer -> Lock Owner |When a client needs to modify the image and another client already holds the lock to the image, the new client can send a request to the current owner to gracefully transfer the lock. Live migration of a VM is one possible use case.|
|Exclusive Lock Release |Lock Owner -> Peers |When the current lock owner releases the lock, it broadcasts a notification to all peers so that they can attempt to acquire the lock (if needed).|
|Header Update |Peer -> Peer |Support for the legacy header update notification.|
|Flatten |Peer -> Lock Owner |When a client needs to flatten an image, it will send a notification to the current lock owner requesting the flatten operation. The lock owner will asynchronously start the flatten operation by throttling X copy-up requests -- sending new requests as the old requests complete. Periodic progress updates and the final status will be sent to the requesting client.|
|Resize |Peer -> Lock Owner |When a client needs to resize an image, it will send a notification to the current lock owner requesting the resize operation. The lock owner will asynchronously start to discard objects (if shrinking) by throttling X discard requests -- sending new requests as the old requests complete. Periodic progress updates and the final status will be sent to the requesting client.|
|Snap Create |Peer -> Lock Owner |When a client needs to create a snapshot, it will send a notification to the current lock owner requesting the snapshot. The lock owner will flush its cache and create the snapshot upon request.|
|Snap Rollback | |Support not currently planned.|
|Async Progress Update |Lock Owner -> Peer |For long-running operations, the lock owner will send periodic progress updates to the requesting client.|
|Async Result |Lock Owner -> Peer |For long-running operations, the lock owner will send the final result to the requesting client.|
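
The flatten and resize rows above share a throttling pattern: keep at most X requests in flight, issue a new one as each completes, and report progress periodically. A simplified, synchronous sketch of that pattern follows (the real implementation would use async RADOS completions):

<pre><code class="cpp">
#include <cstdint>
#include <functional>
#include <iostream>

// Runs `total` copy-up/discard requests with at most `max_in_flight`
// outstanding, issuing a new request as each one completes and invoking
// `progress` so the lock owner can notify the requesting peer.
void run_throttled(uint64_t total, uint64_t max_in_flight,
                   const std::function<void(uint64_t)>& submit,
                   const std::function<void(uint64_t, uint64_t)>& progress) {
  uint64_t next = 0, done = 0, in_flight = 0;
  while (done < total) {
    while (in_flight < max_in_flight && next < total) {
      submit(next++);           // in librbd this would be an async RADOS op
      ++in_flight;
    }
    --in_flight;                // sketch: one request "completes" per loop
    progress(++done, total);    // periodic progress back to the requester
  }
}

int main() {
  run_throttled(8, 2,
      [](uint64_t objno) { std::cout << "copy-up object " << objno << "\n"; },
      [](uint64_t done, uint64_t total) {
        std::cout << "progress " << done << "/" << total << "\n";
      });
}
</code></pre>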

 h3. Work items 

Coding tasks

* http://tracker.ceph.com/issues/8900
* https://github.com/ceph/ceph/compare/wip-8900 (WIP Exclusive Locking)
* http://tracker.ceph.com/issues/8901
* http://tracker.ceph.com/issues/8902
* https://github.com/ceph/ceph/compare/wip-8902 (WIP Flatten/Resize/Snapshot Proxying)
* http://tracker.ceph.com/issues/8903
* http://tracker.ceph.com/issues/4087
* https://github.com/ceph/ceph/compare/wip-4087 (WIP Object Map)
* http://tracker.ceph.com/issues/7746
* https://github.com/dillaman/ceph/compare/wip-7746 (WIP RBD diff Object Map optimizations)
Historical Notes

There are two important things to do:
# The implementation of the ObjectMap (or index): we need to make it as durable as possible.
# Handling the effects of snapshots and live migration.
 
 By Josh: 
 I think it's a great idea! We discussed this a little at the last cds 
 [1]. I like the idea of the shared flag on an image. Since the vastly 
 more common case is single-client, I'd go further and suggest that 
 we treat images as if shared is false by default if the flag is not 
 present (perhaps with a config option to change this default behavior). 

 That way existing images can benefit from the feature without extra 
 configuration. There can be an rbd command to toggle the shared flag as 
 well, so users of ocfs2 or gfs2 or other multi-client-writing systems 
 can upgrade and set shared to true before restarting their clients. 

 Another thing to consider is the granularity of the object map. The 
 coarse granularity of a bitmap of object existence would be simplest, 
 and most useful for in-memory comparison for clones. For statistics 
 it might be desirable in the future to have a finer-grained index of 
 data existence in the image. To make that easy to handle, the on-disk 
 format could be a list of extents (byte ranges). 

 Another potential use case would be a mode in which the index is 
 treated as authoritative. This could make discard very fast, for 
 example. I'm not sure it could be done safely with only binary 
 'exists/does not exist' information though - a third 'unknown' state 
 might be needed for some cases. If this kind of index is actually useful 
 (I'm not sure there are cases where the performance penalty would be 
 worth it), we could add a new index format if we need it. 

 Back to the currently proposed design, to be safe with live migration 
 we'd need to make sure the index is consistent in the destination 
 process. Using rados_notify() after we set the clean flag on the index 
 can make the destination vm re-read the index before any I/O 
 happens. This might be a good time to introduce a data payload to the 
 notify as well, so we can only re-read the index, instead of all the 
 header metadata. Rereading the index after cache invalidation and wiring 
 that up through qemu's bdrv_invalidate() would be even better. 
Build / release tasks

* Task 1
* Task 2
* Task 3

Documentation tasks

* Task 1
* Task 2
* Task 3

Deprecation tasks

* Task 1
* Task 2
* Task 3