Librbd - shared flag object map

Summary

We need to make a tradeoff between multi-client and single-client support in librbd. In practice, most volumes/images are used by VMs, so only one client accesses or modifies a given image. We should not make shared images possible at the cost of degrading the most common use cases. We can therefore add a new flag called "shared" when creating an image. If "shared" is false, librbd will maintain an object map for the image.

This feature has several clear advantages:
  1. Avoids the clone performance problem
  2. Makes snapshot statistics possible
  3. Improves librbd operation performance, including reads and copy-on-write operations

Owners

  • Haomai Wang (UnitedStack)
  • Josh Durgin (Red Hat)
  • Jason Dillaman (Red Hat)

Interested Parties

  • Name (Affiliation)
  • Name (Affiliation)
  • Name

Detailed Description

For non-shared images (such as those used by VMs), an object map will be constructed and maintained to track the current in-use state of each RADOS object within an image. For each object within an image, the state recorded in the object map will be either NON-EXISTENT, PENDING DELETE, or MAY EXIST. Images can be flagged as shared at creation time (create, import, clone, copy) to disable the use of the new object map optimizations.
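
The three states can be pictured as a small per-object table. The following is a minimal in-memory sketch, not librbd's implementation: the names `ObjectState` and `ObjectMap`, the numeric state values, and the two-bits-per-object packing are all assumptions made for illustration. The sizing example assumes 4 MiB objects.

```python
from enum import IntEnum


class ObjectState(IntEnum):
    # Numeric values are illustrative assumptions, not an on-disk format.
    NON_EXISTENT = 0
    MAY_EXIST = 1
    PENDING_DELETE = 2


class ObjectMap:
    """Hypothetical per-image map holding one 2-bit state per RADOS object."""

    BITS = 2
    PER_BYTE = 8 // BITS  # four object states fit in one byte

    def __init__(self, object_count):
        self.object_count = object_count
        # All bits zero means every object starts out as NON_EXISTENT.
        self._data = bytearray((object_count + self.PER_BYTE - 1) // self.PER_BYTE)

    def _locate(self, object_no):
        if not 0 <= object_no < self.object_count:
            raise IndexError(object_no)
        byte_index, slot = divmod(object_no, self.PER_BYTE)
        return byte_index, slot * self.BITS

    def get(self, object_no):
        byte_index, shift = self._locate(object_no)
        return ObjectState((self._data[byte_index] >> shift) & 0b11)

    def set(self, object_no, state):
        byte_index, shift = self._locate(object_no)
        cleared = self._data[byte_index] & ~(0b11 << shift) & 0xFF
        self._data[byte_index] = cleared | (int(state) << shift)


# A 1 GiB image split into 4 MiB objects has 256 objects, so 64 bytes of map.
object_map = ObjectMap((1 << 30) // (4 << 20))
object_map.set(5, ObjectState.MAY_EXIST)
```

With this packing the map stays small enough to hold in memory and compare cheaply, which is the property the clone and statistics use cases rely on.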

IO write operations will update the object map state to MAY EXIST prior to submitting the write request to RADOS. Since this operation will only be invoked once for a given object upon state change, the latency cost for the extra operation should be negligible. IO read operations will check the object map for MAY EXIST objects to determine if a RADOS read op is required. IO delete operations (trims, discards, etc) will bulk-update all objects flagged as PENDING DELETE or MAY EXIST to PENDING DELETE prior to submitting the delete request to RADOS, followed by updating the object map to NON-EXISTENT afterwards.
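
The ordering of map updates relative to RADOS requests can be sketched as below. This is a toy model under stated assumptions: `FakeRados` and `Image` are hypothetical stand-ins rather than librbd classes, the map is a plain Python list, and a skipped read simply returns `None`.

```python
NON_EXISTENT, MAY_EXIST, PENDING_DELETE = "NON-EXISTENT", "MAY EXIST", "PENDING DELETE"


class FakeRados:
    """Stand-in for the object store; records which ops were issued."""

    def __init__(self):
        self.objects = {}
        self.ops = []

    def write(self, object_no, data):
        self.ops.append(("write", object_no))
        self.objects[object_no] = data

    def read(self, object_no):
        self.ops.append(("read", object_no))
        return self.objects.get(object_no)

    def remove(self, object_no):
        self.ops.append(("remove", object_no))
        self.objects.pop(object_no, None)


class Image:
    def __init__(self, object_count, rados):
        self.rados = rados
        self.object_map = [NON_EXISTENT] * object_count
        self.map_updates = 0  # counts object map changes that would be persisted

    def _update_map(self, object_nos, state):
        changed = [n for n in object_nos if self.object_map[n] != state]
        for n in changed:
            self.object_map[n] = state
        if changed:
            self.map_updates += 1  # one bulk update, however many objects

    def write(self, object_no, data):
        # The map is updated first, and only when the state actually changes.
        self._update_map([object_no], MAY_EXIST)
        self.rados.write(object_no, data)

    def read(self, object_no):
        # Only MAY EXIST objects need a RADOS read op.
        if self.object_map[object_no] != MAY_EXIST:
            return None
        return self.rados.read(object_no)

    def discard(self, first, last):
        targets = [n for n in range(first, last + 1)
                   if self.object_map[n] in (MAY_EXIST, PENDING_DELETE)]
        self._update_map(targets, PENDING_DELETE)  # before the deletes
        for n in targets:
            self.rados.remove(n)
        self._update_map(targets, NON_EXISTENT)    # after the deletes


rados = FakeRados()
image = Image(8, rados)
image.write(3, b"a")
image.write(3, b"b")   # state is already MAY EXIST, so no second map update
updates_after_writes = image.map_updates
image.read(0)          # object 0 is NON-EXISTENT, so no RADOS read is issued
image.discard(0, 7)    # only object 3 is touched
```

Marking objects PENDING DELETE before the delete is sent means an interrupted discard leaves the map conservative: an object may be recorded as possibly present when it is gone, but never the reverse.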

The use of the object map will require an exclusive lock on the image to prevent two or more clients from manipulating the same image. This exclusive lock will be handled as a new RBD feature bit to prevent older, incompatible clients from attempting to access an image using the new exclusive lock functionality. The new lock will be associated with the rbd_header.<id> object for the image so that it is compatible with / subsumes the current cooperative RBD locking functionality. The new locking functionality will also be utilized by the future RBD mirroring feature.
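
Gating access on a feature bit can be sketched as a simple mask check. The bit positions and the helper names below are assumptions chosen for illustration; they are not claimed to match librbd's actual feature assignments.

```python
# Bit values are illustrative assumptions, not necessarily librbd's assignments.
FEATURE_LAYERING = 1 << 0
FEATURE_EXCLUSIVE_LOCK = 1 << 2


def unsupported_features(image_features, client_supported):
    """Return the feature bits set on the image that the client does not know."""
    return image_features & ~client_supported


def can_open(image_features, client_supported):
    # A client must refuse the image if any required feature is unknown to it.
    return unsupported_features(image_features, client_supported) == 0


old_client = FEATURE_LAYERING
new_client = FEATURE_LAYERING | FEATURE_EXCLUSIVE_LOCK
image_features = FEATURE_LAYERING | FEATURE_EXCLUSIVE_LOCK

old_client_allowed = can_open(image_features, old_client)
new_client_allowed = can_open(image_features, new_client)
```

Here the older client is turned away because the image carries a bit it does not understand, which is what keeps it from writing without taking the lock.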

Clients attempting to perform image maintenance operations (i.e. resize, snapshot, flatten) will proxy their requests to the client currently holding the exclusive lock on the image. This will be accomplished through the use of watch/notify events against the rbd_header.<id> object. RBD currently uses this object to notify other clients of RBD header updates. This functionality will be expanded to allow clients to send requests to the current exclusive lock holder.
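
The proxying rule can be sketched as follows: a client runs the operation itself when it holds the lock, and otherwise sends a request that only the lock owner acts on. `HeaderChannel` and `Client` are hypothetical names; the channel is an in-process stand-in for watch/notify on the header object, not the librados API.

```python
class HeaderChannel:
    """In-process stand-in for watch/notify on the image header object."""

    def __init__(self):
        self.watchers = []
        self.lock_owner = None

    def watch(self, client):
        self.watchers.append(client)

    def notify(self, sender, message):
        # Deliver to every other watcher and collect their replies.
        return [w.handle_notify(message) for w in self.watchers if w is not sender]


class Client:
    def __init__(self, name, channel):
        self.name = name
        self.channel = channel
        self.log = []
        channel.watch(self)

    def owns_lock(self):
        return self.channel.lock_owner is self

    def acquire_lock(self):
        if self.channel.lock_owner is None:
            self.channel.lock_owner = self
        return self.owns_lock()

    def handle_notify(self, message):
        # Only the lock owner acts on maintenance requests; peers ignore them.
        if not self.owns_lock():
            return None
        return self._execute(message["op"], message["args"])

    def _execute(self, op, args):
        self.log.append((op, args))
        return f"{op} done by {self.name}"

    def request(self, op, **args):
        if self.owns_lock():
            return self._execute(op, args)
        replies = self.channel.notify(self, {"op": op, "args": args})
        return next((r for r in replies if r is not None), None)


channel = HeaderChannel()
owner, peer = Client("owner", channel), Client("peer", channel)
owner.acquire_lock()
result = peer.request("snap_create", snap_name="before-upgrade")
```

In this sketch `result` comes back from the owner, and the peer never executes the operation locally.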

Operation | Direction | Notes
Exclusive Lock Acquired | Lock Owner -> Peers | When a new client acquires the exclusive lock for an image, it will broadcast this notification to all other clients with the same image open. This will allow other clients to gracefully retry pending requests.
Exclusive Lock Request (IO write/discard ops) | Peer -> Lock Owner | When a client needs to modify the image and another client already holds the lock to the image, the new client can send a request to the current owner to gracefully transfer the lock. Live migration of a VM is one possible use-case.
Exclusive Lock Release | Lock Owner -> Peers | When the current lock owner releases the lock, it broadcasts a notification to all peers so that they can attempt to acquire the lock (if needed).
Header Update | Peer -> Peer | Support for the legacy header update notification.
Flatten | Peer -> Lock Owner | When a client needs to flatten an image, it will send a notification to the current lock owner requesting the flattening operation. The lock owner will asynchronously start the flatten operation by throttling X copy-up requests -- sending new requests as the old requests complete. Periodic progress updates and the final status will be sent to the requesting client.
Resize | Peer -> Lock Owner | When a client needs to resize an image, it will send a notification to the current lock owner requesting the resize operation. The lock owner will asynchronously start to discard objects (if shrinking) by throttling X discard requests -- sending new requests as the old requests complete. Periodic progress updates and the final status will be sent to the requesting client.
Snap Create | Peer -> Lock Owner | When a client needs to create a snapshot, it will send a notification to the current lock owner requesting the snapshot. The lock owner will flush its cache and create the snapshot upon request.
Snap Rollback | - | Support not currently planned.
Async Progress Update | Lock Owner -> Peer | For long-running operations, the lock owner will send periodic progress updates to the requesting client.
Async Result | Lock Owner -> Peer | For long-running operations, the lock owner will send the final result to the requesting client.
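
The throttling described for flatten and resize (keep at most X requests in flight, send a new one as each old one completes, and report progress) can be sketched with asyncio. `run_throttled`, `fake_copy_up`, and the progress callback are hypothetical names for this illustration; the real work would be RADOS copy-up or discard requests.

```python
import asyncio


async def run_throttled(object_nos, op, max_in_flight, on_progress):
    """Run op(object_no) for every object with at most max_in_flight pending."""
    total = len(object_nos)
    pending = set()
    done_count = 0
    queue = iter(object_nos)

    def refill():
        # Top up the in-flight set until the window is full or work runs out.
        while len(pending) < max_in_flight:
            object_no = next(queue, None)
            if object_no is None:
                return
            pending.add(asyncio.ensure_future(op(object_no)))

    refill()
    while pending:
        finished, _ = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in finished:
            pending.discard(task)
            task.result()  # re-raise any failure from the request
            done_count += 1
        on_progress(done_count, total)  # progress update for the requester
        refill()
    return done_count


stats = {"in_flight": 0, "peak": 0}


async def fake_copy_up(object_no):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0)  # stand-in for the RADOS round trip
    stats["in_flight"] -= 1


updates = []
completed = asyncio.run(
    run_throttled(list(range(10)), fake_copy_up, 3,
                  lambda done, total: updates.append((done, total)))
)
```

The final entry in `updates` plays the role of the Async Result message, and the earlier entries play the role of Async Progress Update messages.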

Work items

Coding tasks

  1. http://tracker.ceph.com/issues/8900
  2. http://tracker.ceph.com/issues/8901
  3. http://tracker.ceph.com/issues/8902
  4. http://tracker.ceph.com/issues/8903
  5. http://tracker.ceph.com/issues/4087
  6. http://tracker.ceph.com/issues/7746

Historical Notes

There are two important things to do:
  1. Implement the ObjectMap (or Index) and make it as durable as possible.
  2. Handle the effects of snapshots and live migration.

By Josh:
I think it's a great idea! We discussed this a little at the last cds
[1]. I like the idea of the shared flag on an image. Since the vastly
more common case is single-client, I'd go further and suggest that
we treat images as if shared is false by default if the flag is not
present (perhaps with a config option to change this default behavior).

That way existing images can benefit from the feature without extra
configuration. There can be an rbd command to toggle the shared flag as
well, so users of ocfs2 or gfs2 or other multi-client-writing systems
can upgrade and set shared to true before restarting their clients.

Another thing to consider is the granularity of the object map. The
coarse granularity of a bitmap of object existence would be simplest,
and most useful for in-memory comparison for clones. For statistics
it might be desirable in the future to have a finer-grained index of
data existence in the image. To make that easy to handle, the on-disk
format could be a list of extents (byte ranges).

Another potential use case would be a mode in which the index is
treated as authoritative. This could make discard very fast, for
example. I'm not sure it could be done safely with only binary
'exists/does not exist' information though - a third 'unknown' state
might be needed for some cases. If this kind of index is actually useful
(I'm not sure there are cases where the performance penalty would be
worth it), we could add a new index format if we need it.

Back to the currently proposed design, to be safe with live migration
we'd need to make sure the index is consistent in the destination
process. Using rados_notify() after we set the clean flag on the index
can make the destination vm re-read the index before any I/O
happens. This might be a good time to introduce a data payload to the
notify as well, so we can only re-read the index, instead of all the
header metadata. Rereading the index after cache invalidation and wiring
that up through qemu's bdrv_invalidate() would be even better.

Build / release tasks

  1. Task 1
  2. Task 2
  3. Task 3

Documentation tasks

  1. Task 1
  2. Task 2
  3. Task 3

Deprecation tasks

  1. Task 1
  2. Task 2
  3. Task 3