RBD - Mirroring


Create a metadata or data journal for RBD for async replication/DR purposes.


  • Josh Durgin (Red Hat)

Interested Parties

  • Sage Weil (Inktank)
  • Haomai Wang (UnitedStack)

Current Status

You can create point-in-time snapshots of an entire image, and get a delta between two snapshots, but that delta requires a scan of all image objects to generate. This limits its usefulness for DR purposes since it is generally not practical to pay for that scan at frequent intervals.

Detailed Description

There are a few use cases to capture here:
  1. full data journal
    1. Allow a replica image in another DC or cluster to stream updates in real time. Because the data journal would be time-ordered, the replica would also be a fully coherent point-in-time snapshot.
    2. Stream updates to a replica with a time delay. This would be useful for coping with operator error or failures above the block layer by providing access to data that was recently overwritten or deleted.
    3. [optional] On journal roll-forward/apply, generate a reverse-direction rollback journal so that the replica image could also be scrubbed backwards in time. (This requires a read before write and is likely impractical on the source/master image, but may be practical on the slave/backup image.)
  2. metadata journal
    1. Accelerate the current 'delta' API that drives the incremental diffs so that it costs O(number of writes) instead of O(number of objects)
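
To illustrate why a metadata journal would change the cost of a delta, here is a minimal sketch. It assumes a hypothetical entry format that records only the offset and length of each write, and a fixed 4 MiB object size; `MetaEntry` and `dirty_objects` are illustrative names, not part of librbd.

```python
from dataclasses import dataclass

OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size; real images may differ


@dataclass(frozen=True)
class MetaEntry:
    """Hypothetical metadata journal entry: where a write landed, not its data."""
    offset: int
    length: int


def dirty_objects(entries, object_size=OBJECT_SIZE):
    """Return the sorted indexes of objects touched by the journaled writes.

    The loop visits journal entries (and the objects each write spans),
    never the full set of objects backing the image.
    """
    dirty = set()
    for e in entries:
        if e.length <= 0:
            continue  # nothing written, nothing dirtied
        first = e.offset // object_size
        last = (e.offset + e.length - 1) // object_size
        dirty.update(range(first, last + 1))
    return sorted(dirty)
```

A diff between two points in time would then only need to read the objects returned for the entries journaled between those points, rather than scanning every object in the image.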
The basic design:
  • associate a journal with an RBD image.
    • each journal entry represents an IO operation
    • include a timestamp and any other potentially useful metadata
    • stripe the journal over objects using something similar to Journaler
    • [optional] allow the journal to live in a different pool (e.g., one that is flash-backed)
  • if the image writer understands the feature and it is enabled,
    • apply every write first to the journal, then to the device
    • acknowledge the write as committed either:
      • after journal commit (default)
      • after journal and base image (in case the journal is, say, stored in a less-durable but higher-performing pool)
    • [optional] before applying a journaled write, copy the data we are about to overwrite to a second rollback journal
    • on open, replay recent journal operations
    • periodically update a journal position pointer in the rbd image header (to limit replays on open)
  • on read, check the in-memory cache of in-flight (journaling but uncommitted) writes to preserve basic read/write consistency
    • (in reality this should be very rare, since no sane block user would read from a block for which a write is currently in flight)
  • create a 'slave' function that watches the tail of the journal
    • when there is a remote write, apply it locally
      • depending on the local image properties, this may or may not get journaled locally; leave that to the user
    • [optional] add a time delay (e.g., 1 hour) between the journaled write and applying it locally
    • [optional] update the source image with metadata about our replication state, since the master may want to control trimming based on our progress instead of using a simple time delay
  • periodically trim the journal based on time and/or size
  • settle on initial functionality
  • define user interface (librbd, rbd CLI) to fully capture user stories
  • translate to dev tasks
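
The write path, replay on open, delayed replica, and trimming described in the basic design above can be sketched with an in-memory model. This is a simplification under stated assumptions: the class and method names (`Entry`, `JournaledImage`, `Replica`, `poll`) are hypothetical and not librbd interfaces, the journal is a plain list rather than striped objects, and the position pointer is updated on every apply instead of periodically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical journal entry: one write IO plus a timestamp."""
    seq: int
    timestamp: float
    offset: int
    data: bytes


def apply_entry(image, entry):
    """Copy the journaled data into the image buffer."""
    end = entry.offset + len(entry.data)
    if end > len(image):
        raise ValueError("write extends past end of image")
    image[entry.offset:end] = entry.data


class JournaledImage:
    """In-memory stand-in for an image with an attached journal."""

    def __init__(self, size):
        self.image = bytearray(size)
        self.journal = []
        self.next_seq = 0
        self.commit_pos = 0  # entries with seq < commit_pos are in the image

    def write(self, offset, data, now, apply=True):
        """Journal the write first, then apply it to the image.

        apply=False simulates a crash after journal commit but before
        the base image was updated.
        """
        entry = Entry(self.next_seq, now, offset, bytes(data))
        self.next_seq += 1
        self.journal.append(entry)
        if apply:
            self._apply(entry)

    def _apply(self, entry):
        apply_entry(self.image, entry)
        self.commit_pos = entry.seq + 1  # simplification: updated every time

    def replay(self):
        """On open: re-apply entries at or after the position pointer."""
        pending = [e for e in self.journal if e.seq >= self.commit_pos]
        for entry in pending:
            self._apply(entry)
        return len(pending)

    def trim(self, now, max_age):
        """Drop entries that are already applied and older than max_age."""
        self.journal = [e for e in self.journal
                        if e.seq >= self.commit_pos
                        or now - e.timestamp <= max_age]


class Replica:
    """Follows a source journal, optionally lagging behind by `delay` seconds."""

    def __init__(self, size, delay=0.0):
        self.image = bytearray(size)
        self.delay = delay
        self.applied = 0  # next sequence number to apply

    def poll(self, source_journal, now):
        for entry in source_journal:
            if entry.seq < self.applied:
                continue
            if now - entry.timestamp < self.delay:
                break  # entries are time-ordered, so later ones are newer still
            apply_entry(self.image, entry)
            self.applied = entry.seq + 1
```

In this model the trim policy looks only at local state and entry age. A design in which the master trims based on replica progress would additionally need the replica's `applied` position reported back to the source, as the optional item above suggests.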

Work items

Coding tasks

  1. Task 1
  2. Task 2
  3. Task 3

Build / release tasks

  1. Task 1
  2. Task 2
  3. Task 3

Documentation tasks

  1. Task 1
  2. Task 2
  3. Task 3

Deprecation tasks

  1. Task 1
  2. Task 2
  3. Task 3