OSD - clone from journal on btrfs


The OSD normally writes each update twice: once to the journal, and then again to the backing file system. If the backing file system is btrfs and the journal is a file on that same btrfs file system, we can avoid the second write by cloning the data of large writes from the journal into their final objects.


  • Samuel Just (Inktank)

Interested Parties

  • Sage Weil (Inktank)
  • Samuel Just (Inktank)
  • Mark Nelson (Inktank)
  • Haomai Wang (UnitedStack)
  • Anip Patel (Arizona State University)

Current Status

Journal events are currently opaque lumps of data.
Journaling on btrfs is usually done in 'parallel' mode, which means the journal write and the file system write are queued at the same time. Cloning from the journal probably requires that the journal write complete before the object write is applied.
The journal completion logic cannot currently tell where in the journal file the data portion of an event ended up.

Detailed Description

Rather than writing the data a second time to the object file, we will clone the corresponding range of the journal file into the object.
Because the data must already be in the journal file before it can be cloned, we need to use writeahead journaling when this feature is enabled, either for the entire store or only for the events (writes) that we want to clone.
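
The clone step described above could look roughly like the following sketch. It is written in Python for brevity and is not Ceph code. It assumes Linux and the generic FICLONERANGE ioctl from <linux/fs.h>, which shares extents between two files on the same file system where that is supported (btrfs being the case of interest here). The helper name clone_or_copy and the fallback to an ordinary copy are illustrative assumptions.

```python
import fcntl
import os
import struct

# FICLONERANGE = _IOW(0x94, 13, struct file_clone_range), from <linux/fs.h>.
FICLONERANGE = 0x4020940D


def clone_or_copy(src_fd, src_off, length, dst_fd, dst_off):
    """Place `length` bytes of src_fd at dst_off in dst_fd.

    Returns "clone" if the range was shared without rewriting the data,
    or "copy" if the ordinary second write had to be done instead.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    # struct file_clone_range { s64 src_fd; u64 src_offset, src_length, dest_offset; }
    arg = struct.pack("=qQQQ", src_fd, src_off, length, dst_off)
    try:
        fcntl.ioctl(dst_fd, FICLONERANGE, arg)
        return "clone"
    except OSError:
        # Cloning is unavailable (wrong file system, unaligned range, ...):
        # fall back to reading the journal range and writing it out again.
        done = 0
        while done < length:
            chunk = os.pread(src_fd, min(length - done, 1 << 20), src_off + done)
            if not chunk:
                raise OSError("journal range extends past end of file")
            written = 0
            while written < len(chunk):
                written += os.pwrite(dst_fd, chunk[written:], dst_off + done + written)
            done += len(chunk)
        return "copy"
```

In this sketch the fallback keeps the result identical whether or not the clone succeeds, so a caller only needs to know that the object contains the data afterwards.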

Work items

Coding tasks

  1. track offset, length of data portion in the journal event metadata
  2. record final location in journal for the data portion
  3. pass final location to journal completion handler
  4. allow the completion handler to do a clone instead of the normal write if certain conditions are met (for example, the write is larger than some minimum size)
  5. ensure that journal replay still performs the complete write
  6. consider a hybrid parallel/writeahead approach where large writes go to journal and then fs, while small writes are still done in parallel.
  7. modify ceph-deploy or other tools to use journal files when the backing fs is btrfs (instead of a separate partition)
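
Coding tasks 1 through 6 can be pictured with the small sketch below, written in Python for brevity. Every name in it (JournalEvent, choose_mode, should_clone, MIN_CLONE_BYTES, BLOCK_SIZE) is hypothetical and not part of Ceph, and the threshold and block size are placeholder values. The alignment check is an assumption based on range clones generally requiring block-aligned offsets, and skipping the clone during replay is only one conservative reading of task 5.

```python
from dataclasses import dataclass
from typing import Optional

BLOCK_SIZE = 4096             # assumed file system block size
MIN_CLONE_BYTES = 64 * 1024   # hypothetical "minimum size" threshold


@dataclass
class JournalEvent:
    """Hypothetical event metadata; the field names are not Ceph's."""
    seq: int
    data_offset: int                    # task 1: data offset inside the encoded event
    data_length: int                    # task 1: length of the data portion
    journal_pos: Optional[int] = None   # task 2: where the event landed in the journal file

    def data_location(self):
        """Task 3: absolute journal-file offset of the data, once it is known."""
        if self.journal_pos is None:
            return None
        return self.journal_pos + self.data_offset


def choose_mode(ev):
    """Task 6: large writes go writeahead so they can be cloned; small ones stay parallel."""
    return "writeahead" if ev.data_length >= MIN_CLONE_BYTES else "parallel"


def should_clone(ev, object_offset, replaying=False):
    """Tasks 4 and 5: clone only large, aligned, journaled data, and never on replay."""
    loc = ev.data_location()
    if replaying or loc is None:
        return False
    if ev.data_length < MIN_CLONE_BYTES:
        return False
    return all(v % BLOCK_SIZE == 0 for v in (loc, object_offset, ev.data_length))
```

A completion handler built along these lines would call should_clone once the journal write has finished and journal_pos has been filled in, and fall back to the normal write whenever it returns False.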

Build / release tasks

  1. do performance tests to confirm this is a significant improvement
  2. expand rados test matrix to include all journaling modes

Documentation tasks

  1. document the option
  2. document the internals in the internals section