Osd - ceph on zfs


Allow ceph-osd to make better use of ZFS's capabilities.


  • Sage Weil (Inktank)

Interested Parties

  • Sage Weil (Inktank)
  • Mark Nelson (Inktank)
  • Yan, Zheng (Intel)
  • Haomai Wang (UnitedStack)
  • Wido den Hollander (42on)
  • Eric Eastman (Keeper Technology)
  • Daniele Stroppa (ZHAW)
  • Sam Zaydel (RackTop Systems)
  • Sam Just (Inktank)

Current Status

We have worked to identify and fix the xattr bugs in zfsonlinux so that ceph-osd will run on top of ZFS in the normal write-ahead journaling mode, just as it does on ext4 or XFS. We do not yet take advantage of any special ZFS features.

Detailed Description

At a minimum, ZFS's snapshot support could be used the same way it is used on btrfs: to provide a stable consistency point to journal relative to, allowing us to use the parallel journaling mode (which has much better read/modify/write performance).
Looking further forward, I suspect there are much more involved ways that we could take advantage of ZFS, by using the DMU directly instead of going through the POSIX layer. I would like to discuss both the short-term improvements and the long-term possibilities in this session.
To abstract the underlying fs functionality out of FileStore, we need an interface that looks like this:
class BackingFileSystem {
public:
  bool can_checkpoint(); ///< true if we can snapshot to allow parallel journaling, etc.
  int create_base_volume(); ///< used during mkfs: mkdir in the degenerate case, create a subvolume for btrfs, ...
  int list_checkpoints(list<string> *ls); ///< used during mount: list the checkpoints
  int rollback_to_checkpoint(string name); ///< used during mount to roll back to the last checkpoint before journal replay
  int create_checkpoint_start(string name); ///< start a snap, during sync_entry()
  int create_checkpoint_finish();
  int remove_checkpoint(string name); ///< trim an old snap

  // other btrfs/fs optimizations
  int clone_range(...); ///< fall back to copy as necessary
};
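
To make the shape of the interface more concrete, here is a minimal sketch of how it might look as an abstract C++ class with two implementations. Everything beyond the method names above is an assumption made for illustration: the virtual/abstract form, the error codes, the `GenericBackend` and `ToySnapshotBackend` class names, and the `commit_cycle()` helper (a stand-in for what sync_entry() might do). `ToySnapshotBackend` only records snapshot names in memory; it does not talk to btrfs or ZFS. clone_range is left out because its arguments are not specified here.

```cpp
#include <algorithm>
#include <cerrno>
#include <cstddef>
#include <list>
#include <string>

// Hypothetical abstract form of the interface described above.
class BackingFileSystem {
public:
  virtual ~BackingFileSystem() = default;
  virtual bool can_checkpoint() = 0;
  virtual int create_base_volume() = 0;
  virtual int list_checkpoints(std::list<std::string> *ls) = 0;
  virtual int rollback_to_checkpoint(const std::string &name) = 0;
  virtual int create_checkpoint_start(const std::string &name) = 0;
  virtual int create_checkpoint_finish() = 0;
  virtual int remove_checkpoint(const std::string &name) = 0;
};

// Degenerate backend (xfs, ext4, ...): no snapshots, write-ahead journaling only.
class GenericBackend : public BackingFileSystem {
public:
  bool can_checkpoint() override { return false; }
  int create_base_volume() override { return 0; }  // real code would mkdir here
  int list_checkpoints(std::list<std::string> *ls) override { ls->clear(); return 0; }
  int rollback_to_checkpoint(const std::string &) override { return -EOPNOTSUPP; }
  int create_checkpoint_start(const std::string &) override { return -EOPNOTSUPP; }
  int create_checkpoint_finish() override { return -EOPNOTSUPP; }
  int remove_checkpoint(const std::string &) override { return -EOPNOTSUPP; }
};

// Toy stand-in for a snapshotting backend: it only tracks names in memory.
class ToySnapshotBackend : public BackingFileSystem {
  std::list<std::string> snaps_;  // completed checkpoints, oldest first
  std::string pending_;           // checkpoint started but not yet finished
public:
  bool can_checkpoint() override { return true; }
  int create_base_volume() override { return 0; }
  int list_checkpoints(std::list<std::string> *ls) override { *ls = snaps_; return 0; }
  int rollback_to_checkpoint(const std::string &name) override {
    return std::find(snaps_.begin(), snaps_.end(), name) != snaps_.end() ? 0 : -ENOENT;
  }
  int create_checkpoint_start(const std::string &name) override {
    pending_ = name;
    return 0;
  }
  int create_checkpoint_finish() override {
    if (pending_.empty()) return -EINVAL;  // finish without a matching start
    snaps_.push_back(pending_);
    pending_.clear();
    return 0;
  }
  int remove_checkpoint(const std::string &name) override {
    auto it = std::find(snaps_.begin(), snaps_.end(), name);
    if (it == snaps_.end()) return -ENOENT;
    snaps_.erase(it);
    return 0;
  }
};

// One commit cycle: take a checkpoint, then trim so at most `keep` remain.
int commit_cycle(BackingFileSystem &fs, const std::string &name, std::size_t keep) {
  if (!fs.can_checkpoint()) return 0;  // write-ahead mode: nothing to snapshot
  int r = fs.create_checkpoint_start(name);
  if (r < 0) return r;
  r = fs.create_checkpoint_finish();
  if (r < 0) return r;
  std::list<std::string> ls;
  r = fs.list_checkpoints(&ls);
  if (r < 0) return r;
  while (ls.size() > keep) {  // drop the oldest checkpoints first
    fs.remove_checkpoint(ls.front());
    ls.pop_front();
  }
  return 0;
}
```

The point of the sketch is that the caller only ever asks can_checkpoint() and never needs to know which file system is underneath; a zfs backend would then be a third subclass whose checkpoint methods trigger ZFS snapshots.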

FileStore::_detect_fs() will need to be refactored to instantiate an implementation of the above interface instead of relying on the current open-coded checks.
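
As a rough illustration of that refactoring, backend selection could be reduced to a single mapping from the superblock magic number to a backend kind. This is only a sketch: the `BackendKind` enum and `select_backend()` are hypothetical names, the caller is assumed to pass in the `f_type` value obtained from statfs(2), and the ZFS magic shown is the value zfsonlinux is believed to report, which should be verified against the zfsonlinux headers before use.

```cpp
#include <cstdint>

// Superblock magic numbers as seen in statfs(2) f_type on Linux.
constexpr std::uint32_t kBtrfsMagic = 0x9123683EU;  // BTRFS_SUPER_MAGIC
constexpr std::uint32_t kZfsMagic   = 0x2FC12FC1U;  // assumed zfsonlinux value

// Hypothetical set of backends that _detect_fs() could instantiate.
enum class BackendKind { Generic, Btrfs, Zfs };

// Map a file system magic number to the backend that should handle it.
// Anything unrecognized (xfs, ext4, ...) falls through to the generic backend.
BackendKind select_backend(std::uint32_t f_type) {
  switch (f_type) {
    case kBtrfsMagic: return BackendKind::Btrfs;
    case kZfsMagic:   return BackendKind::Zfs;
    default:          return BackendKind::Generic;
  }
}
```

Defaulting to the generic backend keeps unknown file systems in write-ahead journaling mode, which is the conservative choice.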
All references to btrfs_stable_commits will be replaced with can_checkpoint().
Once this refactoring is in place, implementing a zfs backend should be pretty straightforward.
  • identify correct zfs snap interface (ioctls?)
  • look at nilfs2?
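
For the mount-time path (list_checkpoints() followed by rollback_to_checkpoint()), FileStore needs to pick the most recent checkpoint out of whatever names the backend returns. The sketch below assumes checkpoints are named `snap_<seq>` with a decimal commit sequence number; that naming scheme and the helper names `parse_snap_seq()` and `newest_checkpoint()` are assumptions for illustration, not part of the interface above.

```cpp
#include <cstdint>
#include <cstdlib>
#include <list>
#include <string>

// Parse a name of the assumed form "snap_<decimal seq>".
// Returns false for anything else (e.g. other volumes in the same pool).
bool parse_snap_seq(const std::string &name, std::uint64_t *seq) {
  const std::string prefix = "snap_";
  if (name.compare(0, prefix.size(), prefix) != 0) return false;
  const std::string digits = name.substr(prefix.size());
  if (digits.empty() ||
      digits.find_first_not_of("0123456789") != std::string::npos)
    return false;
  *seq = std::strtoull(digits.c_str(), nullptr, 10);
  return true;
}

// Pick the checkpoint with the highest sequence number.
// Returns false if the list contains no well-formed checkpoint name.
bool newest_checkpoint(const std::list<std::string> &ls, std::string *out) {
  bool found = false;
  std::uint64_t best = 0;
  for (const auto &name : ls) {
    std::uint64_t seq;
    if (!parse_snap_seq(name, &seq)) continue;  // ignore unrelated entries
    if (!found || seq > best) {
      best = seq;
      *out = name;
      found = true;
    }
  }
  return found;
}
```

Comparing parsed sequence numbers rather than strings matters here: a plain string comparison would rank `snap_9` above `snap_10`. The name returned would then be handed to rollback_to_checkpoint() before journal replay.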

Work items

Coding tasks

  1. filestore: generalize the snapshot enumeration, creation hooks and other btrfs-specific behaviors such that the btrfs hooks fit into a generic interface
  2. filestore: implement generic backend (xfs, ext4, etc.)
  3. filestore: implement btrfs backend
  4. filestore: clean out all btrfs_* member cruft
  5. filestore: implement a zfs backend that triggers zfs snapshots
  6. ceph-deploy: add zfs to the list of file systems supported by osd create ...

Build / release tasks

  1. include zfsonlinux in ceph-qa-chef on supported platforms
  2. teuthology: add support for fs: zfs
  3. include fs:zfs in the rados test matrix

Documentation tasks

  1. document the filestore backend interface in the internals section of the docs