OSD - new key/value backend


Create generic, reusable components to back a ceph-osd with a key/value interface


  • Sage Weil (Inktank)

Interested Parties

  • Haomai Wang (UnitedStack)
  • Yan, Zheng (Intel)
  • Jiangang, Duan (Intel)
  • Anip Patel (Arizona State University, student)
  • Andrey Korolyov (Flops)

Current Status

A number of new key/value interfaces are emerging. New storage hardware, including shingled drives and flash, should behave much better when accessed through these interfaces.
The KeyValueDB interface abstracts individual key/value interfaces. It includes a transaction primitive. Currently the only implementation uses leveldb. Alternatives we should consider include

The DBObjectMap interface builds a simple tree structure on top of KeyValueDB that hides some of the namespace complexity (e.g. omap keys vs xattrs keys) and includes a header/intermediate node that allows clone() to happen efficiently.
Haomai has a prototype ObjectStore implementation that uses leveldb on the backend, but it is not quite functional yet, and does not support operations like clone.
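
To make the shape of the KeyValueDB abstraction described above concrete, here is a minimal in-memory sketch in Python. It is only an illustration: the class and method names (MemKeyValueDB, Transaction, set, rmkey, submit_transaction) are hypothetical stand-ins loosely modeled on the description, not Ceph's actual C++ interface. The point it shows is that operations are queued against (prefix, key) pairs and become visible all at once or not at all.

```python
class Transaction:
    """Queue of operations to apply together (hypothetical sketch)."""

    def __init__(self):
        self.ops = []

    def set(self, prefix, key, value):
        self.ops.append(("set", prefix, key, value))

    def rmkey(self, prefix, key):
        self.ops.append(("rm", prefix, key, None))


class MemKeyValueDB:
    """Toy in-memory backend; assumption: keys are namespaced by a prefix."""

    def __init__(self):
        self._data = {}  # maps (prefix, key) -> bytes

    def get_transaction(self):
        return Transaction()

    def submit_transaction(self, txn):
        # Stage every op on a copy so a failure leaves the store untouched.
        staged = dict(self._data)
        for op, prefix, key, value in txn.ops:
            if op == "set":
                staged[(prefix, key)] = value
            elif op == "rm":
                staged.pop((prefix, key), None)
            else:
                raise ValueError(f"unknown op: {op}")
        # Single swap: all ops in the transaction become visible together.
        self._data = staged

    def get(self, prefix, key):
        return self._data.get((prefix, key))


db = MemKeyValueDB()
txn = db.get_transaction()
txn.set("omap", "foo", b"1")
txn.set("xattr", "foo", b"2")  # same key name, different namespace
db.submit_transaction(txn)
```

The separate "omap" and "xattr" prefixes illustrate the kind of namespace split that DBObjectMap hides from its callers.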

Detailed Description

Build an ObjectStore implementation that builds on DBObjectMap (and KeyValueDB) to store everything. There will be no direct filesystem interaction.
  • Since KeyValueDB has a transaction primitive, do not include a journal at all
  • We may want to add this capability later so that (say) NVRAM can mask the latency of a slow k/v backend, but for now let's ignore it for simplicity.
  • FileJournal (and/or the abstract Journal) should be reusable.
  • JournalingObjectStore probably is not reusable, but that's ok
Note on transactions:
  • For now, I suggest we assume we can build on the existing KeyValueDB interface, which includes transactions.
    • leveldb has transactions
    • NVMKV has batch_put, which is limited to 64 keys
    • kinetic has no transactions
  • To get the atomicity we need, there are a range of tricks available:
    • Write data to new keys, update a sentinel/root key at the end
    • Write data to new keys, batch-rename them into place (if backend allows such a thing)
    • Intent logs
    • Write-ahead transaction journaling (when necessary)
  • I am not sure whether we can efficiently hide a 'transaction layer' beneath the KeyValueDB interface, but for now let's assume we will be able to.
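
As an illustration of the first trick listed above (write data to new keys, then update a sentinel/root key at the end), here is a small Python sketch. It assumes a hypothetical backend, PlainKV, where only single-key put/get/delete are atomic. The key layout ("<obj>/v<N>/<name>" plus "<obj>/root") and the helper names commit and read are invented for this example and are not part of any existing interface.

```python
class PlainKV:
    """Hypothetical backend: single-key operations are atomic, nothing larger."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def delete(self, key):
        self.data.pop(key, None)


def commit(kv, obj, updates):
    """Replace the visible contents of obj with updates, all or nothing."""
    root_key = f"{obj}/root"
    old = kv.get(root_key)
    version = 0 if old is None else old["version"] + 1
    # Step 1: write to fresh, version-qualified keys. Readers cannot see
    # them yet, so a crash here leaves the old state fully intact.
    for name, value in updates.items():
        kv.put(f"{obj}/v{version}/{name}", value)
    # Step 2: one atomic single-key put flips the root to the new version.
    kv.put(root_key, {"version": version, "names": sorted(updates)})
    # Step 3: garbage-collect the keys of the previous version.
    if old is not None:
        for name in old["names"]:
            kv.delete(f"{obj}/v{old['version']}/{name}")


def read(kv, obj):
    """Return only the data reachable from the current root key."""
    root = kv.get(f"{obj}/root")
    if root is None:
        return {}
    return {n: kv.get(f"{obj}/v{root['version']}/{n}") for n in root["names"]}


kv = PlainKV()
commit(kv, "obj1", {"a": b"1", "b": b"2"})
```

The cost of this approach is an extra root lookup on every read and leftover keys to clean up after a crash between steps 1 and 2, which is one reason the other tricks (intent logs, write-ahead journaling) remain on the list.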

Work items

Coding tasks

  1. refactor OSD awareness of FileStore to make the ObjectStore backend configurable
    1. use a generic method to get an ObjectStore implementation by type
    2. push any FileJournal and FileStore references out of osd/*
  2. DBObjectMap: refactor interface
    1. expose underlying KeyValueDB transactions to the caller, so they can bundle several DBObjectMap ops together and capture an entire ObjectStore::Transaction's worth of work
    2. expose the user prefixes in a generic way, instead of hard-coding in the omap, xattr, and various internal namespaces
  3. stripe file data over keys
    1. Build a class that will implement a file data interface (read extent, write extent, truncate, zero, etc.) on top of DBObjectMap
    2. stripe data over keys of size X (e.g., 1MB, which seems to be the limit people are converging around)
    3. store file size information in a metadata key. maybe this can be DBObjectMap::Header; maybe not
    4. contemplate future optimizations that put small objects "inline" in the Header (or equivalent) key
  4. build a KeyValueDB implementation based on the new Kinetic API
    1. initially, we can just ignore transactions
  5. build a KeyValueDB implementation based on the NVMKV API
    1. opportunistically use batch_put, but otherwise ignore large transaction atomicity
  6. build a KeyValueDB implementation based on RocksDB
    1. allow omap location to be configured independently of osd data path? need to consider commit sequence. :/
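
To illustrate coding task 3 (striping file data over keys), here is a rough Python sketch of a file data interface layered on a key/value map. Everything in it is an assumption made for illustration: the class name StripedData, the ("stripe", index) key layout, the "size" metadata key, and the use of a plain dict in place of DBObjectMap. A real implementation would also bundle all key updates from one write into a single transaction.

```python
class StripedData:
    """File-like extents striped over fixed-size keys (illustrative only)."""

    def __init__(self, kv, stripe_size=1 << 20):
        self.kv = kv  # any dict-like store
        self.stripe_size = stripe_size

    def size(self):
        return self.kv.get("size", 0)

    def _stripe(self, index):
        return self.kv.get(("stripe", index), b"")

    def write(self, offset, data):
        pos = 0
        while pos < len(data):
            index, within = divmod(offset + pos, self.stripe_size)
            n = min(self.stripe_size - within, len(data) - pos)
            # Pad a short stripe with zeros up to the write position.
            old = self._stripe(index).ljust(within, b"\0")
            self.kv[("stripe", index)] = (
                old[:within] + data[pos:pos + n] + old[within + n:]
            )
            pos += n
        self.kv["size"] = max(self.size(), offset + len(data))

    def read(self, offset, length):
        end = min(offset + length, self.size())
        out = bytearray()
        pos = offset
        while pos < end:
            index, within = divmod(pos, self.stripe_size)
            n = min(self.stripe_size - within, end - pos)
            chunk = self._stripe(index)[within:within + n]
            out += chunk.ljust(n, b"\0")  # holes and short stripes read as zeros
            pos += n
        return bytes(out)

    def truncate(self, size):
        old_size = self.size()
        if size < old_size:
            first_dead, keep = divmod(size, self.stripe_size)
            last = (old_size - 1) // self.stripe_size
            if keep:
                # Trim the stripe that straddles the new end of file.
                self.kv[("stripe", first_dead)] = self._stripe(first_dead)[:keep]
                first_dead += 1
            for index in range(first_dead, last + 1):
                self.kv.pop(("stripe", index), None)
        self.kv["size"] = size


store = {}
f = StripedData(store, stripe_size=4)  # tiny stripes to make the layout visible
f.write(2, b"abcdefg")                 # spans stripes 0, 1 and 2
```

Keeping the size in its own key means a read never has to scan stripes to find the end of the object; whether that key should be DBObjectMap::Header is the open question noted in the task list.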