OSD - tiering - cache pool overlay


Layer a fast RADOS pool over an existing pool as a cache.


  • Sage Weil (Inktank)

Interested Parties

Current Status

Each pool is a simple logical container for objects. Objects exist in exactly one pool. Pools are sharded into PGs and uniformly distributed across some set of OSDs by CRUSH. Any caching happens in individual OSDs or entirely on the client side in an application-specific way (rbd caching != cephfs caching != random librados application's cache).
The only way to use SSDs to accelerate IO is to put them in OSDs, either as a separate SSD-only pool, or with SSD+HDD hybrid file systems backing each OSD (bcache, FlashCache, etc.).

Detailed Description

We would like to take an existing pool and layer a cache pool in front of it. Reads would first check the cache pool for a copy of the object, and then fall through to the existing pool if there is a miss. The assumption is that the cache pool will be significantly faster than the existing pool, such that a miss does not significantly increase IO latency, and a hit is a big win.
Pool metadata:
  • The cache pool is a property of the existing pool, specified in the OSDMap's pg_pool_t.
  • Additional fields describe the policy, which I leave somewhat unspecified right now.
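As a rough illustration of the pool metadata described above, the following sketch models the cache pool as a property of the existing pool. `PoolSketch`, `set_cache_pool`, `first_target`, and the `evict_high_water` field are hypothetical names invented for this example; they are not the actual pg_pool_t definition, and the policy fields remain unspecified in this blueprint.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the cache-related part of pg_pool_t.
// Field names are illustrative only.
struct PoolSketch {
  int64_t id;
  int64_t cache_pool;       // id of the overlay cache pool, or -1 for none
  double evict_high_water;  // assumed policy field (fraction of capacity)

  explicit PoolSketch(int64_t pool_id)
      : id(pool_id), cache_pool(-1), evict_high_water(0.8) {}

  bool has_cache() const { return cache_pool >= 0; }
};

// Record `cache` as the overlay of `base`.
inline void set_cache_pool(PoolSketch& base, const PoolSketch& cache) {
  assert(base.id != cache.id);  // a pool cannot be its own cache
  base.cache_pool = cache.id;
}

// Pool that a client would try first for I/O addressed to `base`.
inline int64_t first_target(const PoolSketch& base) {
  return base.has_cache() ? base.cache_pool : base.id;
}
```

Keeping the link on the existing pool means a client that only knows the base pool can discover the overlay from the OSDMap alone.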
Object metadata:
  • Each object in the cache pool has a few new object_info_t fields
    • eversion_t backing_version; // version of the object in the backend pool
    • uint64_t object_version; // user-visible version of this object
      We need to maintain the illusion of increasing version numbers independent of object movement between the cache and backend pool, so using the PG's version is no longer appropriate. This field will not replace the PG version, but will supplement it and be adjusted to increase monotonically. It will be exposed by the public Objecter and librados APIs instead of eversion_t::version.
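The sketch below shows one way the user-visible version could stay monotonic while an object moves between pools. It assumes the version is simply carried over on promotion; that mechanism, and the names `EVersionSketch`, `ObjectInfoSketch`, `apply_write`, and `promote`, are assumptions made for this example and not the real object_info_t code.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for eversion_t.
struct EVersionSketch {
  uint64_t epoch;
  uint64_t version;
  explicit EVersionSketch(uint64_t e = 0, uint64_t v = 0)
      : epoch(e), version(v) {}
};

// Simplified stand-in for the relevant object_info_t fields.
struct ObjectInfoSketch {
  EVersionSketch pg_version;       // assigned by whichever PG hosts the object
  EVersionSketch backing_version;  // version of the object in the backend pool
  uint64_t object_version;         // user-visible version
  ObjectInfoSketch() : object_version(0) {}
};

// A client write: the hosting PG assigns its own version, and the
// user-visible version advances by one.
inline void apply_write(ObjectInfoSketch& oi, EVersionSketch new_pg_version) {
  assert(oi.object_version < UINT64_MAX);  // guard against wraparound
  oi.pg_version = new_pg_version;
  ++oi.object_version;
}

// Promotion into the cache pool: the cache PG assigns an unrelated PG
// version, but the user-visible version is carried over unchanged.
inline ObjectInfoSketch promote(const ObjectInfoSketch& backend,
                                EVersionSketch cache_pg_version) {
  ObjectInfoSketch cached;
  cached.pg_version = cache_pg_version;
  cached.backing_version = backend.pg_version;
  cached.object_version = backend.object_version;
  return cached;
}
```

In this sketch the PG version can go backwards after a promotion (for example from 500 in the backend PG to 3 in the cache PG), while object_version only ever increases.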
Objecter behavior:
  • any io (read or write) will first be directed at the cache pool
  • if the OSD replies with EAGAIN (or some similar error code), we send the request to the backend pool
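The Objecter behavior above amounts to a two-step retry, which can be sketched as follows. `SendFn` and `submit_with_overlay` are hypothetical names for this example; the real Objecter is asynchronous and considerably more involved.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical transport hook: send one op for `oid` to `pool` and return
// 0 on success or a negative errno value.
typedef std::function<int(int64_t pool, const std::string& oid)> SendFn;

// Try the cache pool first; if the OSD answers -EAGAIN, resend the same
// request to the backend pool. Any other result is returned as-is.
inline int submit_with_overlay(int64_t cache_pool, int64_t base_pool,
                               const std::string& oid, const SendFn& send) {
  assert(static_cast<bool>(send));  // caller must supply a transport
  int r = send(cache_pool, oid);
  if (r == -EAGAIN) {
    r = send(base_pool, oid);
  }
  return r;
}
```

Only -EAGAIN triggers the fallback here; a result such as -ENOENT from the cache pool is treated as authoritative, matching the assumption that a cached object is complete and up to date.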
OSD behavior:
  • if the object exists in the cache pool, it is assumed to be complete and up to date. we process the read and return.
  • if the object does not exist, we can either
    • reply with EAGAIN
    • block, read the object from the backend pool, then satisfy the request (or ENOENT)
  • if the operation is a multi-object operation (clonerange, etc.), we can proceed if we have all copies; if not, we ensure that all copies have been written back and then return EAGAIN
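The OSD-side choices listed above can be summarized as a small decision function. This is a minimal sketch: `MissPolicy`, `Action`, and `decide` are hypothetical names, and counting cached objects stands in for the real per-object existence checks.

```cpp
#include <cassert>

// What to do on a single-object miss; the blueprint leaves this choice open.
enum class MissPolicy { ReplyEagain, PromoteAndServe };

enum class Action {
  Serve,             // all objects are in the cache pool: process and return
  ReplyEagain,       // client retries against the backend pool
  PromoteThenServe,  // block, read from the backend pool, then serve
  FlushThenEagain    // write back cached copies, then reply EAGAIN
};

// `needed` is the number of objects the op touches (1 for ordinary ops),
// `cached` is how many of them currently exist in the cache pool.
inline Action decide(int needed, int cached, MissPolicy on_miss) {
  assert(needed >= 1 && cached >= 0 && cached <= needed);
  if (cached == needed) {
    return Action::Serve;
  }
  if (needed > 1) {
    // Multi-object op with a partial set: make the backend pool current
    // before sending the client there.
    return Action::FlushThenEagain;
  }
  return on_miss == MissPolicy::PromoteAndServe ? Action::PromoteThenServe
                                                : Action::ReplyEagain;
}
```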
Object creation:
  • in certain cases we don't care if the object previously existed. make a helper to determine if this is the case for a given transaction (check for things like WRITEFULL, a REMOVE that precedes the op). if true, process the write immediately without checking the backend pool.
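A minimal sketch of such a helper follows. It assumes a transaction is an ordered list of ops and that only the first op matters; `OpType` and `ignores_prior_state` are hypothetical names, and the real check would need to cover more op types than are listed here.

```cpp
#include <cassert>
#include <vector>

// Illustrative subset of op types, not the real RADOS op codes.
enum class OpType { Read, Write, Append, WriteFull, Remove, SetXattr };

// Returns true if the transaction starts by replacing or removing the
// object, so whatever the backend pool holds cannot affect the outcome.
// Simplification: only the first op is inspected.
inline bool ignores_prior_state(const std::vector<OpType>& ops) {
  if (ops.empty()) {
    return false;
  }
  return ops.front() == OpType::WriteFull || ops.front() == OpType::Remove;
}
```

When the helper returns true, the OSD could apply the write directly in the cache pool; otherwise it falls back to the miss handling described under OSD behavior.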
Cache eviction:
  • Cache eviction is handled by the OSD, independently on each PG.
  • To evict an object, it will issue a writefull on the object to the backend pool.
  • If that completes with no intervening read/write to the object (i.e., still cold), we remove the object from the cache pool.
  • We can also silently back off if we decide the object is hot again
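The eviction steps above can be sketched with a per-object access counter that detects intervening I/O. Everything here is an in-memory illustration under assumed names (`CachePgSketch`, `WriteFullFn`, `try_evict`); the real OSD would do this asynchronously within a PG.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Hypothetical hook that writes the full object to the backend pool.
typedef std::function<void(const std::string& oid, const std::string& data)>
    WriteFullFn;

class CachePgSketch {
 public:
  void write(const std::string& oid, const std::string& data) {
    Entry& e = objects_[oid];
    e.data = data;
    ++e.access_gen;
  }

  bool read(const std::string& oid, std::string* out) {
    std::map<std::string, Entry>::iterator it = objects_.find(oid);
    if (it == objects_.end()) return false;
    ++it->second.access_gen;
    *out = it->second.data;
    return true;
  }

  bool contains(const std::string& oid) const {
    return objects_.count(oid) != 0;
  }

  // Write the object back, then remove it only if nothing touched it while
  // the writeback was in flight. Returns true if the object was evicted.
  bool try_evict(const std::string& oid, const WriteFullFn& writefull) {
    assert(static_cast<bool>(writefull));
    std::map<std::string, Entry>::iterator it = objects_.find(oid);
    if (it == objects_.end()) return false;
    const uint64_t gen_at_start = it->second.access_gen;
    const std::string snapshot = it->second.data;
    writefull(oid, snapshot);  // other I/O may arrive during this call
    it = objects_.find(oid);
    if (it == objects_.end() || it->second.access_gen != gen_at_start) {
      return false;  // object became hot again: back off silently
    }
    objects_.erase(it);
    return true;
  }

 private:
  struct Entry {
    std::string data;
    uint64_t access_gen;
    Entry() : access_gen(0) {}
  };
  std::map<std::string, Entry> objects_;
};
```

Backing off costs only a redundant write to the backend pool, since the cached copy remains authoritative until it is actually removed.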
New librados operations:
  • PROMOTE: promote an object from the backend store
  • FLUSH: write back any "dirty" changes in the cache pool to the backend pool
  • EVICT: evict an object from the cache pool
  • (these operations would all be special cased to avoid the normal cache/EAGAIN checks. possibly with their own rados op type)
Eviction policy:
  • the pg_pool_t describes the policy. probably something like a high-water mark to trigger eviction on the osds
  • The PG should use a bloom filter to approximate object temperature.
  • To evict, the PG can enumerate objects and evict any object that is not warm.
  • We could store some additional metadata (like atime) if we think our pool is fast enough, perhaps in leveldb.
  • The eviction code/policy should be modular so that we can adjust this approach as we go.
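To make the policy ideas above concrete, here is a minimal sketch of a pluggable policy interface with a bloom-filter implementation and a high-water check. All names (`EvictionPolicy`, `BloomPolicy`, `over_high_water`, `pick_victims`) and the filter parameters are assumptions for illustration; a real filter would also need to be aged or rotated so that objects can cool down.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Pluggable policy interface, so the approach can be swapped later.
struct EvictionPolicy {
  virtual ~EvictionPolicy() {}
  virtual void record_access(const std::string& oid) = 0;
  virtual bool is_warm(const std::string& oid) const = 0;
};

// Approximates "recently accessed" with a bloom filter: no false negatives,
// but a cold object may occasionally be reported as warm.
class BloomPolicy : public EvictionPolicy {
 public:
  explicit BloomPolicy(std::size_t bits, unsigned hashes = 3)
      : bits_(bits, false), hashes_(hashes) {
    assert(bits > 0 && hashes > 0);
  }
  void record_access(const std::string& oid) {
    for (unsigned i = 0; i < hashes_; ++i) bits_[index(oid, i)] = true;
  }
  bool is_warm(const std::string& oid) const {
    for (unsigned i = 0; i < hashes_; ++i) {
      if (!bits_[index(oid, i)]) return false;
    }
    return true;
  }

 private:
  // Derive the i-th hash by salting the object name.
  std::size_t index(const std::string& oid, unsigned i) const {
    return std::hash<std::string>()(oid + '#' + std::to_string(i)) %
           bits_.size();
  }
  std::vector<bool> bits_;
  unsigned hashes_;
};

// High-water mark trigger, e.g. high_water = 0.8 for 80% full.
inline bool over_high_water(uint64_t used, uint64_t capacity,
                            double high_water) {
  return capacity > 0 && static_cast<double>(used) >=
                             high_water * static_cast<double>(capacity);
}

// Enumerate objects and select every one the policy does not report warm.
inline std::vector<std::string> pick_victims(
    const std::vector<std::string>& objects, const EvictionPolicy& policy) {
  std::vector<std::string> victims;
  for (const std::string& oid : objects) {
    if (!policy.is_warm(oid)) victims.push_back(oid);
  }
  return victims;
}
```

Because bloom filters only err toward "warm", this policy may keep a cold object longer than necessary but never evicts one that was recorded as accessed.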

Work items

Coding tasks

  1. independent object_info_t version: plumb through rados, MOSDOp, Objecter, librados
  2. pg_pool_t: cache_pool property
  3. objecter: send requests to cache pool, then regular pool
  4. osd: ability to read or write objects to/from backend pool (objecter or push/pull? the latter, I think)
  5. osd: basic io decision: read/write from/to cache pool, or EAGAIN
  6. test: manually populate cache pool, run unit/stress tests
  7. librados, osd: explicit promote operation
  8. osd: transparently promote objects on read/write
  9. librados, osd: explicit flush operation (writeback current value)
  10. librados, osd: explicit evict operation
  11. temperature tracking (this overlaps with the other blueprint!)
  12. implement eviction policy