OSD - tiering - cache pool overlay

Summary

Layer a fast rados pool over an existing pool as a cache.

Owners

Sage Weil (Inktank)

Interested Parties

Loic Dachary <loic@dachary.org>
Sam Just (Inktank)
Danny Al-Gaaf

Current Status

Each pool is a simple logical container for objects. Objects exist in exactly one pool. Pools are sharded into PGs and uniformly distributed across some set of OSDs by CRUSH. Any caching happens in individual OSDs or entirely on the client side in an application-specific way (rbd caching != cephfs caching != random librados application's cache).
The only way to use SSDs to accelerate IO is to put them in OSDs, either as a separate SSD-only pool, or with SSD+HDD hybrid file systems backing each OSD (bcache, FlashCache, etc.).

Detailed Description

We would like to take an existing pool and layer a cache pool in front of it. Reads would first check the cache pool for a copy of the object, and then fall through to the existing pool if there is a miss. The assumption is that the cache pool will be significantly faster than the existing pool, such that a miss does not significantly increase IO latency, and a hit is a big win.

Pool metadata:

The cache pool is a property of the existing pool, specified in the OSDMap's pg_pool_t.
Additional fields describe the policy, which I leave somewhat unspecified right now.

Object metadata:

Each object in the cache pool has a few new object_info_t fields:
- eversion_t backing_version; // version of the object in the backend pool
- uint64_t object_version; // user-visible version of this object
  This is needed because we must maintain the illusion of increasing version numbers independent of object movement between the cache and backend pools, so using the PG's version is no longer appropriate. This field will not replace the PG version, but will supplement it and be adjusted to increase monotonically. It will be exposed by the public Objecter and librados APIs instead of eversion_t::version.
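
A minimal sketch of how the two fields might interact, written in Python for brevity. The ObjectInfo class, the promote and write helpers, and the choice to bump object_version by one per write are all assumptions made for illustration; only the two field names come from the list above.

```python
from dataclasses import dataclass
from typing import Tuple

# (epoch, version) pair, loosely modeled on eversion_t.
Eversion = Tuple[int, int]


@dataclass
class ObjectInfo:
    """Hypothetical stand-in for the new object_info_t fields."""
    backing_version: Eversion  # version of the object in the backend pool
    object_version: int        # user-visible version of this object


def promote(backend_eversion: Eversion, backend_object_version: int) -> ObjectInfo:
    """Copy an object into the cache pool, carrying its user-visible version."""
    return ObjectInfo(backing_version=backend_eversion,
                      object_version=backend_object_version)


def write(info: ObjectInfo) -> int:
    """Apply a write in the cache pool and return the version the user sees.

    Assumption: the user-visible version advances by one per write,
    regardless of the cache PG's own version counter.
    """
    info.object_version += 1
    return info.object_version


# The backend copy is at eversion (10, 500) and the user last saw version 41.
info = promote((10, 500), 41)
seen = write(info)  # the cache PG's own counter plays no part here
```

The point of the sketch is that the version returned to the user continues from 41 even though the object now lives in a different PG with its own, unrelated version counter.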

Objecter behavior:

Any IO (read or write) will first be directed at the cache pool.
If the OSD replies with EAGAIN (or some similar error code), we send the request to the backend pool.
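
The retry logic can be sketched as follows. This is a toy model, not the Objecter itself: pools are plain dictionaries, and the function names (cache_pool_read, backend_pool_read, objecter_read) are hypothetical.

```python
import errno
from typing import Dict, Optional, Tuple

Reply = Tuple[int, Optional[bytes]]  # (return code, data)


def cache_pool_read(cache: Dict[str, bytes], oid: str) -> Reply:
    """Hypothetical cache-pool OSD: serve hits, reply -EAGAIN on a miss."""
    if oid in cache:
        return 0, cache[oid]
    return -errno.EAGAIN, None


def backend_pool_read(backend: Dict[str, bytes], oid: str) -> Reply:
    """Hypothetical backend-pool OSD: the authoritative copy, or -ENOENT."""
    if oid in backend:
        return 0, backend[oid]
    return -errno.ENOENT, None


def objecter_read(cache: Dict[str, bytes], backend: Dict[str, bytes], oid: str) -> Reply:
    """Try the cache pool first; resend to the backend pool on -EAGAIN."""
    rc, data = cache_pool_read(cache, oid)
    if rc == -errno.EAGAIN:
        rc, data = backend_pool_read(backend, oid)
    return rc, data


cache = {"hot": b"cached"}
backend = {"hot": b"stale", "cold": b"on-disk"}
```

A hit never touches the backend pool, so the client sees the cached copy of "hot"; a miss costs one extra round trip before "cold" is served from the backend.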

OSD behavior:

If the object exists in the cache pool, it is assumed to be complete and up to date. We process the read and return.
If the object does not exist, we can either
- return EAGAIN
- block, read the object from the backend pool, then satisfy the request (or return ENOENT)
If the operation is a multi-object operation (clonerange, etc.), we can proceed if we have all copies; if not, we ensure that all copies have been written back and then return EAGAIN.
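
The decision above might look roughly like this. It is a sketch under stated assumptions: pools are dictionaries, "dirty" is a set of object names with unflushed changes, and MissPolicy and handle_op are hypothetical names. Which of the two miss behaviors to use is left open by this blueprint, so the sketch takes it as a parameter.

```python
import errno
from enum import Enum
from typing import Dict, List, Set


class MissPolicy(Enum):
    REDIRECT = "eagain"   # tell the client to go to the backend pool
    PROMOTE = "promote"   # block, read from the backend pool, then proceed


def handle_op(oids: List[str], cache: Dict[str, bytes], backend: Dict[str, bytes],
              dirty: Set[str], miss_policy: MissPolicy) -> int:
    """Return 0 if the op can proceed in the cache pool, else a negative errno."""
    if all(oid in cache for oid in oids):
        return 0  # every object is present and assumed up to date

    if len(oids) > 1:
        # Multi-object op with a missing copy: write back whatever we hold so
        # the backend pool is current, then redirect the client there.
        for oid in oids:
            if oid in cache and oid in dirty:
                backend[oid] = cache[oid]
                dirty.discard(oid)
        return -errno.EAGAIN

    oid = oids[0]
    if miss_policy is MissPolicy.REDIRECT:
        return -errno.EAGAIN
    if oid not in backend:
        return -errno.ENOENT
    cache[oid] = backend[oid]  # promote, then satisfy the request
    return 0
```

For example, a clonerange-style op on ["src", "dst"] where only "src" is cached and dirty would flush "src" to the backend pool and return -EAGAIN.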

Object creation:

In certain cases we don't care whether the object previously existed. Make a helper to determine if this is the case for a given transaction (check for things like WRITEFULL, or a REMOVE that precedes the op). If true, process the write immediately without checking the backend pool.
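
A hedged sketch of such a helper, treating a transaction as an ordered list of op names. The name ignores_prior_state and the exact rules are assumptions: it only recognizes the two cases mentioned above and conservatively answers "no" for everything else, including a lone REMOVE.

```python
from typing import Sequence


def ignores_prior_state(ops: Sequence[str]) -> bool:
    """Return True if the transaction's result cannot depend on the old object.

    ops is the ordered list of op names in the transaction (hypothetical
    string names, e.g. "WRITEFULL", "REMOVE", "WRITE").
    """
    if not ops:
        return False
    if ops[0] == "WRITEFULL":
        # The whole object is replaced before anything else looks at it.
        return True
    # A REMOVE that precedes the remaining ops discards any old contents.
    return ops[0] == "REMOVE" and len(ops) > 1
```

With this shape, ["WRITEFULL"] and ["REMOVE", "WRITE"] can be applied in the cache pool immediately, while ["WRITE"] or ["SETXATTR", "WRITEFULL"] still need the existing object.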

Cache eviction:

Cache eviction is handled by the OSD, independently on each PG.
To evict an object, the OSD issues a writefull on the object to the backend pool.
If that completes with no intervening read/write to the object (i.e., it is still cold), we remove the object from the cache pool.
We can also silently back off if we decide the object is hot again.
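
The write-back-then-check sequence could be sketched like this. Detecting "no intervening read/write" with a per-object IO counter is an assumption of the sketch, as are the CachePG and try_evict names; the during_writeback hook exists only to simulate client IO racing with the eviction.

```python
from typing import Callable, Dict, Optional


class CachePG:
    """Toy cache-pool PG that counts IOs per object."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.io_count: Dict[str, int] = {}

    def write(self, oid: str, data: bytes) -> None:
        self.objects[oid] = data
        self.io_count[oid] = self.io_count.get(oid, 0) + 1

    def read(self, oid: str) -> bytes:
        self.io_count[oid] = self.io_count.get(oid, 0) + 1
        return self.objects[oid]


def try_evict(pg: CachePG, backend: Dict[str, bytes], oid: str,
              during_writeback: Optional[Callable[[], None]] = None) -> bool:
    """Write the object back, then drop it only if nobody touched it meanwhile."""
    before = pg.io_count[oid]
    backend[oid] = pg.objects[oid]  # stands in for the writefull to the backend pool
    if during_writeback is not None:
        during_writeback()          # simulated concurrent client IO
    if pg.io_count[oid] != before:
        return False                # hot again: back off silently
    del pg.objects[oid]
    del pg.io_count[oid]
    return True
```

Note that a backed-off eviction still leaves a fresh copy in the backend pool, which is harmless because the cache copy remains authoritative.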

New librados operations:

PROMOTE: promote an object from the backend store
FLUSH: writeback any "dirty" changes in the cache pool to the backend pool
EVICT: evict an object from the cache pool
(These operations would all be special-cased to avoid the normal cache/EAGAIN checks, possibly with their own rados op type.)

Eviction policy:

The pg_pool_t describes the policy, probably something like a high-water mark to trigger eviction on the OSDs.
The PG should use a bloom filter to approximate object temperature.
To evict, the PG can enumerate objects and evict any object that is not warm.
We could store some additional metadata (like atime) if we think our pool is fast enough, perhaps in leveldb.
The eviction code/policy should be modular so that we can adjust this approach as we go.
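
One way the bloom-filter approach could work, as a rough sketch. The filter size, the number of hashes, the use of SHA-256, and the pick_evictions helper are all illustrative assumptions, not part of this blueprint. Each PG records recently accessed object names in the filter; once usage crosses the high-water mark, anything the filter does not report as recently accessed is a candidate for eviction.

```python
import hashlib
from typing import Iterable, Iterator, List


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, nbits: int = 1024, nhashes: int = 3) -> None:
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = 0  # Python int used as a bit array

    def _positions(self, key: str) -> Iterator[int]:
        for i in range(self.nhashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def pick_evictions(objects: Iterable[str], recent: BloomFilter,
                   used_bytes: int, high_water_bytes: int) -> List[str]:
    """Enumerate objects and return those that are not warm, if over the mark."""
    if used_bytes <= high_water_bytes:
        return []
    return [oid for oid in objects if oid not in recent]


recent = BloomFilter()
for oid in ("hot1", "hot2"):
    recent.add(oid)  # recorded on each read or write
victims = pick_evictions(["hot1", "hot2", "cold1", "cold2"], recent,
                         used_bytes=900, high_water_bytes=800)
```

Because a Bloom filter never reports a false negative, a recently accessed object is never chosen; a false positive only means a cold object survives one more pass, which is a cheap kind of error for a cache.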

Work items

Coding tasks

independent object_info_t version: plumb through rados, MOSDOp, Objecter, librados
pg_pool_t: cache_pool property
objecter: send requests to cache pool, then regular pool
osd: ability to read or write objects to/from backend pool (objecter or push/pull? the latter, I think)
osd: basic io decision: read/write from/to cache pool, or EAGAIN
test: manually populate cache pool, run unit/stress tests
librados, osd: explicit promote operation
osd: transparently promote objects on read/write
librados, osd: explicit flush operation (writeback current value)
librados, osd: explicit evict operation
temperature tracking (this overlaps with the other blueprint!)
implement eviction policy

Updated by Jessica Mack almost 9 years ago · 1 revisions
