Jessica Mack, 06/22/2015 12:06 AM

h1. Osd - tiering - cache pool overlay

h3. Summary

Layer a fast RADOS pool over an existing pool as a cache.

h3. Owners

* Sage Weil (Inktank)

h3. Interested Parties

* Loic Dachary <loic@dachary.org>
* Sam Just (Inktank)
* Danny Al-Gaaf

h3. Current Status

Each pool is a simple logical container for objects.  Objects exist in exactly one pool.  Pools are sharded into PGs and uniformly distributed across some set of OSDs by CRUSH.  Any caching happens in individual OSDs or entirely on the client side in an application-specific way (rbd caching != cephfs caching != a random librados application's cache).

The only way to use SSDs to accelerate IO is to put them in OSDs, either as a separate SSD-only pool or with SSD+HDD hybrid file systems backing each OSD (bcache, FlashCache, etc.).

h3. Detailed Description

We would like to take an existing pool and layer a cache pool in front of it.  Reads would first check the cache pool for a copy of the object, and then fall through to the existing pool on a miss.  The assumption is that the cache pool will be significantly faster than the existing pool, such that a miss does not significantly increase IO latency and a hit is a big win.

Pool metadata (sketch below):
* The cache pool is a property of the existing pool, specified in the OSDMap's pg_pool_t.
* Additional fields describe the policy, which is left somewhat unspecified for now.
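A minimal sketch of the kind of fields pg_pool_t might grow for this. The names used here (cache_pool, cache_target_max_bytes, cache_target_max_objects) are illustrative assumptions, not the actual struct layout or encoding:

<pre><code class="cpp">
#include <cstdint>

// Illustrative stand-in for the pg_pool_t additions; the real change would
// extend src/osd/osd_types.h and bump the struct's encoding version.
struct pg_pool_t_cache_fields {
  int64_t cache_pool = -1;                // pool id of the overlay cache pool, -1 = none (assumed field)
  uint64_t cache_target_max_bytes = 0;    // policy knob: high-water mark in bytes (assumed)
  uint64_t cache_target_max_objects = 0;  // policy knob: high-water mark in objects (assumed)

  bool has_cache_pool() const { return cache_pool >= 0; }
};
</code></pre>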
Object metadata:
* Each object in the cache pool has a few new object_info_t fields:
** eversion_t backing_version;  // version of the object in the backend pool
** uint64_t object_version;     // user-visible version of this object

This is because we need to maintain the illusion of increasing version numbers independent of object movement between the cache and backend pool, so using the PG's version is no longer appropriate.  This will *not* replace the PG version, but will supplement it and be adjusted to increase monotonically.  This field will be exposed by the public Objecter and librados APIs instead of eversion_t::version.
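A self-contained sketch of how the supplemental user-visible version could be kept monotonic across writes and pool movement. eversion_t here is a simplified stand-in, and bump_object_version() is a hypothetical helper, not existing Ceph code:

<pre><code class="cpp">
#include <cstdint>
#include <algorithm>

// Simplified stand-in for the pg-log version type.
struct eversion_t {
  uint64_t epoch = 0;
  uint64_t version = 0;   // changes when the object moves between cache and backend
};

// Illustrative stand-in for the new object_info_t fields.
struct object_info_cache_fields {
  eversion_t backing_version;   // version of the object in the backend pool
  uint64_t object_version = 0;  // user-visible version, must only increase

  // On every write, bump the user-visible version so it keeps increasing even
  // though the underlying pg version differs between the cache and backend pools.
  void bump_object_version(uint64_t carried_from_other_pool) {
    object_version = std::max(object_version + 1, carried_from_other_pool);
  }
};
</code></pre>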
Objecter behavior (sketch below):
* Any IO (read or write) will first be directed at the cache pool.
* If the OSD replies with EAGAIN (or some similar error code), we resend the request to the backend pool.
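A minimal sketch of the client-side redirect, assuming a stand-in send_to_pool() callback in place of the real Objecter submission path:

<pre><code class="cpp">
#include <cerrno>
#include <cstdint>
#include <functional>

// Illustrative client-side redirect logic; send_to_pool() stands in for the
// Objecter submitting the op to a given pool and returning its result.
int submit_op(int64_t cache_pool, int64_t backend_pool,
              const std::function<int(int64_t pool)>& send_to_pool) {
  // Always try the cache pool first.
  int r = send_to_pool(cache_pool);
  if (r == -EAGAIN) {
    // The cache pool does not have the object (and chose not to promote);
    // fall through to the backend pool.
    r = send_to_pool(backend_pool);
  }
  return r;
}
</code></pre>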
OSD behavior (sketch below):
* If the object exists in the cache pool, it is assumed to be complete and up to date.  We process the read and return.
* If the object does not exist, we can either:
** return EAGAIN, or
** block, read the object from the backend pool, and then satisfy the request (or return ENOENT).
* If the operation is a multi-object operation (clonerange, etc.), we can proceed if we have all copies; if not, we ensure that all copies have been written back and then return EAGAIN.
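A sketch of the miss-handling decision; CacheDecision, object_in_cache, and promote_on_miss are illustrative names for this sketch rather than actual Ceph identifiers:

<pre><code class="cpp">
// Illustrative decision logic for the cache-pool OSD read/write path.
enum class CacheDecision {
  PROCESS,        // object is present; serve the op from the cache pool
  REPLY_EAGAIN,   // tell the client to retry against the backend pool
  PROMOTE_FIRST,  // block, pull the object from the backend pool, then serve
};

CacheDecision decide(bool object_in_cache, bool promote_on_miss) {
  if (object_in_cache) {
    // A cached object is assumed to be complete and up to date.
    return CacheDecision::PROCESS;
  }
  return promote_on_miss ? CacheDecision::PROMOTE_FIRST
                         : CacheDecision::REPLY_EAGAIN;
}
</code></pre>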
Object creation:
* In certain cases we don't care whether the object previously existed.  Make a helper to determine whether this is the case for a given transaction (check for things like a WRITEFULL, or a REMOVE that precedes the op), as sketched below.  If true, process the write immediately without checking the backend pool.
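A sketch of such a helper, under the assumption that the transaction is visible as a flat list of op codes; Op and ignores_prior_contents() are hypothetical names, not the real MOSDOp structures:

<pre><code class="cpp">
#include <vector>

// Simplified stand-in for the ops carried in a client transaction.
enum class Op { WRITEFULL, REMOVE, WRITE, READ, APPEND };

// Does this transaction overwrite the object regardless of any prior contents?
bool ignores_prior_contents(const std::vector<Op>& ops) {
  for (Op op : ops) {
    if (op == Op::WRITEFULL || op == Op::REMOVE) {
      // The object is replaced (or removed) wholesale before anything depends
      // on its prior state, so we need not consult the backend pool.
      return true;
    }
    if (op == Op::WRITE || op == Op::APPEND || op == Op::READ) {
      // Partial writes and reads depend on the existing object contents.
      return false;
    }
  }
  return false;
}
</code></pre>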
Cache eviction (sketch below):
* Cache eviction is handled by the OSD, independently for each PG.
* To evict an object, the OSD issues a WRITEFULL of the object to the backend pool.
* If that completes with no intervening read or write to the object (i.e., it is still cold), we remove the object from the cache pool.
* We can also silently back off if we decide the object is hot again.
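A sketch of the per-object eviction flow; the hooks (flush_to_backend, was_touched_since, remove_from_cache) are hypothetical stand-ins for the OSD's writeback and delete paths:

<pre><code class="cpp">
// Illustrative per-object eviction flow.
struct EvictionHooks {
  bool (*flush_to_backend)(const char* oid);   // WRITEFULL to the backend pool
  bool (*was_touched_since)(const char* oid);  // any read/write since the flush started?
  void (*remove_from_cache)(const char* oid);
};

void try_evict(const char* oid, const EvictionHooks& h) {
  // Write the current contents back to the backend pool.
  if (!h.flush_to_backend(oid))
    return;                      // flush failed; leave the object cached
  // Only drop the cached copy if the object stayed cold during the flush.
  if (h.was_touched_since(oid))
    return;                      // object became hot again; silently back off
  h.remove_from_cache(oid);
}
</code></pre>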
New librados operations (sketch below):
* PROMOTE: promote an object from the backend store.
* FLUSH: write back any "dirty" changes in the cache pool to the backend pool.
* EVICT: evict an object from the cache pool.
* (These operations would all be special-cased to avoid the normal cache/EAGAIN checks, possibly with their own rados op type.)
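A hypothetical sketch of how these operations might be represented and special-cased on the OSD side; none of these names correspond to existing CEPH_OSD_OP_* constants or librados calls:

<pre><code class="cpp">
// Hypothetical op codes for the proposed operations (illustrative only).
enum class CacheOp {
  PROMOTE,   // pull the object from the backend store into the cache pool
  FLUSH,     // write any dirty cached data back to the backend pool
  EVICT,     // drop the cached copy (after a successful flush)
};

// These ops would bypass the normal "redirect on miss" check: a PROMOTE of a
// missing object must not return EAGAIN, since fetching it is the whole point.
bool bypasses_cache_redirect(CacheOp op) {
  return op == CacheOp::PROMOTE || op == CacheOp::FLUSH || op == CacheOp::EVICT;
}
</code></pre>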
Eviction policy (sketch below):
* The pg_pool_t describes the policy, probably something like a high-water mark that triggers eviction on the OSDs.
* The PG should use a bloom filter to approximate object temperature.
* To evict, the PG can enumerate objects and evict any object that is not warm.
* We *could* store some additional metadata (like atime) if we think our pool is fast enough, perhaps in leveldb.
* The eviction code/policy should be modular so that we can adjust this approach as we go.
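A deliberately tiny bloom-filter sketch of the warm/cold test; a real implementation would likely reuse Ceph's bloom_filter and rotate filters over time windows, and the sizes and hashing below are illustrative assumptions:

<pre><code class="cpp">
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Approximate "recently touched" set for object temperature.
class TemperatureFilter {
  static constexpr size_t BITS = 1 << 16;   // filter size chosen arbitrarily for the sketch
  std::bitset<BITS> bits;

  static size_t h1(const std::string& oid) { return std::hash<std::string>{}(oid) % BITS; }
  static size_t h2(const std::string& oid) { return std::hash<std::string>{}(oid + "#") % BITS; }

public:
  // Record an access: mark the object as recently touched.
  void touch(const std::string& oid) {
    bits.set(h1(oid));
    bits.set(h2(oid));
  }

  // Probably-warm test: false positives are possible, false negatives are not.
  bool is_warm(const std::string& oid) const {
    return bits.test(h1(oid)) && bits.test(h2(oid));
  }
};
</code></pre>

A false positive only keeps a cold object cached a little longer, which is the safe direction for a cache; a cold-looking hot object is never produced.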
h3. Work items

h4. Coding tasks

# independent object_info_t version: plumb through rados, MOSDOp, Objecter, librados
# pg_pool_t: cache_pool property
# objecter: send requests to the cache pool, then the regular pool
# osd: ability to read or write objects to/from the backend pool (objecter or push/pull? the latter, I think)
# osd: basic IO decision: read/write from/to the cache pool, or EAGAIN
# test: manually populate the cache pool, run unit/stress tests
# librados, osd: explicit promote operation
# osd: transparently promote objects on read/write
# librados, osd: explicit flush operation (write back the current value)
# librados, osd: explicit evict operation
# temperature tracking (this overlaps with the other blueprint!)
# implement eviction policy