Rados cache pool (part 2) » History » Version 1

Jessica Mack, 06/22/2015 01:58 AM

h1. Rados cache pool (part 2)

h3. Summary

Balance of work to create a cache pool tier

h3. Owners

* Sage Weil (Inktank)
* Greg Farnum (Inktank)

h3. Interested Parties

* Mike Dawson (Cloudapt)
* Yan, Zheng (Intel)
* Jiangang, Duan (Intel)
* Jian, Zhang (Intel)

h3. Current Status

About half to two-thirds of the work has been completed:
* copy-get and copy-from rados primitives
* objecter cache redirect logic (first read from cache tier, then from base pool)
* promote on read

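The read path listed above (try the cache tier first, fall back to the base pool, promote on read) can be illustrated with a toy in-memory model. This is only a sketch: @TieredPool@, its dictionaries, and the @read@ method are hypothetical names chosen for illustration and do not correspond to the actual objecter or OSD code.

```python
class TieredPool:
    """Toy model of a base pool fronted by a cache tier (illustrative only)."""

    def __init__(self, base):
        self.base = dict(base)   # backing store: object name -> data
        self.cache = {}          # cache tier: object name -> data
        self.promotions = 0      # number of promote-on-read events

    def read(self, name):
        # 1. Try the cache tier first.
        if name in self.cache:
            return self.cache[name]
        # 2. Miss: fall back to the base pool (raises KeyError if absent).
        data = self.base[name]
        # 3. Promote on read: copy the object into the cache tier.
        self.cache[name] = data
        self.promotions += 1
        return data


pool = TieredPool({"obj_a": b"hello", "obj_b": b"world"})
first = pool.read("obj_a")    # miss: served from base, then promoted
second = pool.read("obj_a")   # hit: served from the cache tier
```

In this model the second read never touches the base pool, and @obj_b@ stays out of the cache until something reads it.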
Much of the logic is written but not yet merged:
* dirty, whiteout metadata
* flush
* evict
* HitSet bloom filter (or explicit enumeration) tracking of I/Os

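To make the HitSet idea concrete, here is a minimal Bloom-filter sketch that records which objects were touched during an interval. The class name @BloomHitSet@, the bit-array size, and the SHA-256 based hashing are all assumptions made for this example; the real HitSet encoding may differ. The "explicit enumeration" variant would amount to a plain set of object names.

```python
import hashlib


class BloomHitSet:
    """Toy Bloom-filter hit set: remembers which objects saw I/O."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, name):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{name}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def insert(self, name):
        for pos in self._positions(name):
            self.bits |= 1 << pos

    def contains(self, name):
        # May return a false positive, never a false negative.
        return all((self.bits >> pos) & 1 for pos in self._positions(name))


hits = BloomHitSet()
hits.insert("rbd_data.1234")
```

The trade-off is the usual one: the filter uses fixed memory regardless of how many objects are touched, at the cost of occasionally reporting a hit for an object that was never accessed.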
Balance of effort:
* hitset expiration
* recover hitset when the PG is recovered, migrated, or otherwise moved
* [optional] preserve in-memory hitset across peering intervals
* stress tests that specifically exercise and validate dirty, whiteout, evict, flush, hitsets
* policy metadata for when to flush/evict from cache
* agent process or thread that evicts from cache when it approaches the high water mark

h3. Detailed Description

hitset expiration
* osd logic to delete old hitsets (and replicate that deletion) once they are too old or the max count is reached, or when the pool max values are adjusted

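The expiration rule above can be sketched as a trim over an ordered history of hit sets. The function name @trim_hitsets@ and the @max_count@ / @max_age@ parameters are hypothetical; replication of the deletion is not modelled, the removed entries are simply returned to the caller.

```python
from collections import deque


def trim_hitsets(history, now, max_count, max_age):
    """Drop the oldest hit sets until both limits are satisfied.

    history is a deque of (start_time, hitset) tuples, oldest first.
    Returns the removed entries so a caller could replicate the deletion.
    """
    removed = []
    while history and (
        len(history) > max_count or now - history[0][0] > max_age
    ):
        removed.append(history.popleft())
    return removed


# Four hit sets started at t=0, 10, 20, 30.
# Keep at most 3 of them, and none older than 25 seconds.
history = deque((t, {f"obj{t}"}) for t in (0, 10, 20, 30))
removed = trim_hitsets(history, now=40, max_count=3, max_age=25)
```

Calling the same function again after lowering @max_count@ covers the case where the pool limits are adjusted: the history is trimmed to the new limit on the next pass.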
policy metadata for flush/evict from cache
* add pg_pool_t properties to control when we should
** flush dirty metadata
** evict old items because the pool is getting full
** evict any item because it is older than X

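One possible shape for these policy properties is sketched below. The field names (@dirty_flush_ratio@, @full_evict_ratio@, @max_age_seconds@) are invented for this example and are not actual pg_pool_t members; the decision function is a simplification that treats any clean object as evictable once the pool is too full.

```python
from dataclasses import dataclass


@dataclass
class CachePolicy:
    """Hypothetical per-pool policy knobs (not real pg_pool_t fields)."""
    dirty_flush_ratio: float   # flush when dirty/capacity exceeds this
    full_evict_ratio: float    # evict when used/capacity exceeds this
    max_age_seconds: float     # act on anything idle longer than this ("X")


def decide(policy, used, dirty, capacity, obj_idle_seconds, obj_dirty):
    """Return 'flush', 'evict', or 'keep' for a single object."""
    too_old = obj_idle_seconds > policy.max_age_seconds
    if obj_dirty:
        # A dirty object has to be flushed before it can be evicted.
        if too_old or dirty / capacity > policy.dirty_flush_ratio:
            return "flush"
        return "keep"
    if too_old or used / capacity > policy.full_evict_ratio:
        return "evict"
    return "keep"


policy = CachePolicy(dirty_flush_ratio=0.4, full_evict_ratio=0.8,
                     max_age_seconds=3600)
```

Keeping the thresholds as ratios rather than absolute sizes is a design choice of this sketch; absolute byte or object counts would work equally well.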
cache agent
* this might be a thread, or a Python client, or a separate daemon.  discuss.
* periodically check pool metadata (stats) vs policy
* start at random point in pool and iterate over objects
** pull hitset history for current position
** estimate idle time for each object
** if it meets some criteria, flush or evict
** move to next object; pull new hitset metadata as needed
* include some mechanism to throttle

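The agent steps above can be sketched as a single pass over the pool. Everything here is an assumption made for illustration: hit sets are modelled as plain Python sets ordered newest first, idle time is measured in hit-set intervals rather than seconds, and the throttle is a simple cap on actions per pass.

```python
import random


def estimate_idle_intervals(name, hitset_history):
    """Count how many of the most recent hit sets do not contain the object.

    hitset_history is ordered newest first; each entry is a set of names.
    """
    idle = 0
    for hitset in hitset_history:
        if name in hitset:
            break
        idle += 1
    return idle


def agent_pass(objects, dirty, hitset_history, min_idle, max_ops, rng=random):
    """One pass of a toy cache agent; returns a list of (action, name)."""
    actions = []
    if not objects:
        return actions
    start = rng.randrange(len(objects))      # start at a random position
    for i in range(len(objects)):
        if len(actions) >= max_ops:          # crude throttle: cap work per pass
            break
        name = objects[(start + i) % len(objects)]
        if estimate_idle_intervals(name, hitset_history) < min_idle:
            continue                         # recently used: leave it alone
        actions.append(("flush" if name in dirty else "evict", name))
    return actions


objects = ["a", "b", "c", "d"]
history = [{"a"}, {"a", "b"}, {"c"}]         # newest first
actions = agent_pass(objects, dirty={"c"}, hitset_history=history,
                     min_idle=2, max_ops=10, rng=random.Random(0))
```

With these inputs, @a@ and @b@ were touched recently and are skipped, the dirty object @c@ is flushed, and @d@, which appears in no hit set, is evicted. The order of the actions depends on the random starting point.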
cachemode_invalidate_forward
* implement policy
* build a test that adds a cache, populates it, drains it, and disables the cache
** add tests to the suite that do this in parallel with a running workload?

stress tests
* extend rados model to simply exercise flush and evict
* some sort of test to stress the hitset tracking code
* stress workload that promotes new data and forces eviction of old data (i.e. degenerate streaming workload)
* expand qa suite with cache pool tests
** explicit stress tests (above)
** enable/populate/drain/disable cache pool (and loop) in parallel with other workloads

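The degenerate streaming workload mentioned above can be modelled in a few lines: every object is touched exactly once, so each promotion beyond the cache capacity forces an eviction. @TinyCache@ is a hypothetical stand-in for the cache tier, and it uses plain LRU eviction as a substitute for the hitset-driven agent, which is an assumption of this sketch.

```python
from collections import OrderedDict


class TinyCache:
    """Fixed-capacity cache with LRU eviction, standing in for the cache tier."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.objects = OrderedDict()
        self.evicted = []

    def promote(self, name):
        if name in self.objects:
            self.objects.move_to_end(name)   # already cached: refresh recency
            return
        self.objects[name] = True
        while len(self.objects) > self.capacity:
            old, _ = self.objects.popitem(last=False)  # evict least recently used
            self.evicted.append(old)


def streaming_workload(cache, num_objects):
    """Touch each object exactly once, so nothing is ever re-read."""
    for i in range(num_objects):
        cache.promote(f"obj{i}")


cache = TinyCache(capacity=4)
streaming_workload(cache, num_objects=10)
```

A real stress test would additionally verify object contents after each flush and evict; this model only shows why such a workload keeps the eviction path permanently busy.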
h3. Work items

h4. Coding tasks

# hitset expiration
# policy metadata
# cache agent
# stress tests

h4. Documentation tasks

# document tiering framework
# document cache configuration, usage
## include limitations (e.g., PGLS results not cache coherent)