Improvement on the cache tiering eviction

Summary
In our previous performance evaluation of cache tiering, when 10% of the data receives over 80% of the accesses and the cache size exceeds 10-20% of the total data set size, performance is much lower than expected. This implies that the cache eviction algorithm doesn't perform well.
At the Infernalis CDS, Sage proposed a similar blueprint (http://wiki.ceph.com/Planning/Blueprints/Infernalis/cache_tier_improvements%3A_hitsets%2C_proxy_write) with the idea of improving the hit sets. This blueprint essentially continues that work.

Owners
Zhiqiang Wang (Intel)

Interested Parties
If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.

Current Status
In the current implementation, recency is measured using the hit sets and 'mtime'. Objects in the same hit set get the same recency, which is coarse-grained. For objects that are not in any hit set, recency is calculated from 'mtime'. Since 'mtime' only changes on write ops, recency derived from it does not take reads into account. Both of these lead to inaccurate recency.
An object's recency is used to calculate its position (lower bound and upper bound) in the histogram. Objects are evicted based on the calculated 'upper bound' and the 'evict_effort'. Since 'evict_effort' is global to a PG, it is possible that an object with a small recency is evicted while an object with a large recency is not. Is this reasonable?
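
To make this concrete, here is a simplified sketch of how the histogram position and 'evict_effort' could interact. It is illustrative only: the names, the fixed-point units (millionths), and the structure are assumptions and do not match the actual Ceph agent code.

// Illustrative sketch, not the actual Ceph agent code.
#include <cstddef>
#include <cstdint>
#include <vector>

struct EvictSketch {
  // Histogram over "temperature" buckets; hotter objects land in
  // higher buckets, e.g. according to their recency.
  std::vector<uint64_t> temp_hist;
  uint64_t hist_total = 0;          // sum over all buckets
  uint64_t evict_effort = 200000;   // per-PG effort in millionths (assumed)

  // An object's position in the histogram as [lower, upper) bounds,
  // expressed in millionths of the PG's object population.
  void position(size_t bucket, uint64_t* lower, uint64_t* upper) const {
    uint64_t below = 0;
    for (size_t i = 0; i < bucket; ++i)
      below += temp_hist[i];
    *lower = below * 1000000 / hist_total;
    *upper = (below + temp_hist[bucket]) * 1000000 / hist_total;
  }

  // Evict only if even the upper bound falls inside the fraction of
  // the histogram the PG is willing to evict. Every object in the
  // same bucket gets the same decision: this is the coarse
  // granularity questioned above.
  bool should_evict(size_t bucket) const {
    if (hist_total == 0)
      return false;
    uint64_t lower, upper;
    position(bucket, &lower, &upper);
    return upper < evict_effort;
  }
};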

Detailed Description
Some thoughts, not yet a solution:
- Recency
Use 'atime'? We need a good way to keep track of it. Is it affordable to keep a list of the 'atime' of cached objects in memory? Persisting it to disk would turn reads into writes, which is not good. Or keep an LRU list of the cached objects? (See the first sketch after this list.)
- Reuse distance
The reuse distance of an access to an object is the number of other distinct objects referenced since the previous access to the same object. See the LIRS algorithm (http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-02-6.pdf). We could calculate the reuse distance from the hit sets, or from 'atime' if we track it. (See the second sketch after this list.)
- LRU variants
Is it affordable to keep the object lists in memory? Consider ARC (https://www.usenix.org/legacy/events/fast03/tech/full_papers/megiddo/megiddo.pdf), which is used in the IBM DS8000 storage system, or LIRS. Among the LRU variants, these two are the most practical. (See the ARC sketch after this list.)
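
As a concrete illustration of the LRU-list idea in the 'Recency' item, here is a minimal sketch of an in-memory LRU keyed by object id. 'ObjectId' is a hypothetical stand-in for Ceph's hobject_t; all names are illustrative. Each entry costs one object id plus list/map overhead, which is the core of the "is it affordable" question.

#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Stand-in for Ceph's hobject_t.
using ObjectId = std::string;

class AtimeLRU {
  std::list<ObjectId> order_;  // front = most recently accessed
  std::unordered_map<ObjectId, std::list<ObjectId>::iterator> pos_;

public:
  // Record an access (read or write), moving the object to the front.
  void touch(const ObjectId& oid) {
    auto it = pos_.find(oid);
    if (it != pos_.end())
      order_.erase(it->second);
    order_.push_front(oid);
    pos_[oid] = order_.begin();
  }

  // The least recently accessed object is the eviction candidate.
  bool coldest(ObjectId* out) const {
    if (order_.empty())
      return false;
    *out = order_.back();
    return true;
  }

  // Forget an object once it is evicted or deleted.
  void erase(const ObjectId& oid) {
    auto it = pos_.find(oid);
    if (it == pos_.end())
      return;
    order_.erase(it->second);
    pos_.erase(it);
  }

  std::size_t size() const { return pos_.size(); }
};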
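The second sketch measures reuse distance offline from an access trace, e.g. one reconstructed from successive hit sets. The naive inner loop costs O(trace length) per access; the LIRS paper maintains a stack structure to compute this efficiently online.

#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Reuse distance of each access in a trace: the number of distinct
// other objects referenced since the previous access to the same
// object, or -1 for a first access (infinite distance).
std::vector<long> reuse_distances(const std::vector<std::string>& trace) {
  std::unordered_map<std::string, std::size_t> last_pos;
  std::vector<long> dist;
  dist.reserve(trace.size());
  for (std::size_t i = 0; i < trace.size(); ++i) {
    auto it = last_pos.find(trace[i]);
    if (it == last_pos.end()) {
      dist.push_back(-1);  // first access
    } else {
      std::unordered_set<std::string> uniq;
      for (std::size_t j = it->second + 1; j < i; ++j)
        uniq.insert(trace[j]);
      dist.push_back(static_cast<long>(uniq.size()));
    }
    last_pos[trace[i]] = i;
  }
  return dist;
}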
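Finally, a compact sketch of ARC itself, following the pseudocode in the paper linked above. It tracks object ids only; how to fit such a structure into the OSD's memory budget is exactly the open question.

#include <algorithm>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// ARC sketch after Megiddo & Modha. T1/T2 hold cached ids, B1/B2 are
// the "ghost" history lists, and p_ is the adaptive target size of T1.
class ArcSketch {
  using Key = std::string;
  enum ListId { T1, T2, B1, B2 };
  struct Entry { std::list<Key>::iterator it; ListId where; };

  std::size_t c_;   // cache capacity in objects
  double p_ = 0;    // adaptive target for |T1|
  std::list<Key> t1_, t2_, b1_, b2_;   // MRU at front
  std::unordered_map<Key, Entry> map_;

  std::list<Key>& list(ListId w) {
    switch (w) {
      case T1: return t1_;
      case T2: return t2_;
      case B1: return b1_;
      default: return b2_;
    }
  }

  void move_to(Key k, ListId dst) {  // by value: k may live in a list
    Entry& e = map_[k];
    list(e.where).erase(e.it);
    list(dst).push_front(k);
    e.it = list(dst).begin();
    e.where = dst;
  }

  void drop_lru(ListId w) {
    Key k = list(w).back();
    list(w).pop_back();
    map_.erase(k);
  }

  // REPLACE: demote the LRU of T1 or T2 into its ghost list.
  void replace(bool hit_in_b2) {
    if (!t1_.empty() &&
        (t1_.size() > p_ || (hit_in_b2 && t1_.size() == (std::size_t)p_)))
      move_to(t1_.back(), B1);
    else
      move_to(t2_.back(), B2);
  }

public:
  explicit ArcSketch(std::size_t c) : c_(c) {}

  // Returns true on a cache hit; on a miss the id is admitted.
  bool access(const Key& k) {
    auto f = map_.find(k);
    if (f != map_.end() && (f->second.where == T1 || f->second.where == T2)) {
      move_to(k, T2);  // hit: promote to MRU of T2
      return true;
    }
    if (f != map_.end() && f->second.where == B1) {
      // Ghost hit in B1: recency is winning, grow T1's target.
      p_ = std::min((double)c_,
                    p_ + std::max(1.0, (double)b2_.size() / b1_.size()));
      replace(false);
      move_to(k, T2);
      return false;
    }
    if (f != map_.end() && f->second.where == B2) {
      // Ghost hit in B2: frequency is winning, shrink T1's target.
      p_ = std::max(0.0,
                    p_ - std::max(1.0, (double)b1_.size() / b2_.size()));
      replace(true);
      move_to(k, T2);
      return false;
    }
    // Complete miss.
    std::size_t total = t1_.size() + t2_.size() + b1_.size() + b2_.size();
    if (t1_.size() + b1_.size() == c_) {
      if (t1_.size() < c_) { drop_lru(B1); replace(false); }
      else                 { drop_lru(T1); }
    } else if (total >= c_) {
      if (total == 2 * c_)
        drop_lru(B2);
      replace(false);
    }
    t1_.push_front(k);
    map_[k] = Entry{t1_.begin(), T1};
    return false;
  }
};

Since B1 and B2 store ids only, ARC's metadata is bounded by roughly 2c entries (directory plus ghost lists), which makes the memory question above straightforward to quantify.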

Work items

Coding tasks
Task 1
Task 2
Task 3

Build / release tasks
Task 1
Task 2
Task 3

Documentation tasks
Task 1
Task 2
Task 3

Deprecation tasks
Task 1
Task 2
Task 3