Cache Tiering - Improve efficiency of read-miss operations

Suggested changes to the way read misses are fulfilled from the cache tier, to reduce their cost in response time, network traffic and cache capacity.

Narendra Narang (Red Hat)

Interested Parties
If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.
Shinobu Kinjo (Red Hat)
Name (Affiliation)

Current Status
Write operations to a cache tier are 3x replicated for durability. A read of data that is not in the cache tier (aka a read-miss operation) is also 3x replicated, i.e. the data for a read-miss operation is fetched from the backing tier and 3 copies of it are stored in the cache tier.

Detailed Description
A "cache" tier would typically be configured as a 3x replicated pool. Writes to a cache tier follow the same rules for durability and immediate consistency as any replicated pool, i.e. CRUSH -> primary OSD, secondary OSD, tertiary OSD. All 3 write ops must be committed to the respective OSD journals before an acknowledgement of commit is sent back to the client. Then at some point, based on the LRU algorithm, the writes are aged out to a backing tier (most likely configured as an erasure coded pool).
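The write path described above can be illustrated with a toy model. This is only a sketch, not Ceph code: the class `ToyCacheTier`, its `journals`, `cache` and `backing` attributes, and the object-count `capacity` limit are all hypothetical names and simplifications introduced for illustration.

```python
from collections import OrderedDict


class ToyCacheTier:
    """Toy model of a replicated cache tier in front of a backing tier."""

    def __init__(self, capacity, replicas=3):
        self.capacity = capacity                       # max objects kept in the cache tier
        self.journals = [[] for _ in range(replicas)]  # one journal per OSD (primary, secondary, tertiary)
        self.cache = OrderedDict()                     # object name -> data, least recently used first
        self.backing = {}                              # stand-in for the backing tier

    def write(self, name, data):
        # Commit the write to every replica's journal.
        for journal in self.journals:
            journal.append((name, data))
        # The client is acknowledged only if all replicas hold the commit.
        acked = all(j[-1] == (name, data) for j in self.journals)
        self.cache[name] = data
        self.cache.move_to_end(name)                   # mark as most recently used
        self._age_out()
        return acked

    def _age_out(self):
        # Age the least recently used objects out to the backing tier.
        while len(self.cache) > self.capacity:
            old_name, old_data = self.cache.popitem(last=False)
            self.backing[old_name] = old_data


tier = ToyCacheTier(capacity=2)
for name in ("a", "b", "c"):
    tier.write(name, name.encode())
```

After the three writes, each of the 3 journals holds 3 entries, and object "a" has been aged out to the stand-in backing tier because the toy cache holds only 2 objects.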

However, in the case of reads, there are 2 possibilities:
  • First, a read operation that is served directly from the cache tier. This "cache hit" scenario is ideal because there is no additional operation to locate and read/promote the data from the backing tier.
  • Second, and less ideal, a read "cache miss", where the read I/O request cannot be fulfilled from the cache tier, so the data has to be fetched from the backing tier and promoted to the cache tier. Ceph first promotes the data from the backing tier to the cache tier and then also copies this data, over the network, to make 2 more copies elsewhere in the cache pool. Basically, it promotes, copies and then stores 3 copies across the cluster's cache pool before it responds to and fulfills the read I/O request.
The read miss behavior is expensive for the following reasons:
  • It waits to serve the read I/O request until 3 copies are stored in the cache tier, which increases response time.
  • It has to copy this "redundant" data over the network, which adds traffic overhead.
  • It "populates" copies of this data unnecessarily on expensive SSDs, which reduces the efficiency (cost/performance) of this fast tier.
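The network and SSD costs listed above can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: `promotion_cost` is a hypothetical helper, the 4 MiB object size is an assumed value, and the fetch from the backing tier is counted as a single object-sized transfer (a simplification, since an erasure coded backing tier would gather chunks from several OSDs). It does not model response time.

```python
def promotion_cost(object_bytes, cache_copies):
    """Return (network_bytes, ssd_bytes) for promoting one object on a read miss.

    Assumes one object-sized transfer from the backing tier into the cache
    tier, plus one object-sized transfer for each additional cache replica.
    """
    backing_fetch = object_bytes
    replica_traffic = object_bytes * (cache_copies - 1)
    network_bytes = backing_fetch + replica_traffic
    ssd_bytes = object_bytes * cache_copies
    return network_bytes, ssd_bytes


OBJECT_BYTES = 4 * 1024 * 1024  # assumed 4 MiB object, for illustration only

current = promotion_cost(OBJECT_BYTES, cache_copies=3)  # behavior described above
single = promotion_cost(OBJECT_BYTES, cache_copies=1)   # hypothetical single-copy promotion
```

Under these assumptions, promoting with 3 copies moves and stores three times as many bytes as keeping a single copy in the cache tier for a read miss.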

For a write, storing 3x copies in the cache tier is desirable for durability. However, the same behavior is not ideal for read (miss) operations, since the read request is directed to the primary OSD anyway. In the event of a failure of either the primary OSD or the primary OSD's node, Ceph could locate and promote the data from the alternate OSDs.

Work items
This section should contain a list of work tasks created by this blueprint. Please include engineering tasks as well as related build/release and documentation work. If this blueprint requires cleanup of deprecated features, please list those tasks as well.

Coding tasks
Task 1
Task 2
Task 3

Build / release tasks
Task 1
Task 2
Task 3

Documentation tasks
Task 1
Task 2
Task 3

Deprecation tasks
Task 1
Task 2
Task 3