Feature #14039

Feature #14031: EC overwrites

ECBackend cache extents with unapplied writes

Added by Samuel Just about 5 years ago. Updated about 1 year ago.

Status: Fix Under Review
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 80%
Source: other
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

While a particular stripe has pending unapplied writes, we need to cache the unstable data to serve reads.

History

#1 Updated by Tomy Cheru over 4 years ago

  • Status changed from New to In Progress

#2 Updated by Tomy Cheru over 4 years ago

  • Assignee set to Tomy Cheru
  • Start date changed from 12/09/2015 to 05/04/2016

#3 Updated by Tomy Cheru over 4 years ago

  • % Done changed from 0 to 10

When an erasure-coded pool receives a <K,V>, the value is inflated to the next "stripe size" boundary by padding with zeroes if required.
It is then split into stripes of stripe size, and each stripe is chunked into "k" data chunks.
"m" parity chunks are added to each stripe.
Each chunk is assigned a rank from 0 to "k+m-1".
The process continues until all stripes of the inflated value are processed.
All chunks with the same rank then form the shard of that rank.
Each shard so formed is written by the primary OSD to itself or to another associated OSD.
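The mapping described above can be sketched as plain arithmetic. This is a minimal illustration only: `StripeLayout` and its member names are hypothetical and are not Ceph types, and the sketch assumes the stripe size is an exact multiple of "k".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a Ceph type): maps logical object offsets onto
// stripes, chunk ranks, and shard offsets for a k+m layout.
struct StripeLayout {
  uint64_t stripe_size;  // bytes of user data per stripe
  unsigned k;            // data chunks per stripe
  unsigned m;            // parity chunks per stripe

  uint64_t chunk_size() const {
    assert(k > 0 && stripe_size % k == 0);  // assumption of this sketch
    return stripe_size / k;
  }

  // Ranks run from 0 to k+m-1, so there are k+m shards.
  unsigned shard_count() const { return k + m; }

  // Round a logical length up to the next stripe boundary (zero padding).
  uint64_t padded_length(uint64_t len) const {
    return (len + stripe_size - 1) / stripe_size * stripe_size;
  }

  uint64_t stripe_count(uint64_t len) const {
    return padded_length(len) / stripe_size;
  }

  // Rank (0..k-1) of the data chunk that holds logical byte `off`.
  unsigned data_rank(uint64_t off) const {
    return static_cast<unsigned>((off % stripe_size) / chunk_size());
  }

  // Byte offset of logical byte `off` inside the shard of that rank.
  uint64_t shard_offset(uint64_t off) const {
    return (off / stripe_size) * chunk_size() + off % chunk_size();
  }

  // Every shard, data or parity, holds one chunk per stripe.
  uint64_t shard_length(uint64_t len) const {
    return stripe_count(len) * chunk_size();
  }
};
```

With a 4k stripe and k=4, m=2, a 5000-byte value is padded to two stripes, and each of the six shards ends up holding two 1024-byte chunks.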

Currently, only the append operation is possible on EC pool objects.
For a partial write to an EC object:
1. the associated shards need to be partially updated
2. the "m" parity chunks/striplets need to be recalculated by the associated plugin for each changed data chunk

There are two possibilities for scenario 2:

a. The full stripe is updated; parity can be calculated from the incoming full stripe.
b. Only part of the stripe is updated; the primary must read the remaining part of the same stripe from the associated OSD(s).

(b) causes extra reads from multiple OSDs, which increases partial write latency.
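To make cases (a) and (b) concrete, the following sketch computes which byte ranges would have to be read back to complete the stripes that a write only partially covers. `ranges_to_read` is a hypothetical name, not an ECBackend function, and the sketch ignores the object's current size (a tail beyond the end of the object would be zero padding rather than a read).

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using Range = std::pair<uint64_t, uint64_t>;  // (offset, length)

// Hypothetical helper: returns the logical byte ranges needed to complete
// the first and last stripes touched by a write of `len` bytes at `off`.
// An empty result corresponds to case (a), a stripe-aligned write.
std::vector<Range> ranges_to_read(uint64_t stripe_size, uint64_t off,
                                  uint64_t len) {
  assert(stripe_size > 0);
  std::vector<Range> out;
  if (len == 0) return out;
  const uint64_t end = off + len;
  // Start of the first touched stripe and end of the last touched stripe.
  const uint64_t first = off / stripe_size * stripe_size;
  const uint64_t last_end = (end + stripe_size - 1) / stripe_size * stripe_size;
  if (off > first) out.emplace_back(first, off - first);      // head of stripe
  if (end < last_end) out.emplace_back(end, last_end - end);  // tail of stripe
  return out;
}
```

A 100-byte write at offset 1000 with a 4k stripe yields two ranges, (0, 1000) and (1100, 2996), which is the extra read traffic that case (b) refers to.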

The purpose of this cache is to avoid such latencies.

The other purpose is to cache stripes written between the "apply" and "commit" phases in the delayed-commit approach.
An EC transaction is acknowledged once a sufficient number of "apply" operations are done, and such transactions are queued for commit until a trigger.
However, subsequent reads after apply should see the latest data.
A trivial way is to read all base shards from the respective OSD(s) and apply the cached stripes, which saves multiple reads of temporary objects on each OSD.

ec stripe cache

A stripe cache that caches, in full, the stripes touched by a partial incoming write, from Apply to Commit, for EC transactions.

where to keep

The whole purpose of this cache is to hold stripes to assist parity chunk calculation in the partial write path and read buffer assembly in the read path.
Both of these operations for an EC object happen on the primary OSD of the associated PG.
Hence the cache has to be on the primary OSD of the associated PG.
It is initialized on PG/OSD creation.

The following options are available:
1. stripe cache per PG, on its primary OSD
2. stripe cache per OSD, shared by all PGs for which the OSD is primary

Allen suggested an approach that keeps at least a couple of stripes per incoming connection.
Per Sam, implementing this would end up in some fancy code (deferred for later exploration).

what to keep

The following were considered:
1. whole stripe
2. changed extents
3. changed shards

After discussion between Sam and Tomy, it was decided to use option 1.

Stripe size is a property of the pool, 4k by default.
The size of the cache depends heavily on the number of stripes cached and the stripe size.

How to access

The partial write path should update the cache, either by fetching the required stripe or by updating the stripe in the cache.
The append path need not update the cache.
Cache "full" should trigger "commit".

A read should assemble the shards and then merge them with any relevant cached stripes.

ecbackend::encode/decode access the cache for the cases mentioned.

pgid/oid/stripe number are to be used to look up the cached stripe (if any).

Stripe(s) should be invalidated from the cache on "commit".

Coherency with multiple cache/s

rbd/rgw cache (TBD)

#4 Updated by Samuel Just over 4 years ago

Tomy Cheru wrote:

When an erasure-coded pool receives a <K,V>, the value is inflated to the next "stripe size" boundary by padding with zeroes if required.

I don't know what you mean by this. We only pad the end with zeroes, and it's invisible to the user. If we receive a write in the middle of the object, we need to "pad" it with the surrounding data which was previously written (that is, read the stripe). This is out of scope for this task, however.

It is then split into stripes of stripe size, and each stripe is chunked into "k" data chunks.
"m" parity chunks are added to each stripe.
Each chunk is assigned a rank from 0 to "k+m-1".
The process continues until all stripes of the inflated value are processed.
All chunks with the same rank then form the shard of that rank.
Each shard so formed is written by the primary OSD to itself or to another associated OSD.

Currently, only the append operation is possible on EC pool objects.
For a partial write to an EC object:
1. the associated shards need to be partially updated
2. the "m" parity chunks/striplets need to be recalculated by the associated plugin for each changed data chunk

There are two possibilities for scenario 2:

a. The full stripe is updated; parity can be calculated from the incoming full stripe.
b. Only part of the stripe is updated; the primary must read the remaining part of the same stripe from the associated OSD(s).

(b) causes extra reads from multiple OSDs, which increases partial write latency.

The purpose of this cache is to avoid such latencies.

No, it isn't. We may choose to extend it later for that purpose, but the main purpose of this cache is the part you wrote next.

The other purpose is to cache stripes written between the "apply" and "commit" phases in the delayed-commit approach.
An EC transaction is acknowledged once a sufficient number of "apply" operations are done, and such transactions are queued for commit until a trigger.
However, subsequent reads after apply should see the latest data.
A trivial way is to read all base shards from the respective OSD(s) and apply the cached stripes, which saves multiple reads of temporary objects on each OSD.

ec stripe cache

A stripe cache that caches, in full, the stripes touched by a partial incoming write, from Apply to Commit, for EC transactions.

where to keep

The whole purpose of this cache is to hold stripes to assist parity chunk calculation in the partial write path and read buffer assembly in the read path.
Both of these operations for an EC object happen on the primary OSD of the associated PG.
Hence the cache has to be on the primary OSD of the associated PG.
It is initialized on PG/OSD creation.

It should be within ECBackend, constructed during ECBackend construction.

The following options are available:
1. stripe cache per PG, on its primary OSD
2. stripe cache per OSD, shared by all PGs for which the OSD is primary

Allen suggested an approach that keeps at least a couple of stripes per incoming connection.
Per Sam, implementing this would end up in some fancy code (deferred for later exploration).

what to keep

The following were considered:
1. whole stripe
2. changed extents
3. changed shards

After discussion between Sam and Tomy, it was decided to use option 1.

Stripe size is a property of the pool, 4k by default.
The size of the cache depends heavily on the number of stripes cached and the stripe size.

How to access

The partial write path should update the cache, either by fetching the required stripe or by updating the stripe in the cache.
The append path need not update the cache.

The append path does need to update the cache. An update followed by a read or an append would need the
cache same as any other write. Also, append is the only thing possible right now. As part of this PR, I
actually want the cache being used in the OSD when the black box testing flag is set. I also want a ceph-qa-suite
branch enabling that flag for a few of the ec tests before.

Cache "full" should trigger "commit".

Mmmm, cache at <threshold> should trigger outstanding commits. It should never reach <full>.

A read should assemble the shards and then merge them with any relevant cached stripes.

Read should take care to not read cached stripes.

ecbackend::encode/decode access the cache for the cases mentioned.

I'm not sure what you mean by this.

pgid/oid/stripe number are to be used to look up the cached stripe (if any).

Or this.

Stripe(s) should be invalidated from the cache on "commit".

Not invalidated, evicted. If we later choose to keep data cached for longer for performance reasons, a commit
would not invalidate the data.

Coherency with multiple cache/s

Very no. How could there even be multiple caches? The primary has the cache, and there is only one primary. If we
go through peering, all replicas in the next interval will commit up to last_update, and an empty cache would be
correct. Thus, we can simply discard the cache if the primary changes (actually, on any interval change).

rbd/rgw cache (TBD)

This is completely irrelevant, I think.

Please create a branch with the cache design part in doc/dev/osd_internals/ec_stripe_cache.rst. It will merge with
the actual code changes.

#5 Updated by Tomy Cheru over 4 years ago

  • % Done changed from 10 to 20

EC Extent Cache.
----------------
Partial writes on an EC pool will use a Two Phase Commit (TPC) scheme. In the initial TPC "prepare" phase, the partial data is prepared, partial parities are calculated, and the partial data and parity are written to the participating shards as temporary shard objects. Once the partial data and parities are written to the participating shards, a "roll-forward" queue is populated with the current op (to be rolled forward later) and the write is acknowledged to the client. More partial writes accumulate in the same way, and they have to be readable once the client has been acknowledged. Without an extent cache, subsequent reads have to perform multiple shard reads and coalesce the temporary objects, in order, to fulfill a read request. To reduce the number of shard reads needed to fulfill client reads, partially written data will be cached on the primary shard during the write "prepare" phase.

Data in the cache that has not been rolled forward is not evictable. Once the extent cache reaches a high utilization threshold, that event triggers "roll forward" of eligible objects until the low threshold of cache usage is reached. Client writes will not be stalled on reaching the high threshold, but they will be blocked if cache usage reaches full utilization.

The extent cache is designed as follows:

The extent cache is a set of cache objects. It has a predefined/configurable size (pool level) and high and low usage thresholds, and is protected overall by a lock.
Each cache object is a set of cache extents. It has a total cached size, is identified uniquely by "hoid", carries an evictable flag and a reference count, and is protected by a lock.
Each cache extent holds a non-overlapping range of data of the "hoid" object.
The extent cache will not keep versions of partially written objects, nor the order in which they were written; rather, it holds the coalesced version of the "prepare"d data.
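A minimal sketch of the per-object part of this design, to show what "non-overlapping" and "coalesced" imply for inserts. All names here (`ObjectExtents`, `ExtentCacheSketch`) are hypothetical and do not reflect the contents of ECCache.hpp; the hoid is modelled as a plain string, and locking, reference counts, and the evictable flag are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <utility>

// Hypothetical per-object cache: non-overlapping extents keyed by offset.
class ObjectExtents {
 public:
  // Insert `data` at `off`. Newer bytes replace older overlapping bytes,
  // and overlapping or touching extents are merged into a single extent.
  void write(uint64_t off, const std::string& data) {
    if (data.empty()) return;
    uint64_t start = off;
    std::string merged = data;
    // Step back one extent if the predecessor overlaps or touches `off`.
    auto it = extents_.lower_bound(off);
    if (it != extents_.begin()) {
      auto prev = std::prev(it);
      if (prev->first + prev->second.size() >= off) it = prev;
    }
    while (it != extents_.end() && it->first <= off + data.size()) {
      const uint64_t e_off = it->first;
      const std::string& e = it->second;
      if (e_off < start) {  // keep the old bytes in front of the new write
        merged = e.substr(0, start - e_off) + merged;
        start = e_off;
      }
      const uint64_t e_end = e_off + e.size();
      const uint64_t m_end = start + merged.size();
      if (e_end > m_end) merged += e.substr(m_end - e_off);  // old bytes behind
      bytes_ -= e.size();
      it = extents_.erase(it);
    }
    assert(start <= off);
    bytes_ += merged.size();
    extents_[start] = std::move(merged);
  }

  uint64_t bytes() const { return bytes_; }  // total cached size
  const std::map<uint64_t, std::string>& extents() const { return extents_; }

 private:
  std::map<uint64_t, std::string> extents_;
  uint64_t bytes_ = 0;
};

// Hypothetical top level: one ObjectExtents per object, keyed by hoid.
using ExtentCacheSketch = std::map<std::string, ObjectExtents>;
```

Because only the coalesced result is kept, two writes to the same range leave a single extent holding the later bytes, which matches the statement that neither versions nor write order are retained.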

Cache will be initialized on ECBackend init.
src/osd/ECBackend.h

Cache is implemented in src/osd/ECCache.hpp
gtest code in src/test/osd/TestECCache.cc
Enabled with following config files
src/test/Makefile.am
src/test/osd/CMakeLists.txt

The cache is warmed in the write path; "check_op()" is an optimal place. This flags the respective cache object as non-evictable. A simple reference is held during "copyin".
For test purposes, with the current AppendOp, the cache is populated in "generate_transactions()" -> "void operator()(const ECTransaction::AppendOp &op)".

In the read path, "objects_read_async()", for a given <hoid/offset/length>, any extents present in the cache are read from the cache and only the remaining extents are read via the existing read path. CallClientContexts needs to be updated accordingly. A simple reference is held during "copyout".
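The split between cached and remaining extents could look roughly like the sketch below. `missing_ranges` is a hypothetical helper, not part of "objects_read_async()"; it assumes the cached extents of one object are given as a map from offset to length and are non-overlapping.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Range = std::pair<uint64_t, uint64_t>;  // (offset, length)

// Hypothetical helper: given the cached extents of one object
// (offset -> length, non-overlapping), return the parts of the request
// [off, off+len) that are not cached and must go through the shard read path.
std::vector<Range> missing_ranges(const std::map<uint64_t, uint64_t>& cached,
                                  uint64_t off, uint64_t len) {
  assert(off + len >= off);  // no overflow
  std::vector<Range> gaps;
  const uint64_t end = off + len;
  uint64_t pos = off;  // first byte not yet accounted for
  auto it = cached.upper_bound(off);
  if (it != cached.begin()) --it;  // extent starting at or before `off`
  for (; it != cached.end() && it->first < end && pos < end; ++it) {
    const uint64_t c_end = it->first + it->second;
    if (c_end <= pos) continue;  // extent lies entirely before the request
    if (it->first > pos) gaps.emplace_back(pos, it->first - pos);
    pos = c_end;
  }
  if (pos < end) gaps.emplace_back(pos, end - pos);
  return gaps;
}
```

With cached extents (0, 4) and (8, 4), a request for 10 bytes at offset 2 leaves only the range (4, 4) to be fetched from the shards.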

During cache population, if the current op's length pushes usage past the high threshold, "roll forward" will be triggered on the associated "roll-forward queue".
"Roll forward" should ideally be designed to run in a context other than the current write context; otherwise the current write will suffer latency.
Triggering of "roll forward" is not currently tested; a message is simply emitted on reaching the high threshold. "Roll-forward" processing will flag the respective cache object as evictable.
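The threshold behaviour described above can be sketched as a small usage tracker. `WatermarkTracker` and `Admit` are hypothetical names used only for illustration; the sketch does not model the roll-forward queue itself or the separate context it should run in.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Outcome of trying to account for a new op in the cache.
enum class Admit { Ok, TriggerRollForward, Blocked };

// Hypothetical tracker for cache usage with high and low thresholds.
class WatermarkTracker {
 public:
  WatermarkTracker(uint64_t capacity, uint64_t high, uint64_t low)
      : capacity_(capacity), high_(high), low_(low) {
    assert(low <= high && high <= capacity);
  }

  // Account for `len` more cached bytes. The write is admitted unless it
  // would exceed full capacity; passing the high threshold only signals
  // that roll forward should be started.
  Admit add(uint64_t len) {
    if (used_ + len > capacity_) return Admit::Blocked;
    used_ += len;
    return used_ > high_ ? Admit::TriggerRollForward : Admit::Ok;
  }

  // Bytes that roll forward would need to make evictable and release
  // to bring usage back down to the low threshold.
  uint64_t bytes_to_release() const {
    return used_ > low_ ? used_ - low_ : 0;
  }

  void release(uint64_t len) { used_ -= std::min(len, used_); }
  uint64_t used() const { return used_; }

 private:
  uint64_t capacity_, high_, low_;
  uint64_t used_ = 0;
};
```

Returning a signal rather than rolling forward inline keeps the current write off the roll-forward path, in line with the latency concern noted above.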

NOTE on extended use of the extent cache:
Calculating parity needs the entire stripe's data (the delta-parity use case is an exception); however, with partial writes it is possible that only part of a stripe is present in the incoming op. In that case the striplets of the non-participating shards need to be read in to recalculate the parity. The extent cache will be used to cache such striplets too. This reduces the reads needed to fulfill partial writes (the sequential partial write use case). Though this is not the primary purpose of this cache, the cache will be used for this purpose too as an added optimization.

#6 Updated by Samuel Just over 4 years ago

  • Assignee changed from Tomy Cheru to Samuel Just
  • % Done changed from 20 to 60

Written, blocked on the tpc ticket for integration, still needs unit tests (not going to bother writing them until I've written the integration code -- might want to change the interface).

#7 Updated by Samuel Just over 4 years ago

  • Status changed from In Progress to 7

#8 Updated by Samuel Just over 4 years ago

  • % Done changed from 60 to 80

#9 Updated by Patrick Donnelly about 1 year ago

  • Status changed from 7 to Fix Under Review
