Erasure encoded placement group / pool
* Factor reusable components out of PG/ReplicatedPG and have PG/ReplicatedPG and ErasureCodedPG share only those components and a common PG API.
* Advantages:
* We constrain the PG implementations less while still allowing reuse of the common logic.
* Individual components can be tested without needing to instantiate an entire PG.
* We will realize the benefits of better testing as each component is factored out, independently of implementing ErasureCodedPG.
* Some possible common components:
* Peering State Machine: Currently, this is tightly coupled with the PG class. Instead, it becomes a separate component responsible for orchestrating the peering process with a PG implementation via the PG interface. This would allow us to test specific behavior without creating an OSD or a PG.
* ObjectContexts, object context tracking: this probably includes read/write lock tracking for objects
* Repop state?: not sure about this one, might be too different to generalize between ReplicatedPG and ErasureCodedPG
* PG logs, PG missing: The logic for merging an authoritative PG log with another PG log while filling in the missing set would benefit massively from being testable separately from a PG instance. It's possible that the stripes involved in ErasureCodedPG will make this impractical to generalize.
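A hypothetical, drastically simplified sketch of what a separately testable log-merge component could look like; the types, names, and merge rules below are illustrative only, not the existing Ceph classes:

    // Simplified model of merging an authoritative PG log into a local log
    // while filling in the missing set -- small enough to unit test without
    // instantiating an OSD or a PG.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct LogEntry {
      uint64_t version;   // monotonically increasing per-PG version
      std::string oid;    // object touched by this entry
    };

    using PGLog = std::vector<LogEntry>;                 // ordered by version
    using MissingSet = std::map<std::string, uint64_t>;  // oid -> needed version

    // Adopt every authoritative entry the local log has not seen yet and
    // record the corresponding object as missing at that version.
    MissingSet merge_log(const PGLog& authoritative, PGLog& local) {
      MissingSet missing;
      const uint64_t local_head = local.empty() ? 0 : local.back().version;
      for (const auto& e : authoritative) {
        if (e.version > local_head) {
          local.push_back(e);
          missing[e.oid] = e.version;
        }
      }
      return missing;
    }

    int main() {
      PGLog authoritative = {{1, "a"}, {2, "b"}, {3, "a"}};
      PGLog local = {{1, "a"}};
      for (const auto& [oid, v] : merge_log(authoritative, local))
        std::cout << "missing " << oid << " at version " << v << "\n";
      return 0;
    }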
* To isolate Ceph from the actual library being used (jerasure, zfec, fecpp, ...), a wrapper is implemented. Each block is encoded into k data blocks and m parity blocks
* context(k, m, reed-solomon|...) => context* c
* encode(context* c, void* data) => void* chunks[k+m]
* decode(context* c, void* chunk[k+m], int* indices_of_erased_chunks) => void* data // erased chunks are not used
* repair(context* c, void* chunk[k+m], int* indices_of_erased_chunks) => void* chunks[k+m] // erased chunks are rebuilt
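A minimal C++ sketch of what such a wrapper could look like, built around the four calls listed above; the names and signatures are illustrative (the real wrapper would presumably operate on bufferlists and return error codes):

    #include <memory>
    #include <set>
    #include <string>
    #include <vector>

    using Chunk = std::vector<char>;

    class ErasureCodeContext {
    public:
      virtual ~ErasureCodeContext() = default;

      // Encode one block of data into k data chunks followed by m parity chunks.
      virtual std::vector<Chunk> encode(const Chunk& data) = 0;

      // Rebuild the original data from the surviving chunks; chunks whose
      // indices appear in 'erased' are not used.
      virtual Chunk decode(const std::vector<Chunk>& chunks,
                           const std::set<int>& erased) = 0;

      // Recompute the erased chunks so that all k+m chunks exist again.
      virtual std::vector<Chunk> repair(const std::vector<Chunk>& chunks,
                                        const std::set<int>& erased) = 0;

      virtual int k() const = 0;
      virtual int m() const = 0;
    };

    // Corresponds to context(k, m, reed-solomon|...); a concrete subclass
    // would delegate to jerasure, zfec, fecpp, etc.
    std::unique_ptr<ErasureCodeContext>
    make_context(int k, int m, const std::string& technique);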
* The ErasureEncodedPG configuration is set to encode each object into k data objects and m parity objects.
* It uses the parity ('INDEP') crush mode so that placement is intelligent. The indep placement avoids moving a shard between ranks: a mapping of [0,1,2,3,4] will change to [0,6,2,3,4] (or something) if osd.1 fails, so the shards on 2,3,4 won't need to be copied around.
* The ErasureEncodedPG uses k + m OSDs
* Each object is a chunk
* The rank of the chunk is stored in the object attribute
* Each chunk is divided into parts of B bytes that are coded independently. For instance, a 1GB chunk can be divided into 4MB parts such that bytes 4MB to 8MB of each chunk can be processed independently of the rest of the chunk. The k+m parts (one from each chunk) that reside at the same position in their chunk are called a stripe.
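A small sketch of the layout arithmetic this implies, assuming each stripe carries k*B bytes of logical object data (the k=4, m=2 values below are an assumption; only the 1GB chunk / 4MB part figures come from the example above):

    #include <cstdint>
    #include <iostream>

    struct Layout {
      uint64_t k;  // data chunks
      uint64_t m;  // parity chunks
      uint64_t B;  // part size in bytes (per chunk, per stripe)
    };

    struct StripeSpan {
      uint64_t first;  // first stripe index touched
      uint64_t last;   // last stripe index touched (inclusive)
    };

    // Stripes covering the logical range [offset, offset + length); assumes length > 0.
    StripeSpan stripes_for(const Layout& l, uint64_t offset, uint64_t length) {
      const uint64_t stripe_width = l.k * l.B;  // logical bytes per stripe
      return {offset / stripe_width, (offset + length - 1) / stripe_width};
    }

    int main() {
      Layout l{4, 2, 4ull << 20};                        // k=4, m=2, B=4MB (assumed)
      StripeSpan s = stripes_for(l, 20ull << 20, 100);   // 100-byte write at 20MB
      std::cout << "stripes " << s.first << ".." << s.last << "\n";  // stripes 1..1
      // Within each of the k+m chunks, stripe s occupies bytes [s*B, (s+1)*B).
      return 0;
    }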
* ErasureEncodedPG implementation (a toy sketch of these flows follows this list)
* Write a new object that does not need to be divided into parts because it is not too big (4MB for instance)
* encode(context* c, void* data) => void* chunks[k+m]
* write chunk[i] to OSD[i] and set the "chunk_rank" attribute to i
* Write offset, length on an existing object whose chunks are made of a number of independent parts coded separately
* read the stripes containing offset, length
* map the rank of each chunk with the OSD on which it is stored
* for each stripe, decode(context* c, void* chunk[k+m], int* indices_of_erased_chunks) => void* data and append to a bufferlist
* modify the bufferlist according to the write request, overwriting the content that has been decoded with the content supplied by the write
* encode(context* c, void* data) => void* chunks[k+m]
* write chunk[i] to the corresponding OSD[j]
* Read offset, length
* read the stripes containing [offset, offset + length]
* for each stripe, decode(context* c, void* chunk[k+m], int* indices_of_erased_chunks) => void* data and append to a bufferlist
* Object attributes
* duplicate the object attributes on each OSD
* Scrubbing
* for each object, read each stripe and write back the repaired part if necessary
* Repair
* When an OSD is decommissioned and another OSD replaces it, for each object contained in an ErasureEncodedPG that used this OSD, read the object, repair each stripe, and write back the part that resides on the new OSD
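A toy, self-contained sketch of the full-object write and the read/modify/write flow for a partial write described above. The "OSDs" are in-memory maps, the coding is a k-way split with a single XOR parity chunk, and stripe-granularity reads are omitted, so this only illustrates the flow, not the real coding or I/O:

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    using Chunk = std::vector<char>;

    // Toy encode: split 'data' into k chunks of equal size (zero padded) and
    // append one XOR parity chunk (i.e. m == 1 in this sketch).
    std::vector<Chunk> encode(const Chunk& data, int k) {
      const size_t chunk_size = (data.size() + k - 1) / k;
      std::vector<Chunk> chunks(k + 1, Chunk(chunk_size, 0));
      for (size_t i = 0; i < data.size(); ++i)
        chunks[i / chunk_size][i % chunk_size] = data[i];
      for (int i = 0; i < k; ++i)                 // parity = XOR of the data chunks
        for (size_t j = 0; j < chunk_size; ++j)
          chunks[k][j] ^= chunks[i][j];
      return chunks;
    }

    // Toy decode: concatenate the k data chunks (erasure handling omitted).
    Chunk decode(const std::vector<Chunk>& chunks, int k, size_t size) {
      Chunk data;
      for (int i = 0; i < k; ++i)
        data.insert(data.end(), chunks[i].begin(), chunks[i].end());
      data.resize(size);
      return data;
    }

    // Stand-in for one OSD: object name -> stored chunk plus a chunk_rank attr.
    struct FakeOSD {
      std::map<std::string, Chunk> objects;
      std::map<std::string, int> chunk_rank;
    };

    // Full-object write: encode once, store chunk i on OSD i, record its rank.
    void write_full(std::vector<FakeOSD>& osds, const std::string& oid,
                    const Chunk& data, int k) {
      std::vector<Chunk> chunks = encode(data, k);
      for (int i = 0; i < k + 1; ++i) {
        osds[i].objects[oid] = chunks[i];
        osds[i].chunk_rank[oid] = i;              // the "chunk_rank" attribute
      }
    }

    // Partial write: read the chunks back, decode, splice in the new bytes,
    // re-encode, and write everything back (the read/modify/write above).
    void write_partial(std::vector<FakeOSD>& osds, const std::string& oid,
                       size_t object_size, uint64_t offset, const Chunk& bytes,
                       int k) {
      std::vector<Chunk> chunks;
      for (int i = 0; i < k + 1; ++i)
        chunks.push_back(osds[i].objects[oid]);
      Chunk data = decode(chunks, k, object_size);
      std::copy(bytes.begin(), bytes.end(), data.begin() + offset);
      write_full(osds, oid, data, k);
    }

    int main() {
      const int k = 4;
      std::vector<FakeOSD> osds(k + 1);
      write_full(osds, "foo", Chunk(64, 'a'), k);
      write_partial(osds, "foo", 64, 8, Chunk(4, 'b'), k);
      std::vector<Chunk> chunks;
      for (int i = 0; i < k + 1; ++i)
        chunks.push_back(osds[i].objects["foo"]);
      Chunk data = decode(chunks, k, 64);
      std::cout << std::string(data.begin(), data.begin() + 16) << "\n";  // aaaaaaaabbbbaaaa
      return 0;
    }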
* SJ - interface
* Do we want to restrict the librados writes to just write full? For writes, write full can be implemented much more efficiently than partial writes (no need to read stripes).
* xattr can probably be handled by simply replicating across stripes.
* omap options:
* disable
* erasure code??
* replicate across all stripes - good enough for applications using omap only for limited metadata
* How do we handle object classes? A read might require a round trip to replicas to fulfill, and we probably don't want to block in the object class code during that time. Perhaps we only allow reads from xattrs and omap entries from the object class?
* SJ - random stuff
* PG temp mappings need to be able to specify a primary independently of the acting set order (stripe assignment, really). This is necessary to handle backfilling a new acting[0].
* An osd might have two stripes of the same PG due to a history as below. This could be handled by allowing independent PG objects representing each stripe to coexist on the same OSD.
* [0,3,6]
* [1,3,6]
* [9,3,0]
* hobject_t and associated encodings/stringifications need a stripe field (see the sketch after this list)
* OSD map needs to track stripe as well as pg_t
* split is straightforward -- yay
* changing k, m is not easy
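A hypothetical, simplified sketch of keying PG instances by (pgid, stripe) so that one OSD can hold two stripes of the same PG at once, and of an hobject-like id carrying the stripe field; none of these are the real hobject_t / pg_t definitions:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <tuple>

    struct pg_stripe_key {
      uint64_t pool;
      uint32_t seed;    // pg number within the pool
      int32_t stripe;   // which of the k+m stripes this PG instance holds
      bool operator<(const pg_stripe_key& o) const {
        return std::tie(pool, seed, stripe) < std::tie(o.pool, o.seed, o.stripe);
      }
    };

    struct hobject_like {
      std::string oid;
      uint64_t pool;
      uint32_t hash;
      int32_t stripe;   // the stripe field the note above asks for
    };

    struct PGStripe { /* per-stripe PG state would live here */ };

    int main() {
      // In the [0,3,6] -> [1,3,6] -> [9,3,0] history, osd.0 held stripe 0 in the
      // first interval and is later mapped to stripe 2, so during recovery it may
      // hold both stripes of the same PG; keying by (pool, seed, stripe) allows that.
      std::map<pg_stripe_key, PGStripe> pgs_on_osd0;
      pgs_on_osd0[{1, 42, 0}] = PGStripe{};
      pgs_on_osd0[{1, 42, 2}] = PGStripe{};
      return 0;
    }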
Use cases:
1. write full object
2. append to existing object?
3. pluggable algorithm
4. single-dc store (lower redundancy overhead)
5. geo-distributed store (better durability)
Questions:
object stripe unit size: per-object or per-pool? => may as well be per-object, maybe with a pool (or algorithm) default?
Work items:
clean up OSD -> pg interface
factor out common PG pieces (obc tracking, pg log handling, etc.)
...
profit!