Erasure coded storage backend (step 2) » History » Version 2
Jessica Mack, 06/21/2015 12:28 AM
1 | 1 | Jessica Mack | h1. Erasure coded storage backend (step 2) |
---|---|---|---|
2 | 1 | Jessica Mack | |
3 | 1 | Jessica Mack | h3. Summary |
4 | 1 | Jessica Mack | |
5 | 1 | Jessica Mack | See also the [[|transcript]] of the "session":https://www.youtube.com/watch?v=-K8bSHx7zJ0&t=30m00s and the associated "pad":http://pad.ceph.com/p/osd-erasure-coding. |
6 | 1 | Jessica Mack | As of "Dumpling":http://tracker.ceph.com/versions/160 Ceph keeps copies of objects to not loose them when hardware fails. It is an efficient solution for Ceph block storage but when performances are less important than cost, "erasure codes"http://en.wikipedia.org/wiki/Erasure_code achieves significant savings. For instance, instead of buying 3PB to store 1PB, it is enough to buy 1.5PB and have better resilience. Erasure coding an object works by splitting it in K data chunks ( let say 10 ) and computing M parity chunks ( let say 3 ). By storing each of the 13 chunks on different disks, it is possible to rebuild the original object even if 3 of them fail. It is better than the resilience provided by three copies and requires only half the space, but it comes at a price : it uses more CPU, bandwidth and is more complex to design. |
7 | 1 | Jessica Mack | The use cases of "Genomic data":http://www.annaisystems.com/, "backups":http://dachary.org/?p=2087 or others requiring cheap hardware, extremely resilient storage, written once or twice and rarely read are perfect matches for an erasure coded storage backend. Ceph was originaly designed to include "RAID4":https://github.com/ceph/ceph/blob/71531edd8a645a249f5edc984e5d8be85317abf7/src/osd/OSD.cc#L1356 as an alternative to replication and the work suspended during years was resumed after the [[Erasure encoding as a storage backend|first Ceph summit in May 2013.]] |
8 | 1 | Jessica Mack | |
9 | 1 | Jessica Mack | h3. Owners |
10 | 1 | Jessica Mack | |
11 | 1 | Jessica Mack | * Loic Dachary |
12 | 1 | Jessica Mack | * Samuel Just (Inktank) |
13 | 1 | Jessica Mack | |
14 | 1 | Jessica Mack | h3. Current Status |
15 | 1 | Jessica Mack | |
16 | 2 | Jessica Mack | * Scheduled for "Firefly ( Feb 2014 )":http://www.redhat.com/en/technologies/storage/ceph |
17 | 1 | Jessica Mack | * Loic Dachary full time developer |
18 | 2 | Jessica Mack | * Samuel Just "specifications":https://github.com/ceph/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst and code review |
19 | 1 | Jessica Mack | * Placement group code refactoring being worked on |
20 | 2 | Jessica Mack | * High level design "being worked on":https://github.com/ceph/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst with a "PGBackend interface":https://github.com/ceph/ceph/blob/wip-erasure-coding-doc/src/osd/PGBackend.h |
21 | 2 | Jessica Mack | * "Erasure code API design":https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#erasure-code-library-abstract-api complete |
22 | 1 | Jessica Mack | |
23 | 1 | Jessica Mack | h3. Detailed Description |
24 | 1 | Jessica Mack | |
25 | 1 | Jessica Mack | Design Doc: https://github.com/ceph/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst |
26 | 1 | Jessica Mack | Currently wip PGBackend interface: https://github.com/ceph/ceph/blob/wip-erasure-coding-doc/src/osd/PGBackend.h |
27 | 2 | Jessica Mack | Developer notes : https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst |
28 | 1 | Jessica Mack | Erasure coding or replication ( pg_pool_t::TYPE_REP ) is a property of a pool and is currently implemented PG.{cc,h} and ReplicatedPG.{cc,h}. The implementation of erasure coding requires a refactoring to allow multiple backends. The first step is to factor out code from ReplicatedPG that is independant of the resilience policy. |
29 | 1 | Jessica Mack | * PGLog is mostly done, it includes missing set as well as the logic to merge during recovery. It already is in master but does not yet have a sensible API. |
30 | 2 | Jessica Mack | * PGRegistry groups the ObjectContext, SnapSetContext and Watchers registry. The API is mostly good and the implementation is being cleaned up. The erasure code backend won't use the SnapSetContext because it is not supported. PG knows nothing about this registry. Once the #5510 "pull request":https://github.com/ceph/ceph/pull/414 is merged, it will be ready to be moved to a PGBackend base class. |
31 | 1 | Jessica Mack | |
32 | 2 | Jessica Mack | Other parts of the code need to be reworked so that ReplicatedPG only contains the part specific to replication. Some code needs to be moved to or from ReplicatedPG and PG. If moving the code is not enough, it will be refactored to isolate the parts that belong to ReplicatedPG only. The "design document":https://github.com/ceph/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst contains an inventory of what needs to be abstracted to PGBackend and the associated rationale. |
33 | 1 | Jessica Mack | Once the ReplicatedPG implementation for a given concept ( backfill, recovery, scrub, peering, read, write ) is reasonably free of code that is common to both erasure coding and replication, the PG implementation will be severed from the ReplicatedPG sub class and changed to use a PGBackend data member and will not be aware of the resilience policy. |
34 | 1 | Jessica Mack | The erasure code plugin library can be implemented with the following API and the jerasure library: |
35 | 2 | Jessica Mack | * erasure_coding_t "*":https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#id1 create_erasure_code_context(g_conf) |
36 | 1 | Jessica Mack | return an object configured to encode and decode according to a given algorithm and a given set of parameters as specified in g_conf. Parameters must be prefixed with erasure-code to avoid name collisions |
37 | 2 | Jessica Mack | <pre>ceph osd pool set-erasure-code <pool> m 10 ceph osd pool set-erasure-code <pool> k 3 ceph osd pool set-erasure-code <pool> algorithm Reed-Solomon</pre> |
38 | 1 | Jessica Mack | |
39 | 1 | Jessica Mack | The following are methods of erasure_coding_t. |
40 | 1 | Jessica Mack | * set<int> minimum_to_decode(const set<int> &want_to_read, const set<int> &available_chunks); |
41 | 1 | Jessica Mack | returns the smallest subset of available_chunks that needs to be retrieved in order to successfully decode want_to_read chunks. |
42 | 1 | Jessica Mack | * set<int> minimum_to_decode_with_cost(const set<int> &want_to_read, const map<int, int> &available) |
43 | 1 | Jessica Mack | returns the minimum cost set required to read the specified chunks given a mapping of available chunks to costs. The costs might allow to consider the difference between reading local chunks vs remote chunks. |
44 | 1 | Jessica Mack | * map<int, buffer> encode(const set<int> &want_to_encode, const buffer &in) |
45 | 1 | Jessica Mack | encode the content of in and return a map associating the chunk number with its encoded content. The map only contains the chunks contained in the want_to_encode set. For instance, in the simplest case M=2,K=1 for a buffer containing AB, calling |
46 | 2 | Jessica Mack | <pre>encode([1,2,3], 'AB') => { 1 => 'A', 2 => 'B', 3 => 'Z' }</pre> |
47 | 1 | Jessica Mack | If only the parity chunk is of interest, calling |
48 | 2 | Jessica Mack | <pre>encode([3], 'AB') => { 3 => 'Z' }</pre> |
49 | 1 | Jessica Mack | * map<int, buffer> decode(const set<int> &want_to_read, const map<int, buffer> &chunks) |
50 | 1 | Jessica Mack | decode chunks to read the content of the want_to_read chunks and return a map associating the chunk number with its decoded content. For instance, in the simplest case M=2,K=1 for an encoded payload of data A and B with parity Z, calling |
51 | 2 | Jessica Mack | <pre>decode([1,2], { 1 => 'A', 2 => 'B', 3 => 'Z' }) => { 1 => 'A', 2 => 'B' }</pre> |
52 | 1 | Jessica Mack | If however, the chunk B is to be read but is missing it will be: |
53 | 2 | Jessica Mack | <pre>decode([2], { 1 => 'A', 3 => 'Z' }) => { 2 => 'B' }</pre> |
54 | 1 | Jessica Mack | |
55 | 1 | Jessica Mack | h3. Work items |
56 | 1 | Jessica Mack | |
57 | 1 | Jessica Mack | h3. Coding tasks |
58 | 1 | Jessica Mack | |
59 | 2 | Jessica Mack | The following list of tasks is either an issue item in the "Ceph tracker":http://tracker.ceph.com/ or a a link to the "erasure code design document":https://github.com/ceph/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst in which links to the tracker can be found for individual tasks. All tasks can be displayed from the "Erasure encoded placement group":http://tracker.ceph.com/issues/4929 feature request. |
60 | 1 | Jessica Mack | # Common PG components refactoring: |
61 | 2 | Jessica Mack | ## "Factor out PG logs":http://tracker.ceph.com/issues/5487 (difficulty: medium) |
62 | 2 | Jessica Mack | ## "Factor out ObjectContext SnapSetContext":http://tracker.ceph.com/issues/5487 (difficulty: medium) |
63 | 2 | Jessica Mack | # "Factor out object writing/replication logic":http://tracker.ceph.com/issues/5433 |
64 | 2 | Jessica Mack | ## "Peering and PG logs":https://github.com/dachary/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#peering-and-pg-logs (difficulty: hard) |
65 | 2 | Jessica Mack | ## "Distinguished acting set positions":https://github.com/dachary/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions (difficulty: hard) |
66 | 2 | Jessica Mack | ## "Scrub":https://github.com/dachary/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#scrub (difficulty: hard) |
67 | 2 | Jessica Mack | ## "Recovery":https://github.com/dachary/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#recovery (difficulty: hard) |
68 | 2 | Jessica Mack | ## "Backfill":https://github.com/dachary/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#backfill (difficulty: hard) |
69 | 2 | Jessica Mack | # "Plugable erasure coding library":http://tracker.ceph.com/issues/5877 |
70 | 2 | Jessica Mack | ## "plugin mechanism and abstract API":http://tracker.ceph.com/issues/5878 (difficulty: easy) |
71 | 2 | Jessica Mack | ## "jerasure plugin":http://tracker.ceph.com/issues/5879 (difficulty: easy) |
72 | 2 | Jessica Mack | ## "erasure coding library plugin API documentation, including a simple example":http://tracker.ceph.com/issues/5880 (difficulty: easy) |
73 | 1 | Jessica Mack | # Write integration tests https://github.com/ceph/ceph-qa-suite |
74 | 1 | Jessica Mack | |
75 | 1 | Jessica Mack | h3. Build / release tasks |
76 | 1 | Jessica Mack | |
77 | 1 | Jessica Mack | # Add a wip-erasure-code branch to http://github.com/ceph/ceph/ |
78 | 1 | Jessica Mack | # Schedule the branch to be run by teuthology http://ceph.com/gitbuilder.cgi |
79 | 1 | Jessica Mack | # Watch the tests results on a daily basis and fix the problems |
80 | 1 | Jessica Mack | |
81 | 1 | Jessica Mack | h3. Documentation tasks |
82 | 1 | Jessica Mack | |
83 | 2 | Jessica Mack | # "Erasure code internal documentation":https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst |
84 | 2 | Jessica Mack | # "erasure code design document":https://github.com/ceph/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst |
85 | 2 | Jessica Mack | # "erasure coding library plugin API documentation, including a simple example":http://tracker.ceph.com/issues/5880 |