h1. 1C - Erasure Encoding as a Storage Backend

h3. Live Pad

The live pad can be found here: "pad":http://pad.ceph.com/p/Erasure_encoding_as_a_storage_backend

h3. Summit Snapshot

Erasure encoded placement group / pool

* PG/ReplicatedPG API
** The goal is not to factor out a base class from which an ErasureEncodedPG could be derived; it is to reverse engineer the PG API
** PG/ReplicatedPG are really a single class, although they grew from two different classes back when RAID4 was to be implemented: the difference between the two gradually disappeared
** Define an API class (IPG?) for PG/ReplicatedPG (a rough sketch follows this list)
** Change the code using PG/ReplicatedPG to use the API class rather than the actual PG/ReplicatedPG classes
*** this may involve modifying the code of the calling classes to use accessors when data members are referenced
*** the callers are not otherwise modified, to minimize the change
*** it is assumed that the API is defined by what is already used; no attempt is made to improve it
** Tests are written for the API to cover 100% of the LOC and most of the expected functionality implemented by PG/ReplicatedPG.
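Below is a minimal sketch of what such an extracted API class could look like, assuming the interface is reverse engineered from what callers actually use. The names (IPG, OpRequestStub, PGInfoStub and the listed methods) are illustrative placeholders, not the actual Ceph classes.

<pre><code class="cpp">
// Hypothetical sketch only: an "IPG" interface reverse engineered from what
// callers (OSD, peering machinery) actually use. Names are illustrative and
// do not correspond to the real Ceph classes.
#include <memory>
#include <string>

struct OpRequestStub {};                  // stand-in for an op request
struct PGInfoStub { std::string pgid; };  // stand-in for pg info

class IPG {
public:
  virtual ~IPG() = default;

  // Client I/O entry point used by the OSD.
  virtual void do_request(std::shared_ptr<OpRequestStub> op) = 0;

  // Peering / map-change notifications forwarded by the OSD.
  virtual void on_osdmap_change(unsigned new_epoch) = 0;

  // Accessors replacing direct data-member access by callers.
  virtual const PGInfoStub& info() const = 0;
  virtual bool is_active() const = 0;
};

// Both ReplicatedPG and a future ErasureCodedPG would implement IPG;
// callers are changed to hold an IPG pointer instead of PG/ReplicatedPG.
</code></pre>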
* Factor reusable components out of PG/ReplicatedPG and have PG/ReplicatedPG and ErasureCodedPG share only those components and a common PG API.
** Advantages:
*** We constrain the PG implementations less while still allowing reuse of some of the common logic.
*** Individual components can be tested without needing to instantiate an entire PG.
*** We will realize benefits from better testing as each component is factored out, independently of implementing ErasureCodedPG.
** Some possible common components:
*** Peering State Machine: currently this is tightly coupled with the PG class. Instead, it becomes a separate component responsible for orchestrating the peering process with a PG implementation via the PG interface. This would allow us to test specific behavior without creating an OSD or a PG (a rough sketch follows this list).
*** ObjectContexts, object context tracking: this probably includes read/write lock tracking for objects
*** Repop state?: not sure about this one, might be too different to generalize between ReplicatedPG and ErasureCodedPG
*** PG logs, PG missing: the logic for merging an authoritative PG log with another PG log while filling in the missing set would benefit massively from being testable separately from a PG instance. It's possible that the stripes involved in ErasureCodedPG will make this impractical to generalize.
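As a rough illustration of the factoring idea, here is a sketch of a peering state machine driven through a narrow listener interface so it can be unit tested without an OSD or a full PG. All names (PeeringListener, PeeringStateMachine and their methods) are invented for this example and are not the actual Ceph types.

<pre><code class="cpp">
// Hypothetical sketch only: the peering state machine talking to a PG
// implementation through a small interface, so it can be driven by unit
// tests without instantiating an OSD or a full PG.
#include <vector>

struct PeeringEventStub { int type = 0; };   // stand-in for a peering event

// What the state machine needs from its host PG.
class PeeringListener {
public:
  virtual ~PeeringListener() = default;
  virtual std::vector<int> get_acting_set() const = 0;   // OSD ids
  virtual void send_query(int osd, int query_type) = 0;  // ask a peer for info/log
  virtual void on_active() = 0;                          // peering finished
};

// The extracted component: owns the peering state, not the PG internals.
class PeeringStateMachine {
public:
  explicit PeeringStateMachine(PeeringListener& l) : listener(l) {}
  void handle_event(const PeeringEventStub& evt) {
    // ...advance the state machine, calling back into the listener...
    (void)evt;
  }
private:
  PeeringListener& listener;
};
</code></pre>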
* To isolate Ceph from the actual library being used (zfec, fecpp, ...), a wrapper around the erasure encoding library is implemented. Each block is encoded into k data blocks and m parity blocks (a rough interface sketch follows this list)
** encode(void* data, k, m) => void* data[k], void* parity[m]
** decode(void* data[k], void* parity[m]) => void* data
** repair(void* data[k], void* parity[m], indices_of_damaged_blocks[]) => void* data
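A minimal sketch of such a wrapper is shown below, assuming std::string stands in for a Ceph bufferlist and that each concrete backend (zfec, fecpp, ...) subclasses the wrapper; the class and method names are illustrative only.

<pre><code class="cpp">
// Hypothetical sketch only: a thin wrapper hiding the chosen erasure coding
// library behind the three calls listed above.
#include <set>
#include <string>
#include <vector>

class ErasureCoder {
public:
  ErasureCoder(unsigned k, unsigned m) : k(k), m(m) {}
  virtual ~ErasureCoder() = default;

  // encode(data, k, m) => k data blocks + m parity blocks
  virtual int encode(const std::string& data,
                     std::vector<std::string>* data_blocks,
                     std::vector<std::string>* parity_blocks) = 0;

  // decode(data[k], parity[m]) => original data (all blocks present)
  virtual int decode(const std::vector<std::string>& data_blocks,
                     const std::vector<std::string>& parity_blocks,
                     std::string* data) = 0;

  // repair(data[k], parity[m], damaged indices) => rebuilt data
  virtual int repair(const std::vector<std::string>& data_blocks,
                     const std::vector<std::string>& parity_blocks,
                     const std::set<unsigned>& damaged,
                     std::string* data) = 0;

protected:
  unsigned k, m;
};
</code></pre>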
* The ErasureEncodePG configuration is set to encode each object into k data objects and m parity objects (a small worked example follows this list).
** It uses the parity ('INDEP') crush mode so that placement is intelligent. The indep placement avoids moving a shard between ranks: a mapping of [0,1,2,3,4] will change to [0,6,2,3,4] (or something) if osd.1 fails, and the shards on 2,3,4 won't need to be copied around.
** The ErasureEncodedPG uses k + m OSDs, numbered D0 .. Dk-1 and C0 ... Cm-1
** Each object is a strip
** Each stripe has a fixed size of B bytes
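As a small worked example of this layout, assuming k = 4, m = 2 and a stripe size of B = 4096 bytes split evenly across the k data strips, the snippet below maps a byte offset to its stripe and strip and computes the storage overhead. The numbers are illustrative only.

<pre><code class="cpp">
// Illustrative arithmetic only, assuming k = 4 data strips, m = 2 parity
// strips and a stripe size of B = 4096 bytes shared by the k data strips.
#include <cstdio>

int main() {
  const unsigned k = 4, m = 2, B = 4096;
  const unsigned strip_size = B / k;             // bytes per strip: 1024

  // For a byte at a given object offset, find its stripe and strip.
  unsigned offset = 10000;
  unsigned stripe = offset / B;                  // stripe 2
  unsigned strip  = (offset % B) / strip_size;   // strip 1 within that stripe

  std::printf("offset %u -> stripe %u, strip D%u\n", offset, stripe, strip);
  std::printf("storage overhead: (k+m)/k = %.2fx\n", double(k + m) / k);
  return 0;
}
</code></pre>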
* ErasureEncodedPG implementation (a rough sketch of the partial-write flow follows this list)
** Write offset, length
*** read the stripes containing offset, length
*** for each stripe, decode(void* data[k], void* parity[m]) => void* data and append to a bufferlist
*** modify the bufferlist with the write request
*** encode(void* data, k, m) => void* data[k], void* parity[m]
*** write data[0] to D0, data[1] to D1 ... data[k-1] to Dk-1 and parity[0] to C0 ... parity[m-1] to Cm-1
** Read offset, length
*** read the stripes containing offset, length
*** for each stripe, decode(void* data[k], void* parity[m]) => void* data and append to a bufferlist
** Object attributes
*** duplicate the object attributes on each OSD
** Scrubbing
*** for each object, read each stripe and write back if a repair was necessary
** Repair
*** when an OSD is decommissioned and another OSD replaces it, for each object contained in an ErasureEncodedPG using this OSD, read the object, repair each stripe and write back the strip that resides on the new OSD
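The sketch below illustrates the partial-write (read-modify-write) flow described above: read the affected stripes, decode each into a bufferlist, apply the write, then re-encode and write the strips back to D0..Dk-1 and C0..Cm-1. The helpers (read_strip, write_strip) and the coder type are invented stand-ins, not actual Ceph code, and std::string stands in for a bufferlist.

<pre><code class="cpp">
// Hypothetical sketch only of the partial-write (read-modify-write) flow.
#include <string>
#include <vector>

struct ErasureCoderStub {
  void decode(const std::vector<std::string>& d, const std::vector<std::string>& p,
              std::string* out);
  void encode(const std::string& in, std::vector<std::string>* d,
              std::vector<std::string>* p);
};

std::string read_strip(unsigned osd, unsigned stripe);            // from D0..Dk-1 / C0..Cm-1
void write_strip(unsigned osd, unsigned stripe, const std::string& buf);

void write_offset_length(ErasureCoderStub& coder, unsigned k, unsigned m,
                         unsigned B, unsigned offset, const std::string& payload) {
  unsigned first = offset / B;
  unsigned last  = (offset + payload.size() - 1) / B;

  std::string buf;                                    // decoded stripes
  for (unsigned s = first; s <= last; ++s) {
    std::vector<std::string> data(k), parity(m);
    for (unsigned i = 0; i < k; ++i) data[i]   = read_strip(i, s);
    for (unsigned j = 0; j < m; ++j) parity[j] = read_strip(k + j, s);
    std::string plain;
    coder.decode(data, parity, &plain);
    buf += plain;                                     // append to the "bufferlist"
  }

  buf.replace(offset - first * B, payload.size(), payload);   // apply the write

  for (unsigned s = first; s <= last; ++s) {          // re-encode and write back
    std::vector<std::string> data, parity;
    coder.encode(buf.substr((s - first) * B, B), &data, &parity);
    for (unsigned i = 0; i < k; ++i) write_strip(i, s, data[i]);
    for (unsigned j = 0; j < m; ++j) write_strip(k + j, s, parity[j]);
  }
}
</code></pre>

Notably, a full-object write (write full) could skip the read and decode steps entirely, which is the motivation behind the interface question below about restricting librados writes to write full.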
* SJ - interface
** Do we want to restrict the librados writes to just write full? For writes, write full can be implemented much more efficiently than partial writes (no need to read stripes).
** xattr can probably be handled by simply replicating across stripes.
** omap options:
*** disable
*** erasure code??
*** replicate across all stripes - good enough for applications using omap only for limited metadata
** How do we handle object classes? A read might require a round trip to replicas to fulfill; we probably don't want to block in the object class code during that time. Perhaps we only allow reads from xattrs and omap entries from the object class?
* SJ - random stuff
** PG temp mappings need to be able to specify a primary independently of the acting set order (stripe assignment, really). This is necessary to handle backfilling a new acting[0].
** An OSD might have two stripes of the same PG due to a history as below. This could be handled by allowing independent PG objects representing each stripe to coexist on the same OSD.
*** [0,3,6]
*** [1,3,6]
*** [9,3,0]
** hobject_t and associated encodings/stringifications need a stripe field (a rough sketch follows this list)
** OSD map needs to track stripe as well as pg_t
** split is straightforward -- yay
** changing m,n is not easy
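Purely as an illustration of the "stripe field" idea (not the actual Ceph types), the sketch below shows an object id and an OSD-local PG key that both carry a strip index, which is what would let two strips of the same PG coexist on one OSD as distinct entities.

<pre><code class="cpp">
// Hypothetical illustration only: these are not the real Ceph types.
#include <cstdint>
#include <string>
#include <tuple>

struct hobject_stub_t {
  std::string oid;          // object name
  uint64_t    pool = 0;
  uint32_t    hash = 0;
  int8_t      stripe = -1;  // -1 = replicated (no stripe), >= 0 = strip index

  bool operator<(const hobject_stub_t& o) const {
    return std::tie(pool, hash, oid, stripe) <
           std::tie(o.pool, o.hash, o.oid, o.stripe);
  }
};

struct spg_stub_t {         // pg id + strip index, used as the OSD-local PG key
  uint64_t pool = 0;
  uint32_t seed = 0;
  int8_t   stripe = -1;
};
</code></pre>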

Use cases:

# write full object
# append to existing object?
# pluggable algorithm
# single-dc store (lower redundancy overhead)
# geo-distributed store (better durability)

Questions:

p((. object stripe unit size: per-object or per-pool? => may as well be per-object, maybe with a pool (or algorithm) default?

Work items:

p((. clean up OSD -> pg interface
    factor out common PG pieces (obc tracking, pg log handling, etc.)
    ...
    profit!