h1. Erasure encoding as a storage backend

h3. Summary

For many data sets, the 3x replication in Ceph is perfectly acceptable. However, for very large data sets (such as those in genomics) the 3x replication starts impacting business models, even assuming the continued reduction in HDD prices. Erasure-encoded, sharded stores should address this issue.

!erasure.pg!
h3. Owners

* Christopher Liljenstolpe (Annai Systems) <cdl@asgaard.org>
* Dan Maltbie (Annai Systems) <danm@annaisystems.com>
* Loic Dachary <loic@dachary.org>
h3. Interested Parties

* Greg Farnum
* Sage Weil
* Sam Just
* Loïc Dachary
* Josh Durgin
* Xiaoxi Chen
* Zhiteng Huang
* Xiaobing Zhou (xzhou40 (AT) hawk.iit.edu)
* Yuan Zhou (yuan.zhou at intel dot com)
* Jianhui Yuan
* Da Chun (ngugc at qq.com)
h3. Current Status

Not currently accepted. We are proposing this as a development effort. At this time, Annai is more than willing to help with this, but we don't have the resources (including Ceph coders) to materially contribute code. We are "interested" users.
h3. Discussions (chronological order)

* erasure encoding ( sorry ) http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/14459
* fecpp C++ forward error correction library http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/14560
* Distributed Deduplication in a Cloud-based Object Storage System http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/14609
* OSD abstract class http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/14627
* PG statechart http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/14678
* Erasure encoding as a storage backend http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/14866
h3. Detailed Description

h3. Acknowledgements

p=. _First of all, we want to acknowledge that the ideas presented here have been independently tabled by Stephen Perkins and others on the ceph-devel mailing list._
h3. Reasoning

For a class of users, there is a requirement for very durable data, but the cost of meeting that durability by using Nx replication becomes cost prohibitive if the size of the data to be stored is large. An example use case where this is an issue is genomic data. With 3x replication, a 20PB genomic repository requires spinning 60+PB of disk. However, the other features of Ceph are very attractive (such as using a common storage infrastructure for objects, blocks, and filesystems, CRUSH, self-healing infrastructure, and non-disruptive scaling).
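As a rough back-of-the-envelope sketch of this overhead argument (using the n = 10, m = 15 parameters from the example in the proposed model below; a real deployment would pick its own code parameters):

<pre><code class="python">
# Raw capacity needed for a 20 PB repository under 3x replication versus an
# n-of-m erasure code (n shards needed to reconstruct, m shards stored).
usable_pb = 20
replicas = 3
n, m = 10, 15            # illustrative parameters, taken from the example below

raw_replicated = usable_pb * replicas       # 60 PB of raw disk
raw_erasure = usable_pb * m / float(n)      # 30 PB of raw disk for the same data

print("3x replication  : %d PB raw" % raw_replicated)
print("%d-of-%d erasure: %.0f PB raw" % (n, m, raw_erasure))
</code></pre>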
h3. Use Cases

Annai Systems operates an active OpenStack compute and storage fabric and needs to move to a more scalable storage system than it has today (currently 700 TB, scaling to tens of PB in the immediate future). It therefore needs a solution that will reduce the durability overhead in its fabric.
*Two-"tiered" storage*

This model is related to the Azure storage model, as detailed in https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf. However, it is somewhat different.

What is proposed here is that there are two pools of storage, but only the first pool is seen by "normal" clients. That pool is "standard" Ceph, with object replication as usual. As an OSD's used storage reaches a high-water mark, another process "demotes" one or more objects (until a low-water mark is satisfied) to the second tier, replacing each object with a "redirect object". That second tier is an erasure-encoded pool. If a "normal" client requests such an object, the "redirect object" signals that the object should be re-assembled from the erasure-encoded backing store (the second "tier"). The object is then placed in the first tier and the client's request is satisfied. At that point, the erasure-encoded version of the object could be deleted, or could be left in place.

It is worthwhile to note that each OSD would perform this step independently. As all replicas of an object would erasure-encode to the same shards and be placed in the same locations in the second tier (thanks to CRUSH), the sharded data would not itself be replicated. The primary store can be seen as a cache for the backing (erasure-encoded) store.
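As a rough illustration of the demotion/promotion flow just described, here is a minimal Python sketch. Everything in it (the TieredStore class, the watermark defaults, the way a redirect is represented) is hypothetical and stands in for real OSD/pool machinery; in particular, "demotion to tier 2" would really mean erasure-encoding and placing shards, not a dictionary copy.

<pre><code class="python">
class TieredStore(object):
    """Toy model of the two-tier flow: a replicated tier 1 with high/low
    water marks, an erasure-coded tier 2, and redirect markers."""

    def __init__(self, capacity, high_water=0.80, low_water=0.60):
        self.capacity = capacity      # bytes available in the replicated tier
        self.high_water = high_water  # demotion starts above this fraction
        self.low_water = low_water    # demotion stops below this fraction
        self.used = 0
        self.tier1 = {}               # name -> bytes (replicated pool)
        self.tier2 = {}               # name -> bytes (stands in for EC shards)
        self.redirects = set()        # names whose data now lives in tier 2
        self.last_access = {}
        self.clock = 0

    def _touch(self, name):
        self.clock += 1
        self.last_access[name] = self.clock

    def put(self, name, data):
        self.used += len(data) - len(self.tier1.get(name, b""))
        self.tier1[name] = data
        self._touch(name)
        self._maybe_demote()

    def _maybe_demote(self):
        # Once usage crosses the high-water mark, demote least-recently-used
        # objects until usage falls below the low-water mark, leaving a
        # redirect marker in place of each demoted object.
        if self.used <= self.capacity * self.high_water:
            return
        while self.used > self.capacity * self.low_water and self.tier1:
            victim = min(self.tier1, key=lambda n: self.last_access[n])
            data = self.tier1.pop(victim)
            self.used -= len(data)
            self.tier2[victim] = data      # really: shard and place in pool E
            self.redirects.add(victim)     # really: write the redirect object

    def get(self, name):
        # A read that hits a redirect re-assembles the object from tier 2,
        # puts it back in tier 1, and then satisfies the request.
        if name in self.redirects:
            self.redirects.discard(name)
            self.put(name, self.tier2[name])
        self._touch(name)
        return self.tier1[name]
</code></pre>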
*Single-"tier" storage*

In this model, all data would be immediately erasure-encoded and stored. This would replace the current OSD replication model in all cases with an erasure-encoding backend.
h3. Erasure Encoded Storage

# Ceph documentation
** Peering http://ceph.com/docs/master/dev/peering/
# References
** Facebook’s advanced erasure codes http://storagemojo.com/2013/06/21/facebooks-advanced-erasure-codes/
** XORing Elephants: Novel Erasure Codes for Big Data http://anrg.usc.edu/~maheswaran/Xorbas.pdf
** In-depth discussion of erasure coding (section 2): "Erasure Coding vs. Replication: A Quantitative Comparison":http://oceanstore.cs.berkeley.edu/publications/papers/pdf/erasure_iptps.pdf
** A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries For Storage http://www.usenix.org/events/fast09/tech/full_papers/plank/plank.pdf
** Erasure Coding in Windows Azure Storage https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf
** Optimizing Forward Error Correction Coding Using SIMD Instructions http://www.randombit.net/bitbashing/2009/01/19/forward_error_correction_using_simd.html
** How Good is Random Linear Coding Based Distributed Networked Storage? http://netcod.org/papers/11AcedanskiDMK-final.pdf
** Regenerating Codes: A System Perspective http://hal.archives-ouvertes.fr/docs/00/76/42/62/PDF/1204.5028v2.pdf
** Erasure Coding for Distributed Storage http://storagewiki.ece.utexas.edu/doku.php?id=start
** Rethinking Erasure Codes for Cloud File Systems: Minimizing I/O for Recovery and Degraded Reads http://static.usenix.org/event/fast12/tech/full_papers/Khan.pdf
# Software
** Xorbas Hadoop Project Page, Locally Repairable Codes (LRC) http://smahesh.com/HadoopUSC/
** GF-Complete: A Comprehensive Open Source Library for Galois Field Arithmetic http://web.eecs.utk.edu/~plank/plank/www/software.html http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar
** zfec, a fast erasure codec which can be used with the command line, C, Python, or Haskell https://pypi.python.org/pypi/zfec
** FECpp, a BSD-licensed C++ forward error correction library based on fec by Luigi Rizzo http://www.randombit.net/code/fecpp/
** The HoloStor library for high-performance erasure-correcting encode/decode http://code.google.com/p/holostor/
h3. Functional Approach

Below we outline a few possible paths to achieve the goal of an erasure-encoded Ceph backend. We are leaning toward the staged approach models, but are putting all three paths that we have come up with in this document. Between the two staged models proposed below, we can see a use case for both, and do not find them to be mutually exclusive.
*Common Considerations*

Here are some design points that we believe are common to the introduction of any sharding or erasure-encoding storage backend for Ceph and that would need to be decided on before an actual design could be completed. We have identified our preferred approach, with reasoning, but we are sure we have missed things here.
*Who does the sharding and placement?*

One approach would be for the primary OSD for a given object to erasure-encode the object and run CRUSH for the entire object with {num_replicas} set to the desired k of the erasure-encoding function. The OSD would store one of the shards in each OSD identified by CRUSH. Each OSD could become "primary" as in existing Ceph, and whichever OSD was "primary" would be responsible for re-calculating a missing or damaged shard when one is discovered by a deep scrub. Any OSD in that object's PG (the PG now encloses the shards for a given object) could assemble the object if requested by a client.

Another approach would be to have another process run on the OSD platform that would, given a set of inputs, take objects and shard them, sending the shards to a set of OSDs. In this case, this other process would be responsible for the sharding and re-assembly, as necessary. Examples of this input could be the high-water/low-water marks mentioned in the two-tiered storage model above.

We now believe that the sharding OSD is the best approach.
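A toy sketch of the first approach (the primary OSD shards the object itself and hands one shard to each OSD chosen by CRUSH). The single XOR parity shard and the hash-based placement below are deliberate simplifications standing in for a real n-of-m erasure code and for CRUSH; none of the names here are existing Ceph interfaces.

<pre><code class="python">
import hashlib

def shard_object(name, data, k, osds):
    """Split 'data' into k equal data shards plus one XOR parity shard, then
    assign one shard per OSD.  Any k of the k+1 shards can rebuild the object
    (a lost data shard is the XOR of the parity with the surviving data
    shards); a real system would use a proper erasure code and CRUSH."""
    size = max(1, -(-len(data) // k))              # ceil(len/k)
    data = data.ljust(size * k, b"\0")             # pad so all shards are equal
    shards = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(size)
    for s in shards:
        parity = bytes(a ^ b for a, b in zip(parity, s))
    shards.append(parity)

    # Deterministic placement: pick a starting OSD from the object name and
    # put one shard on each successive OSD (a crude stand-in for running
    # CRUSH with num_replicas equal to the total shard count).
    start = int(hashlib.sha1(name.encode()).hexdigest(), 16) % len(osds)
    return [(i, osds[(start + i) % len(osds)], shard)
            for i, shard in enumerate(shards)]

placements = shard_object("example", b"genomic object payload", k=4,
                          osds=["osd.%d" % i for i in range(6)])
</code></pre>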
*Redirect object*

In the two-tiered model, when an object is "evicted" from the primary pool, a "redirect object" could be placed in the filesystem to indicate that the object now lives in the erasure-encoded backing store. That "redirect object" could be a file with the same object name but different metadata, or it could be a file with the same object name but with, perhaps, the sticky bit set (or some other file attribute that does not have another meaning to Ceph).

Either the OSD would have to understand the meaning of that "redirect object" file and know what to do when it processes one, or some other agent would need to understand and act transparently on the "redirect object" such that the OSD only ever sees an ordinary object file.

One possibility would be to run a "pass-through" FUSE filesystem as the filesystem the OSD stores to. In most cases, that FUSE filesystem would simply pass on the filesystem operations that the OSD is making. However, if the file being operated on is a redirect, it would instead retrieve the file from the erasure-encoded backing store and replace the redirection with the actual file, returning the actual file to the OSD. In this case, the primary-pool OSD would never have to know about the erasure-encoded shards, and therefore older versions of OSDs could interoperate in an environment where both erasure-encoded backing stores and current-mode Ceph operate.

We no longer see this as the best path forward, as it involves more "moving parts" and does not have a substantial benefit.
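A minimal sketch of the "redirect object" marker idea, using the sticky-bit option mentioned above (Linux/Unix only). The fetch_from_backing_store hook is hypothetical, and a real implementation would live inside the OSD or the pass-through filesystem rather than in helpers like these.

<pre><code class="python">
import os
import stat

def is_redirect(path):
    """An object file with the sticky bit set is treated as a redirect
    (one of the marker options discussed above)."""
    return bool(os.stat(path).st_mode & stat.S_ISVTX)

def make_redirect(path):
    """Replace an object file with an empty sticky-bit marker once its data
    has been demoted to the erasure-encoded backing store."""
    with open(path, "wb"):
        pass                                       # truncate: data lives in tier 2 now
    os.chmod(path, stat.S_IMODE(os.stat(path).st_mode) | stat.S_ISVTX)

def read_object(path, fetch_from_backing_store):
    """Return the object's bytes, transparently promoting it if redirected.
    fetch_from_backing_store(path) is a hypothetical hook that re-assembles
    the object from its erasure-coded shards."""
    if is_redirect(path):
        data = fetch_from_backing_store(path)
        with open(path, "wb") as f:                # replace the marker with the data
            f.write(data)
        os.chmod(path, stat.S_IMODE(os.stat(path).st_mode) & ~stat.S_ISVTX)
        return data
    with open(path, "rb") as f:
        return f.read()
</code></pre>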
*Erasure coded pool*

A new pool type is introduced that encodes all objects using erasure codes instead of replicating them. It only allows creating an object, writing it sequentially, and making it read-only once it has been written in full. A later write would result in COW-like behavior.
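To make the intended write discipline concrete, here is a small sketch of the object life cycle this pool type implies (create, sequential append, seal to read-only, and copy-on-write for any later write). The class and its methods are purely illustrative, not an existing Ceph type.

<pre><code class="python">
class ECObject(object):
    """Illustrative life cycle for an object in an erasure-coded pool:
    sequential appends only, sealed (read-only) once written in full, and
    any later write produces a new copy-on-write generation instead of
    mutating the sealed data."""

    def __init__(self, name, generation=0):
        self.name = name
        self.generation = generation
        self.chunks = []
        self.sealed = False

    def append(self, data):
        if self.sealed:
            raise IOError("%s is sealed; later writes go through write()" % self.name)
        self.chunks.append(data)          # writes are sequential only

    def seal(self):
        # The object has been written in full and becomes read-only; this is
        # the point at which it would actually be erasure-encoded into shards.
        self.sealed = True

    def write(self, data):
        # Before sealing this is just the next sequential write; afterwards it
        # behaves like COW: a new generation is built from the sealed content
        # plus the new data, and the caller works with the clone from then on.
        if not self.sealed:
            self.append(data)
            return self
        clone = ECObject(self.name, self.generation + 1)
        clone.chunks = list(self.chunks) + [data]
        return clone
</code></pre>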
*Staged, multiple pools*

Pool R is replicated, pool E is erasure coded. When a condition is met (the object has not been modified for a long time, for instance), the object O from pool R is copied to pool E. A redirect object replaces the original object. When a write operation reaches the redirect object, the object O from pool E is copied back to pool R. The object O may or may not be deleted from pool E. If it is not, and the object O is not modified in pool R, then it can be "copied" to pool E again under the above conditions, only in this case the "copy" is a NOOP as the object O already exists in pool E.

If this were done independently for each OSD in pool R for a given object O, there would be no inter-OSD dependency beyond what is already required in Ceph. Also, if pool R were composed of faster storage (say SSDs or SAS drives) and pool E were composed of slower media (such as SATA or spun-down drives), then pool R behaves as a cache for pool E.
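A small sketch of the copy/redirect bookkeeping described above, with the two pools modelled as plain dicts. The digest comparison is what makes the repeat "copy" to pool E a NOOP when the object has not been modified; all names here are illustrative, and the demotion is a straight copy rather than an actual shard-and-place.

<pre><code class="python">
import hashlib

def demote(name, pool_r, pool_e):
    """Copy object 'name' from the replicated pool R to the erasure-coded
    pool E and leave a redirect behind.  If pool E already holds an
    identical copy (the object was promoted earlier and never modified),
    the copy is a NOOP, as described above."""
    data = pool_r[name]
    digest = hashlib.sha256(data).hexdigest()
    existing = pool_e.get(name)
    if existing is None or existing["digest"] != digest:
        pool_e[name] = {"digest": digest, "data": data}
    pool_r[name] = ("REDIRECT", digest)            # stands in for the redirect object

def write(name, more_data, pool_r, pool_e):
    """A write that reaches the redirect first copies the object back from
    pool E into pool R (the promotion described above), then applies the
    write -- modelled here as an append.  The copy in pool E is left in
    place so that a later, unmodified demotion can again be a NOOP."""
    entry = pool_r.get(name)
    if isinstance(entry, tuple) and entry[0] == "REDIRECT":
        pool_r[name] = pool_e[name]["data"]
    pool_r[name] = pool_r[name] + more_data
</code></pre>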
*Staged, single pool*

The general idea could be that a pool can contain two types of objects: replicated and erasure coded, with a background task (akin to scrubbing) that turns replicated objects into erasure-coded objects under certain conditions. It would be complicated to modify the code accordingly, and since two pools can share the same hardware with a different hierarchy and rules, the complication is probably unnecessary.
h3. Proposed model

Of these options, we would propose the following:

* The creation of an erasure-encoding-aware PG
** The PG is configured with n and m, where n = the number of shards necessary to recover an object and m = the total number of shards created.
** The erasure-encoding/sharding function is run over the object, resulting in m shards being created.
*** The sharding process is carried out by the "primary" OSD in the PG.
*** Those shards will be named <object_handle>_<shard_suffix>_k_(m-1). The <shard_suffix> SHOULD be configurable. k = the specific shard within the m series. An example series, for object "example" with an object ID of <example_id>, n = 10 and m = 15, would be <example_id>_shard_00_14 through <example_id>_shard_14_14 (see the naming sketch after this list).
*** The metadata of the shard would include the following, at a minimum:
**** Original object name or handle
**** The n and m settings of the PG
**** The k of the shard
**** One or two hashes of the shard, to validate the health of the shard during scrub
*** Each shard is then dispersed to an OSD that is part of the PG.
** Each OSD in a given PG will use the stored hash(es) for each shard that it has stored to detect if the shard has become corrupted.
*** In case of corruption, the OSD that detected the corruption will signal the need for a re-shard to the primary OSD for that PG, for the object whose shard is corrupted.
*** The primary OSD will re-constitute the original object, and then re-shard it.
* The creation of a "redirect" object to support the use of erasure encoding as multi-tiered storage for Ceph.
** The non-eOSD (an OSD that is not erasure-aware) should not HAVE to know about the existence of a secondary tier of storage. However, it MAY know about it if desired.
** A filesystem shim or re-director would need to be provided to handle native OSDs.
** Some housekeeping process that is triggered by conditions (such as high- and low-water marks) could be used to move objects between "tiers".
*** The simplest model (which would meet the goals) would be an "evictor" that evicts old or least-recently-used objects from the replicated pool R and places them in the erasure-encoded pool E. The objects would return to pool R on use.
** A system that only implemented erasure-encoded pools would simply be a degenerate case of the system proposed here, in which case no redirection or housekeeping system would be necessary.
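The shard naming and per-shard metadata called out in the first bullet group can be made concrete with a short sketch (this is the naming sketch referenced above). Only the name format and the metadata fields come from the list; the dictionary layout and the choice of SHA-256 are illustrative assumptions.

<pre><code class="python">
import hashlib

def make_shards(object_id, shards, n, m, shard_suffix="shard"):
    """Name and annotate the m encoded shards of one object.  Names follow
    <object_id>_<shard_suffix>_<k>_<m-1>; the metadata carries the original
    object handle, the PG's n and m, the shard's k, and a hash that scrub
    can use to detect corruption."""
    assert len(shards) == m
    stored = {}
    for k, shard in enumerate(shards):
        name = "%s_%s_%02d_%02d" % (object_id, shard_suffix, k, m - 1)
        stored[name] = {
            "data": shard,
            "meta": {"object": object_id, "n": n, "m": m, "k": k,
                     "sha256": hashlib.sha256(shard).hexdigest()},
        }
    return stored

def scrub(stored):
    """Deep-scrub check: recompute each shard's hash and return the names of
    shards whose stored hash no longer matches, so the primary OSD can
    re-constitute the object and re-shard it."""
    return [name for name, s in stored.items()
            if hashlib.sha256(s["data"]).hexdigest() != s["meta"]["sha256"]]

# With n = 10 and m = 15 this produces example_id_shard_00_14 through
# example_id_shard_14_14, matching the example series above.
stored = make_shards("example_id", [b"\0" * 16] * 15, n=10, m=15)
assert scrub(stored) == []
</code></pre>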
h3. Work items

h3. Coding tasks

Etherpad: http://pad.ceph.com/p/Erasure_encoding_as_a_storage_backend

h3. Build / release tasks

# TBD

h3. Documentation tasks

# TBD