h1. Final report
by Veronica Estrada Galinanes (koleosfuscus)
My webpage: http://members.unine.ch/veronica.estrada/index.html
{{toc}}
h3. Summary
The goal of this work is to facilitate the design and operation of a storage cluster by modelling the impact that different replication schemes have on the durability of the stored content. The model is based on Ceph but can be applied to any other object-based storage system. Ceph is a flexible platform for constructing a cluster from different “storage bricks”. The software used for this report is available online and open source. We provide an outline analysis of system reliability and release the Markov model used for the irregular fault-tolerant codes that are being implemented in Ceph.
 
This project was done in the context of Google Summer of Code 2014, in which Ceph participated as a mentoring organization for the first time. "Link":http://ceph.com/gsoc2014/
 
The mentors of this project are Kyle Bader and Loic Dachary.
h3. Background
In order to estimate durability, we need to calculate the system reliability. Note that durability differs from availability. Availability is the probability that an object is accessible at any point in time. Durability is the probability that the object is not permanently lost after a failure. While a simple formula for the former is the ratio between the uptime and the total lifetime of the system, the latter is usually estimated with Markov models.
 
The reliability of a system is a first indication of trustworthiness. Redundancy is the standard approach to increase the reliability of a storage system. The major mechanisms are traditional RAID in its different configurations, replication, and MDS and non-MDS erasure codes.
 
MDS is an acronym for maximum distance separable. In the context of storage systems, such codes are denoted (k, m)-MDS, where m = n - k is the number of arbitrary failures that the code can tolerate. Fault tolerance is characterized by the distance m + 1, the number of erasures that causes data loss. Codes that do not belong to this category, non-MDS codes, can tolerate more failures than their guaranteed minimum, but these "extra" tolerated failures are not arbitrary patterns.
h3. Ceph in a nutshell
Ceph is a distributed storage system designed for high throughput and low latency at petabyte scale. The Ceph object store, also known as RADOS, is the intelligence inherent in the Ceph building blocks used to construct a storage cluster. We focus our analysis on a cluster constructed from object storage devices (OSDs). The CRUSH algorithm calculates where each object lives.
h4. RADOS
It is the default storage mechanism used to distribute objects across a Ceph cluster. Objects are stored either in replicated or in erasure-coded pools.
 
A pool contains pg_num placement groups (PGs), specified at the moment of pool creation. The number of PGs increases as the cluster scales, to offer better load balancing. Each pool can have a different level of replication.
h4. Erasure-coded pool
Each object inside an erasure-coded pool is represented by k data chunks and m coding chunks. The pool uses k + m OSDs and supports the loss of m of them.
 
Erasure codes save a substantial amount of storage space while providing high resilience. The storage savings are paid for with higher bandwidth requirements: reading and writing content requires establishing more network connections. Moreover, in a failure scenario, repairing one missing block requires reading and writing multiple blocks.
 
A typical use of an erasure-coded pool is cold storage, or as part of a hybrid solution to reduce the price of storage. For more information on this project see http://ceph.com/docs/firefly/dev/erasure-coded-pool/.
h4. Replicated pool
Replication is most widely accepted for cases where high throughput is a requirement. Replication can be seen as a special case of erasure coding with k = 1.
h4. Locally Repairable Codes
The bandwidth overhead introduced by erasure codes is a problem in geographically dispersed data centers. The time needed to repair an object has an impact on its durability. One of the most attractive reasons to implement LRC is to reduce repair time and, consequently, to increase both the availability and the durability of the objects.
 
The implementation of LRC in Ceph is strongly based on Xorbas (Sathiamoorthy, Asteris, Papailiopoulos, Dimakis, Vadali, Chen, & Borthakur, 2013). In this code, non-MDS and MDS codes are combined. Details on the project are kept up to date "here":https://wiki.ceph.com/Planning/Blueprints/Submissions/osd%3A_Locally_repairable_code
h3. Reliability estimation
Reliability R(t) is a standard measure of the dependability of a system. There are many ways to estimate the reliability of a system, but using two different methods to compare and validate results is unfortunately not straightforward: the +variety of assumptions+ made by different tool designers can produce meaningless results. This problem has been documented before (Schroeder and Gibson, 2007; Greenan, Plank & Wylie, 2010).
 
The analysis of system reliability is commonly done with +Markov models+. Obtaining meaningful metrics for real-world applications is a challenge. Outside the controlled environment of a simulation, there are many components and subcomponents that can fail. How these failures occur and how they are repaired determines the system's status. However, the number of considered states should be kept to a minimum to keep the problem tractable, so there is a trade-off between the tractability and the meaningfulness of the model. Another relevant problem is knowing the +failure distribution+. For the sake of simplicity, model designers often assume that failures appear at a constant rate, i.e. follow a Poisson process. In reality, failures do not follow a Poisson distribution; Schroeder and Gibson (2007), as well as other authors, recommend the Weibull or Gamma distribution.
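To illustrate why the distribution matters, the sketch below compares the probability of a disk failing in its first year under an exponential (constant-rate) model and under a Weibull model with the same mean lifetime. The MTTF of 1,000,000 hours and the shape parameter 0.7 are illustrative assumptions, not values taken from this report.

<pre><code class="python">
# Compare the constant-failure-rate (exponential) assumption with a Weibull
# model of the same mean lifetime. Shape < 1 gives a decreasing hazard rate.
import random
from math import gamma

mttf_h = 1_000_000.0                     # assumed mean time to failure (hours)
shape = 0.7                              # assumed Weibull shape parameter
scale = mttf_h / gamma(1 + 1 / shape)    # scale chosen so the mean matches mttf_h

rng = random.Random(42)
trials = 200_000
first_year_h = 8760.0

exp_early = sum(rng.expovariate(1 / mttf_h) < first_year_h for _ in range(trials)) / trials
wbl_early = sum(rng.weibullvariate(scale, shape) < first_year_h for _ in range(trials)) / trials

print("P(failure in first year), exponential:", exp_early)
print("P(failure in first year), Weibull:    ", wbl_early)
</code></pre>

With the same mean lifetime, the Weibull model predicts noticeably more early failures than the exponential model, which is exactly the kind of difference that gets lost when a constant failure rate is assumed.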
h4. Combinatorial analysis
A very simple but useful method is combinatorial analysis (Shooman, 2002). It is the first step to compare and understand the reliability offered by different redundancy methods, or by the same method with different parameters. In this analysis we can skip Markov models and rigorous failure models. We analyze the redundant components of the system, in our case the replicas or blocks in a stripe, with basic probability and reliability block diagrams. Such a diagram represents the number of paths that connect cause and effect. In our case, the cause is a request for an object and the effect is the answer that we obtain from the Ceph storage: if we get the object the system succeeds, and if we cannot get the object the system fails. The basic block diagrams represent a serial and a parallel configuration.
 
+Block placement+ has an important effect on object durability. To maintain redundancy, every block in Ceph must be mapped to a different OSD. This is achieved with the CRUSH algorithm. We simply assume that each OSD fails independently. This assumption permits skipping the burden of conditional probabilities: the expansion of union terms simplifies to additions, and intersection terms become products.
 
The probability that a block xi is alive is P(xi). For further clarity, we assume that after x hours of operation each block has P(xi) = p and probability of failure q = 1 - p.
h4. Serial configuration
Serial connections require that all blocks are alive. There is only one path connecting cause and effect. The reliability of a series chain is always smaller than the smallest reliability in the chain.
R(t) = P(x1, x2, ..., xn) = P(x1)P(x2)...P(xn) = p^n
h4. Parallel configuration
Parallel connections have multiple paths between input and output. If at least one path does not fail, the object can be read. The parallel configuration is easier to calculate using the probability of failure.
R(t) = P(x1 + x2 + ... + xn) = 1 - (1 - p)(1 - p)...(1 - p) = 1 - q^n
The parallel configuration represents the classical replication mode.
h4. r-out-of-n configuration
This useful model is applied to analyze MDS erasure codes. In a (k, m) Reed-Solomon code, any k blocks are needed to read the object; in other words, the code tolerates m failures. When the components are independent, the formula to calculate reliability is derived from the binomial distribution.
!RS_equation.jpg! 
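For reference, under the independence assumption this r-out-of-n reliability (with r = k and n = k + m blocks, each alive with probability p = 1 - q) is the standard binomial sum R(t) = sum for i = k to n of C(n, i) p^i q^(n-i).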
 
Now we can roughly estimate the reliability for various redundancy configurations and various block probabilities. To construct a reference table we choose common configurations for RS. In particular, (3,6) RS has the same 2x storage overhead as the popular triple-replication option, (6,3) RS is the code used in the study of erasure codes for the Windows Azure platform at Microsoft (Huang et al., 2012), and (10,4) RS is the code used by Sathiamoorthy et al. (2013) in Xorbas at Facebook.
_Reference Table - Reliability values for various configurations_
|p, q |3-replication |4-replication |5-replication |(3,6) RS |(6,3) RS |(10,4) RS |
|0.95, 0.05|3 nines|5 nines|6 nines|7 nines|3 nines|3 nines|
|0.98, 0.02|5 nines|6 nines|8 nines|10 nines|4 nines|5 nines|
|0.99, 0.01|6 nines|8 nines|10 nines|12 nines|5 nines|6 nines|
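The values in the table can be reproduced with the short Python sketch below. It only assumes what was stated above: blocks fail independently with probability q, replication loses the object when all copies are lost, and a (k, m) MDS code loses it when more than m of the k + m blocks are lost.

<pre><code class="python">
# Reproduce the reference table: number of nines of durability for each
# configuration, assuming independent block failures with probability q.
from math import comb, floor, log10

def nines(p_loss):
    """Number of nines, e.g. a loss probability of 1e-5 gives 5 nines."""
    return floor(-log10(p_loss))

def replication_loss(q, copies):
    """An object is lost only if every replica is lost."""
    return q ** copies

def mds_loss(q, k, m):
    """(k, m) MDS code: loss when more than m of the k + m blocks fail."""
    n, p = k + m, 1.0 - q
    return sum(comb(n, i) * q**i * p**(n - i) for i in range(m + 1, n + 1))

for q in (0.05, 0.02, 0.01):
    row = [replication_loss(q, c) for c in (3, 4, 5)]
    row += [mds_loss(q, 3, 6), mds_loss(q, 6, 3), mds_loss(q, 10, 4)]
    print(q, [nines(x) for x in row])
</code></pre>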
h3. Tools
h4. Ceph reliability model
"Source and documentation":https://github.com/ceph/ceph-tools/t...ls/reliability
The tool models disks, some RAID configurations, RADOS (replicated pools) and also considers a multi-site environment.
The metrics available are listed below: 
* probability of object survival (per year)
* probability of loss due to site failures (per year)
* probability of loss due to drive failures (per year)
* probability of loss due to NREs during recovery (per year)
* probability of loss due to replication failure (per year)
* expected data loss per petabyte (per year)
It was developed by another programmer some time ago and does not support the new erasure codes in Ceph.
h4. High-Fidelity Reliability Simulator (HFRS)
"Source":http://www.kaymgee.com/Kevin_Greenan/Software.html
For detailed documentation see "Greenan thesis":http://www.kaymgee.com/Kevin_Greenan/Publications_files/kmgreen-thesis.pdf
 
Kevin Greenan defended his PhD thesis, titled "Reliability and Efficiency in Erasure-Coded Storage Systems", in 2009. HFRS (Greenan and Wylie, 2008) is evaluated in Greenan's thesis. The tool is developed in Python and is open source.
 
This tool is a Monte Carlo simulator that accepts failure/repair distributions for the analysis of linear erasure codes and sector failures. The tool can simulate rare events, i.e. events that would most likely remain unobserved during a standard simulation run. This feature is necessary when the simulated system is highly fault tolerant.
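As a toy illustration of why rare-event simulation is needed (this is not HFRS code, just a minimal sketch assuming exponential failure and repair times and hypothetical MTTF/MTTR values), a naive Monte Carlo run over a 14-block stripe that tolerates 4 failures will typically observe no data loss at all, even after many trials:

<pre><code class="python">
# Naive Monte Carlo estimate of the probability that more than m of n blocks
# are simultaneously failed during the mission time. Exponential failure and
# repair times make the number of failed blocks a simple birth-death chain.
import random

def simulate_data_loss(n=14, m=4, mttf_h=1e6, mttr_h=24.0,
                       mission_h=10 * 8760, trials=100_000, seed=1):
    rng = random.Random(seed)
    fail_rate = 1.0 / mttf_h      # per-block failure rate (1/h), assumed
    repair_rate = 1.0 / mttr_h    # per-block repair rate (1/h), assumed
    losses = 0
    for _ in range(trials):
        t, failed = 0.0, 0
        while t < mission_h:
            lam_fail = (n - failed) * fail_rate
            lam_repair = failed * repair_rate
            total = lam_fail + lam_repair
            t += rng.expovariate(total)
            if t >= mission_h:
                break
            if rng.random() < lam_fail / total:
                failed += 1
                if failed > m:        # more erasures than the code tolerates
                    losses += 1
                    break
            else:
                failed -= 1
    return losses / trials

print("Estimated P(data loss over 10 years):", simulate_data_loss())
</code></pre>

With these parameters the true loss probability is far below 1/trials, so the naive estimate is simply zero; this is the situation that the rare-event support in HFRS is designed for.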
 
The metrics provided by HFRS are listed below:
* system unreliability
* mean time to data loss (MTTDL)
 
The tool provides models for MDS codes, Weaver codes, flat codes and simple product codes. To model irregular fault-tolerant codes, the simulator expects as input the fault tolerance of the code, expressed as a fault tolerance vector (FTV) or a fault tolerance matrix (FTM).
h4. Reliability for LRC codes in HFRS  
We developed a Markov model for the (10,6,5) LRC to include the effect of local repairs on the system reliability. The transition rates from left to right represent disk failures. Each state is identified by the number of remaining blocks, e.g. the first state, without failures, has 10+6 blocks. There are two possible states after the fourth failure: the code could be repairable or not. The transition is decided according to the failure patterns that the code tolerates.
 
pd1 and pd2 are "decodability ratios", i.e. the likelihood that the code tolerates a 5th and a 6th failure, respectively. The values are obtained through a brute-force enumeration of the failure patterns. In this model we use pd1 = 0.88 and pd2 = 0.82.
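The sketch below shows how such a brute-force enumeration can be organized. The is_decodable() predicate is a deliberately simplified stand-in for the real (10,6,5) LRC decoder: it only uses single-erasure repair inside a local group plus the surviving global parities, so the ratios it prints are a lower bound and are not expected to match the pd1 = 0.88 and pd2 = 0.82 values used above.

<pre><code class="python">
# Brute-force enumeration of erasure patterns for a (10,6,5)-style stripe:
# 10 data blocks in two local groups of 5, one local parity per group, and
# 4 global parities. The decodability test is a simplified, conservative
# model, not the exact Ceph LRC plugin behaviour.
from itertools import combinations
from math import comb

K, GROUPS, GLOBAL = 10, 2, 4
DATA = range(K)                                   # blocks 0..9
LOCAL = range(K, K + GROUPS)                      # blocks 10, 11
GLOBALS = range(K + GROUPS, K + GROUPS + GLOBAL)  # blocks 12..15

def group_of(d):
    return d // 5          # data blocks 0-4 -> group 0, 5-9 -> group 1

def is_decodable(erased):
    lost_data = {b for b in erased if b in DATA}
    # local repair: a group with a single erasure among its 6 blocks recovers it
    for g in range(GROUPS):
        members = [d for d in DATA if group_of(d) == g] + [K + g]
        if sum(b in erased for b in members) == 1:
            lost_data -= {b for b in lost_data if group_of(b) == g}
    # remaining data erasures must be covered by surviving global parities
    alive_globals = sum(b not in erased for b in GLOBALS)
    return len(lost_data) <= alive_globals

def decodability_ratio(failures):
    blocks = list(DATA) + list(LOCAL) + list(GLOBALS)
    ok = sum(is_decodable(set(pattern)) for pattern in combinations(blocks, failures))
    return ok / comb(len(blocks), failures)

print("fraction of 5-failure patterns tolerated:", decodability_ratio(5))
print("fraction of 6-failure patterns tolerated:", decodability_ratio(6))
</code></pre>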
 
Figure 1 - _Markov model for (10,6,5) LRC_
!Markov_Ceph_10_6_5_LRC.png!
 
The description of the model (needed as HFRS input) can be found "here":ceph_10_6_5.disk.model
 
Results of the simulator:
Analytic MTTDL:  1.1576683533e+15
Analytic prob. of failure: 7.56663e-11
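For readers who want to see where an "analytic MTTDL" comes from, the sketch below computes the expected time to absorption of a small absorbing Markov chain. The chain is a generic birth-death approximation with hypothetical failure and repair rates and it ignores the pd1/pd2 branching, so it only illustrates the mechanics and does not reproduce the ceph_10_6_5.disk.model results quoted above.

<pre><code class="python">
# Analytic MTTDL from an absorbing Markov model: expected time to absorption
# starting from the all-healthy state, t = (-Q)^-1 * 1 over the transient states.
import numpy as np

lam = 1.0 / 1_000_000.0   # per-block failure rate (1/h), assumed
mu = 1.0 / 24.0           # per-block repair rate (1/h), assumed
n, tolerated = 16, 4      # 16 blocks in the stripe, loss on the 5th concurrent failure

states = tolerated + 1    # transient states: 0..tolerated failed blocks
Q = np.zeros((states, states))
for f in range(states):
    fail = (n - f) * lam
    repair = f * mu
    if f + 1 < states:
        Q[f, f + 1] = fail
    if f > 0:
        Q[f, f - 1] = repair
    Q[f, f] = -(fail + repair)   # diagonal also accounts for the flow into data loss

mttdl_h = np.linalg.solve(-Q, np.ones(states))[0]
print("MTTDL (hours):", mttdl_h)
</code></pre>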
h3. Conclusion
The results presented here are an outline of the methodology to analyze system reliability and to understand the durability of an object when different redundancy methods are implemented. 
We suggest future work on the automation of failure pattern generators for irregular fault-tolerant codes like LRC. Brute-force analysis (as we did in this work) is still an alternative for small codes like the (10,6,5) LRC.
h3. Acknowledgements
The author would like to thank Kevin Greenan, Jay Wylie, and the mentors of this project for their feedback.
h3. References
Greenan, K., & Wylie, J. J. (2008). High-fidelity reliability simulation of XOR-based erasure codes (poster submission/extended abstract). In FAST.

Greenan, K. M., Plank, J. S., & Wylie, J. J. (2010, June). Mean time to meaningless: MTTDL, Markov models, and storage system reliability. In Proceedings of the 2nd USENIX conference on Hot topics in storage and file systems (pp. 5-5). USENIX Association.

Huang, C., Simitci, H., Xu, Y., Ogus, A., Calder, B., Gopalan, P., ... & Yekhanin, S. (2012, June). Erasure Coding in Windows Azure Storage. In USENIX Annual Technical Conference (pp. 15-26).

Sathiamoorthy, M., Asteris, M., Papailiopoulos, D., Dimakis, A. G., Vadali, R., Chen, S., & Borthakur, D. (2013, March). XORing elephants: Novel erasure codes for big data. In Proceedings of the VLDB Endowment (Vol. 6, No. 5, pp. 325-336). VLDB Endowment.

Schroeder, B., & Gibson, G. A. (2007, February). Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In FAST (Vol. 7, pp. 1-16).

Shooman, M. (2002). Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design. John Wiley & Sons, New York, NY, USA. ISBN 0471293423.