Jessica Mack, 06/01/2015 09:27 PM
1 | 1 | Jessica Mack | h1. Reliability model |
---|---|---|---|
2 | 1 | Jessica Mack | |
3 | 1 | Jessica Mack | h4. Overview |
4 | 1 | Jessica Mack | |
5 | 1 | Jessica Mack | Current modeling looks at the durability of a single, arbitrary object. That object lives in a Placement Group, which is stored and replicated on disks across sites, so we must model the failures and recoveries of all of those components. The model follows cascades of operations, any of which may succeed, fail, or initiate its own recovery, which in turn depends on other components. The model will help operators make decisions that affect the durability of the object. |
6 | 1 | Jessica Mack | |
7 | 1 | Jessica Mack | h4. Schedule |
8 | 1 | Jessica Mack | |
9 | 1 | Jessica Mack | See [[Tentative schedule|link.]] |
10 | 1 | Jessica Mack | |
11 | 1 | Jessica Mack | h4. Coding methods |
12 | 1 | Jessica Mack | |
13 | 1 | Jessica Mack | Some comments regarding coding methods: |
14 | 1 | Jessica Mack | # Code is committed to a public branch of the repository named wip-<topic> the same day it is written, even if it is incomplete or not working. |
15 | 1 | Jessica Mack | # Branches are never rebased in place; they are copied under a new name first. For instance, if a rebase of wip-foo is necessary, the branch is copied to wip-foo-<timestamp> to preserve history. |
16 | 1 | Jessica Mack | # Unit tests are written at the same time as the code, not at a later time. Unit tests are always run before manual tests. |
17 | 1 | Jessica Mack | |
18 | 1 | Jessica Mack | |
19 | 1 | Jessica Mack | h4. What are the benefits of this feature? |
20 | 1 | Jessica Mack | |
21 | 1 | Jessica Mack | During the development of any storage system, many decisions are made by the designers while others are postponed and left to the system's administrators. Such decisions are difficult to make without knowledge of the system's behavior in production. To ensure the system's sustainability, more attention needs to be paid to the trade-offs associated with these choices. In particular, the options chosen for erasure coding directly affect the durability of the file. |
22 | 1 | Jessica Mack | |
23 | 1 | Jessica Mack | h4. Assumptions |
24 | 1 | Jessica Mack | |
25 | 1 | Jessica Mack | The current model ("source code":https://github.com/ceph/ceph-tools/tree/master/models/reliability) models failure events with a Poisson distribution. Previous work has clearly demonstrated that "real" disk failures do not follow a Poisson distribution. A summary of that work can be found "here":http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/ and the complete paper "here":http://www.cs.cmu.edu/~bianca/fast/. The recommended distributions are the Weibull and Gamma distributions. It may be better to replace the Poisson model with those distributions, or to add them alongside it. |
26 | 1 | Jessica Mack | Component reliability is specified using FIT (number of failures per 10^9 hours). The advantage of using FIT is that the FIT values of different components can be added. FIT can easily be converted to AFR (Annualized Failure Rate) with the formula AFR = (FIT * 8760)/10^9, where 8760 is the number of hours in a year. |
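
The FIT to AFR conversion above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not code taken from the model: the function names and the component FIT values are hypothetical.

```python
HOURS_PER_YEAR = 8760   # 365 days * 24 hours
FIT_SCALE = 1e9         # FIT counts failures per 10^9 hours


def fit_to_afr(fit):
    """Convert a FIT value to an Annualized Failure Rate (as a fraction)."""
    return fit * HOURS_PER_YEAR / FIT_SCALE


def afr_to_fit(afr):
    """Inverse conversion: AFR (as a fraction) back to FIT."""
    return afr * FIT_SCALE / HOURS_PER_YEAR


# Hypothetical FIT values, chosen only to show that FITs add up.
disk_fit = 5000.0
controller_fit = 1000.0

total_fit = disk_fit + controller_fit   # FIT of the combined component
total_afr = fit_to_afr(total_fit)       # 6000 * 8760 / 10^9 = 0.05256

if __name__ == "__main__":
    print("total FIT: %.0f, AFR: %.3f%%" % (total_fit, total_afr * 100))
```

With these made-up numbers, a combined FIT of 6000 corresponds to an AFR of about 5.3%.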
27 | 1 | Jessica Mack | |
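
To illustrate why the choice of distribution matters, here is a minimal sketch of the Weibull alternative mentioned above, using only the Python standard library. It is not part of the current model: the function names and parameter values are assumptions made for this example. A Poisson failure process has a constant hazard rate, which is the special case of a Weibull distribution with shape 1; any other shape lets the failure rate change with the age of the disk.

```python
import math
import random


def weibull_hazard(t, scale, shape):
    """Instantaneous failure rate at age t: (k/s) * (t/s)^(k-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_survival(t, scale, shape):
    """Probability that a component is still alive at age t."""
    return math.exp(-((t / scale) ** shape))


def sample_failure_times(n, scale, shape, seed=0):
    """Draw n failure times; random.weibullvariate takes (scale, shape)."""
    rng = random.Random(seed)
    return [rng.weibullvariate(scale, shape) for _ in range(n)]


if __name__ == "__main__":
    scale = 1000.0  # hypothetical characteristic life, in hours
    for shape in (1.0, 1.5):  # 1.0 is the constant-rate (Poisson) case
        early = weibull_hazard(100.0, scale, shape)
        late = weibull_hazard(2000.0, scale, shape)
        print("shape=%.1f hazard at 100h=%.6f at 2000h=%.6f"
              % (shape, early, late))
```

With shape 1.0 the two hazard values are identical, while with shape 1.5 the hazard grows with age. The shape and scale would have to be fitted to real failure data before being used in the model.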
28 | 1 | Jessica Mack | h4. More technical details on the model |
29 | 1 | Jessica Mack | |
30 | 1 | Jessica Mack | See [[Technical details on the model|link.]] |
31 | 1 | Jessica Mack | |
32 | 1 | Jessica Mack | h4. GUI |
33 | 1 | Jessica Mack | |
34 | 1 | Jessica Mack | The current GUI is poorly designed, but it is currently the best way to experiment with different parameters. It has four frames, one each for the Disk, RAID type, RADOS copies and RADOS sites parameters. At a minimum, I will add another frame for erasure coding parameters. I will also improve the design somewhat. |
35 | 1 | Jessica Mack | |
36 | 1 | Jessica Mack | h3. Project Update: [[New RelyGUI]] |
37 | 2 | Jessica Mack | |
38 | 2 | Jessica Mack | h4. Final report |
39 | 2 | Jessica Mack | |
40 | 2 | Jessica Mack | The final report can be found [[Final report|here]]. |