Reliability model » History » Version 2

Jessica Mack, 06/01/2015 09:27 PM

h1. Reliability model

h4. Overview

The current model looks at the durability of a single, arbitrary object. That object lives in a Placement Group, which is stored and replicated on disks across sites, so we must model the failures and recoveries of all of those components. The model follows cascades of operations, any of which may succeed, fail or initiate its own recovery, which in turn depends on other components. The model will help operators make decisions that affect the durability of the file.

h4. Schedule

See the [[Tentative schedule]].

h4. Coding methods

Some comments regarding coding methods:
# Code is committed to a public branch of the repository named wip-<topic> the same day it is written, even if it is incomplete or not working.
# Branches are never rebased; they are renamed. For instance, if a rebase of wip-foo is necessary, the branch is copied to wip-foo-<timestamp> to preserve history.
# Unit tests are written at the same time as the code, not later. Unit tests are always run before manual tests.

h4. What are the benefits of this feature?

During the development of any storage system, many decisions are made by the designers, while others are postponed and left to the system's administrators. Such decisions are difficult to make without knowledge of the system's behavior in production. To ensure the system's sustainability, more attention needs to be paid to the trade-offs associated with these choices. In particular, the options chosen for erasure coding directly affect the durability of the file.

h4. Assumptions

The current model ("source code":https://github.com/ceph/ceph-tools/tree/master/models/reliability) models failure events with a Poisson distribution. Previous work has demonstrated that real disk failures do not follow a Poisson distribution. A summary of that paper can be found "here":http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/ and the complete paper "here":http://www.cs.cmu.edu/~bianca/fast/. The recommended distributions are the Weibull and Gamma distributions. It may be better to replace the Poisson distribution with them, or to add them as options.
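
The difference between the two assumptions can be illustrated with a minimal sketch. It is not taken from the ceph-tools model: the function names, the characteristic life and the shape values are hypothetical and chosen only for illustration. A Poisson failure process has a constant failure rate (exponential lifetimes), whereas a Weibull lifetime lets the failure rate fall or rise with the age of the disk.

```python
import math


def exponential_failure_prob(t_hours, rate_per_hour):
    """P(failure by time t) for a constant failure rate (Poisson process)."""
    return 1.0 - math.exp(-rate_per_hour * t_hours)


def weibull_failure_prob(t_hours, scale_hours, shape):
    """P(failure by time t) for a Weibull-distributed lifetime.

    shape < 1: failure rate decreases with age (early failures dominate)
    shape == 1: identical to the exponential case with rate 1 / scale_hours
    shape > 1: failure rate increases with age (wear-out)
    """
    return 1.0 - math.exp(-((t_hours / scale_hours) ** shape))


def weibull_hazard(t_hours, scale_hours, shape):
    """Instantaneous failure rate h(t) = (k / s) * (t / s) ** (k - 1)."""
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1)


if __name__ == "__main__":
    scale = 1.0e6  # hypothetical characteristic life, in hours
    for shape in (0.7, 1.0, 1.3):  # hypothetical shape parameters
        for years in (1, 3, 5):
            t = years * 8760  # hours in service
            print(
                f"shape={shape} age={years}y "
                f"hazard={weibull_hazard(t, scale, shape):.3e}/h "
                f"P(fail)={weibull_failure_prob(t, scale, shape):.4f}"
            )
```

With a shape of 1 the Weibull curve coincides with the exponential one, so a Weibull-based model could in principle reproduce the current behavior as a special case while also allowing age-dependent failure rates.
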

Component reliability is specified in FITs (failures in time: one FIT is one failure per 10^9 hours). The advantage of using FITs is that the FIT values of different components can be added. A FIT value is easily converted to an AFR (Annualized Failure Rate) with the formula AFR = (FIT * 8760) / 10^9, where 8760 is the number of hours in a year.
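
The conversion above can be sketched in a few lines of Python. The helper names and the component FIT values below are hypothetical and are not taken from the model's source code; only the formula AFR = (FIT * 8760) / 10^9 comes from the text.

```python
HOURS_PER_YEAR = 8760      # 365 days * 24 hours
FIT_PERIOD_HOURS = 1.0e9   # one FIT = one failure per 10^9 hours


def fit_to_afr(fit):
    """Convert a FIT value to an Annualized Failure Rate (a fraction per year)."""
    return fit * HOURS_PER_YEAR / FIT_PERIOD_HOURS


def afr_to_fit(afr):
    """Inverse conversion: AFR (a fraction per year) back to FIT."""
    return afr * FIT_PERIOD_HOURS / HOURS_PER_YEAR


def combined_fit(*component_fits):
    """FIT values of components are additive, as stated in the text."""
    return sum(component_fits)


if __name__ == "__main__":
    disk_fit = 1000        # hypothetical value, for illustration only
    controller_fit = 200   # hypothetical value, for illustration only
    total = combined_fit(disk_fit, controller_fit)
    print(f"total FIT = {total}, AFR = {fit_to_afr(total):.4%}")
```

With these made-up inputs the combined 1200 FIT corresponds to an AFR of roughly 1.05% per year.
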

h4. More technical details on the model

See [[Technical details on the model]].

h4. GUI

The current GUI is poorly designed, but it is currently the best way to experiment with different parameters. It has four frames for setting the parameters of the disk, RAID type, RADOS copies and RADOS sites. I will add at least one more frame for erasure coding parameters and will also improve the design somewhat.

h3. Project Update: [[New RelyGUI]]

h4. Final report

The final report can be found [[Final report|here]].