Reliability model

Overview

Current modeling looks at the durability of a single, arbitrary object. That object lives in a Placement Group, which is stored and replicated on disks in sites, so we must model the failures and recoveries of all of those components. The model will follow cascades of operations, any of which may succeed, fail or initiate its own recovery, which in turn depends on other components. The model will help operators make decisions that affect the durability of the file.

Schedule

See link.

Coding methods

Some comments regarding coding methods:
  1. Code is committed to a public branch of the repository named wip-<topic> the same day it is written, even if it is incomplete or not working.
  2. Branches are never rebased; they are renamed. For instance, if a rebase of wip-foo is necessary, the branch is first copied to wip-foo-<timestamp> to preserve history.
  3. Unit tests are written at the same time as the code, not at a later time. Unit tests are always run before manual tests.

What are the benefits of this feature?

During the development of any storage system, many decisions are made by the designers, while others are postponed and left to the system's administrators. Such decisions are difficult to make without knowledge of the system's behavior in production. To ensure the system's sustainability, more attention needs to be paid to the trade-offs associated with these choices. The options associated with erasure coding, in particular, will affect the durability of the file.

Assumptions

The current model (source code) models failure events with a Poisson distribution. Previous work has clearly demonstrated that "real" disk failures do not follow a Poisson distribution. A summary of that paper can be found here and the complete paper here. That work recommends the Weibull and Gamma distributions instead. It may be better to replace the Poisson model with those distributions, or to add them as options.
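To make the difference concrete, here is a minimal sketch comparing the two assumptions. It is not taken from the model's source code: the function names and the parameter values are hypothetical. A Poisson failure process has exponentially distributed times to failure, and the Weibull distribution reduces to that case when its shape parameter is 1. A shape below 1 puts more of the failure probability early in a component's life.

```python
import math

HOURS_PER_YEAR = 8760


def p_fail_exponential(rate_per_hour, hours):
    """Probability of at least one failure within `hours` when failures
    arrive as a Poisson process with a constant rate."""
    return 1.0 - math.exp(-rate_per_hour * hours)


def p_fail_weibull(scale_hours, shape, hours):
    """Probability of failure within `hours` for a Weibull-distributed
    lifetime. shape == 1 gives the exponential case, shape < 1 means a
    decreasing failure rate, shape > 1 an increasing one."""
    return 1.0 - math.exp(-((hours / scale_hours) ** shape))


# Hypothetical parameters, chosen only for illustration.
rate = 2.5e-6            # failures per hour
scale = 1.0 / rate       # Weibull scale matching that rate when shape == 1
one_year = HOURS_PER_YEAR

p_exp = p_fail_exponential(rate, one_year)
p_weibull_1 = p_fail_weibull(scale, 1.0, one_year)
p_weibull_early = p_fail_weibull(scale, 0.7, one_year)

print("exponential:        %.4f" % p_exp)
print("Weibull, shape 1.0: %.4f" % p_weibull_1)
print("Weibull, shape 0.7: %.4f" % p_weibull_early)
```

With the same scale, the shape 0.7 case reports a noticeably higher first-year failure probability than the constant-rate case, which is the kind of effect a Poisson-only model cannot express.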
Component reliability is specified using FIT (failures per 10^9 hours). The advantage of using FIT is that the FIT values of different components can be added. FIT can easily be transformed into AFR (Annualized Failure Rate) with the formula AFR = (FIT * 8760) / 10^9.
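The conversion can be sketched in a few lines. The helper names and the component FIT values below are hypothetical and are not taken from the model's source code. Only the formula comes from the text above.

```python
HOURS_PER_YEAR = 8760
FIT_HOURS = 1e9  # FIT counts failures per 10^9 device-hours


def fit_to_afr(fit):
    """Convert a FIT rate to an Annualized Failure Rate (as a fraction)."""
    return fit * HOURS_PER_YEAR / FIT_HOURS


def afr_to_fit(afr):
    """Inverse conversion: AFR (as a fraction) back to FIT."""
    return afr * FIT_HOURS / HOURS_PER_YEAR


def combined_fit(*fits):
    """FIT rates of components that must all work are simply added."""
    return sum(fits)


# Hypothetical component rates, for illustration only.
disk_fit = 2000.0
controller_fit = 500.0

total_fit = combined_fit(disk_fit, controller_fit)
total_afr = fit_to_afr(total_fit)

print("combined FIT: %.0f" % total_fit)
print("combined AFR: %.2f%%" % (total_afr * 100))
```

Adding the rates first and converting once keeps the arithmetic in a single unit, which is the convenience the FIT representation is meant to provide.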

More technical details on the model

See link.

GUI

The current GUI is poorly designed, though it is currently the best way to experiment with different parameters. It has four frames for setting the parameters of Disk, RAID type, RADOS copies and RADOS sites. I will add at least one more frame for erasure coding parameters, and I will also improve the design somewhat.

Project Update: New RelyGUI

Final report

The final report can be found here.