Continuous OSD Stress Testing and Analysis

Summary

Ceph is a robust and self-healing storage system. This blueprint describes an idea that would allow Ceph clusters to evaluate different configurations on different OSDs, apply stress to those OSDs, and measure their ability to handle that stress (bandwidth, throughput, and latency). Once such a system is in place, OSDs could be configured with a variety of filesystems, filesystem options, and system tunables. The continuous stress applied to OSDs, and the measurements taken from it, could be used to "kill" OSDs that perform poorly relative to their peers and reconfigure them with the current best-performing configuration. In this manner, newer configurations could be introduced, and if they work well, the system as a whole would slowly become more performant and converge on ideal configurations. Think of it as evolution in nature, survival of the fittest: the DNA (configuration) of well-performing OSDs spreads epidemically, while the DNA of poorly performing OSDs slowly becomes extinct.

In addition to finding the most ideal configuration, the results of the continuous benchmarking could also identify nodes that are "sick" due to network interface issues, RAID controller issues, or drives performing sub-optimally. This could possibly address http://tracker.newdream.net/issues/3849: we could check whether the current configuration performs well on other OSDs and assign a suspicion level, similar to how the Phi Accrual failure detector works.
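To make the Phi Accrual analogy concrete, here is a minimal sketch of a suspicion score for an OSD's benchmark latency. This is an illustration, not an existing Ceph API: it approximates the peer latency distribution as normal and returns -log10 of the tail probability, so a latency far outside the peer distribution yields a high suspicion level.

```python
import math
import statistics

def suspicion_level(sample_ms, peer_values_ms):
    """Phi-accrual-style suspicion: how unlikely is this OSD's latest
    benchmark latency given the latencies seen across its peers?
    Higher values mean the OSD looks more suspect."""
    mean = statistics.mean(peer_values_ms)
    stdev = statistics.stdev(peer_values_ms)
    # Tail probability of a latency this high under a normal
    # approximation of the peer distribution.
    z = (sample_ms - mean) / max(stdev, 1e-9)
    p_tail = 0.5 * math.erfc(z / math.sqrt(2))
    return -math.log10(max(p_tail, 1e-12))

peer_latencies_ms = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0]
print(suspicion_level(5.1, peer_latencies_ms))   # in line with peers: low
print(suspicion_level(25.0, peer_latencies_ms))  # far outside peers: high
```

An OSD whose suspicion level stays high across several benchmark rounds would be a candidate for closer inspection or for being marked out.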

Owners

  • Kyle Bader (DreamHost)

Interested Parties

  • Kyle Bader (DreamHost)

Current Status

Currently, there is OSD bench, which gets us close to what I would like to see. The issue with the way OSD bench currently works is that the OSD performs the work locally, so the results do not account for factors such as the network, CRC calculations, and PG map updates (the last may not be much of an issue).

Ideally, something similar to OSD bench could be driven by another node in the cluster, such as a handful of dedicated benchmarking clients; this would capture the network and CRC calculation work that must be performed in real workloads.

OSD bench would also need to report latency statistics, in addition to the throughput and duration of the benchmark.
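As a sketch of what an extended report could look like, the snippet below parses a hypothetical JSON result containing both throughput inputs and latency percentiles. The field names (`bytes_written`, `duration_sec`, `latency_ms`) are assumptions for illustration, not the current OSD bench output format.

```python
import json

# Hypothetical extended `osd bench` report; the field names are
# assumptions, not what OSD bench emits today.
report = json.loads("""
{
  "osd": 12,
  "bytes_written": 1073741824,
  "duration_sec": 9.2,
  "latency_ms": {"p50": 4.8, "p95": 11.2, "p99": 27.5}
}
""")

# Derive throughput in MB/s from the raw byte count and duration.
throughput_mb_s = report["bytes_written"] / report["duration_sec"] / (1024 * 1024)
print(f"osd.{report['osd']}: {throughput_mb_s:.1f} MB/s, "
      f"p99 latency {report['latency_ms']['p99']} ms")
```

Reporting percentiles rather than only an average matters here, since tail latency is often what distinguishes a "sick" OSD from a healthy one.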

Detailed Description

With a new OSD bench of sorts, a set of clients doing the benchmarking, and a store for data collection and analysis, we could actively mark low-performing OSDs out and wait for the PGs on those OSDs to remap and backfill, restoring the durability requirements configured by the operator. Once the PGs on such an OSD finish backfilling, it could be destroyed and recreated with the best-performing configuration, based on the data collected by the benchmarking clients. In this way the cluster finds its own ideal configuration by propagating the current best performers.
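The selection step above can be sketched as follows. The OSD names, configuration labels, and the 75% threshold are all illustrative assumptions; the point is only the shape of the decision: compare each OSD against the peer median and reprovision the stragglers with the configuration of the current best performer.

```python
import statistics

# Hypothetical per-OSD throughput scores (MB/s), tagged with the
# configuration each OSD currently runs; all names are illustrative.
scores = {
    "osd.0": ("xfs-defaults",   110.0),
    "osd.1": ("btrfs-defaults", 122.0),
    "osd.2": ("xfs-nobarrier",  131.0),
    "osd.3": ("ext4-defaults",   64.0),
}

THRESHOLD = 0.75  # assumed cutoff: kill OSDs below 75% of the peer median

median = statistics.median(s for _, s in scores.values())
best_config = max(scores.values(), key=lambda cs: cs[1])[0]
victims = [osd for osd, (_, score) in scores.items()
           if score < THRESHOLD * median]

for osd in victims:
    # In a real cluster this step would be `ceph osd out <id>`, waiting
    # for backfill to complete, then destroying and recreating the OSD
    # with best_config.
    print(f"{osd} -> reprovision with {best_config}")
```

Waiting for backfill before destroying the OSD is what keeps the operator's configured durability intact throughout the process.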

The added benefit is that your cluster becomes heterogeneous in its filesystems (think genetic diversity), so a devastating bug in one filesystem does not put your entire cluster at risk.

The downside is that PGs will be mapped to OSDs with different configurations, and the performance of each such PG will be that of its lowest-performing member. In my view this is a short-term loss, outweighed by the diversity and the promotion of ideal configurations. In the long term the system will become faster and more resilient.

Work items

Coding tasks

Client initiation of OSD benchmarks on an arbitrary OSD.
Collection and aggregation of bandwidth and latency output, to be used as actionable intelligence.
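As a sketch of the second task, the snippet below aggregates hypothetical bench samples per configuration and ranks configurations by median throughput, keeping median p99 latency alongside. The sample values and configuration labels are made up for illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical bench samples: (configuration, throughput MB/s, p99 ms).
samples = [
    ("xfs-defaults",   108.0, 12.1),
    ("xfs-defaults",   114.0, 10.9),
    ("btrfs-defaults",  96.0, 18.4),
    ("btrfs-defaults",  90.0, 21.0),
]

# Group throughput and latency samples by configuration.
by_config = defaultdict(lambda: ([], []))
for config, mb_s, p99 in samples:
    by_config[config][0].append(mb_s)
    by_config[config][1].append(p99)

# Rank configurations by median throughput (higher is better).
ranking = sorted(
    ((cfg, statistics.median(t), statistics.median(l))
     for cfg, (t, l) in by_config.items()),
    key=lambda row: row[1], reverse=True)

for cfg, med_t, med_l in ranking:
    print(f"{cfg}: median {med_t} MB/s, median p99 {med_l} ms")
```

Medians are used rather than means so that a single outlier benchmark run does not skew a configuration's score.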

Build / release tasks

  1. Task 1
  2. Task 2
  3. Task 3

Documentation tasks

  1. Task 1
  2. Task 2
  3. Task 3

Deprecation tasks

  1. Task 1
  2. Task 2
  3. Task 3