Feature #1007

closed

qa: osd failure and cluster recovery test(s)

Added by Sage Weil about 13 years ago. Updated over 12 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
qa
Target version:
% Done:
0%

Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

We need tests of OSD failures that verify the cluster is able to recover. Eventually this will need to be fleshed out to include a variety of failure scenarios that try to get good coverage on the peering and recovery code. We can start out with some pretty simple tests, though:

- Restart an OSD once, or every few minutes, and verify that all PGs get back to active+clean, possibly within some time bound.
- Stop an OSD and mark it out. Continue operation for a while (dirtying lots of objects), then re-add the OSD. This exercises code paths similar to a regular cluster expansion.
- Restart multiple (or all) OSDs simultaneously.
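
As a rough illustration of the "verify all active+clean within some time bound" check, here is a minimal Python sketch. It is not the teuthology implementation. The names `all_active_clean` and `wait_for_clean` are hypothetical, and `get_pg_states` is an assumed caller-supplied function that returns the current PG state strings (for example, parsed from the cluster's PG status output). Treating any state that contains both `active` and `clean` as healthy is also an assumption.

```python
import time
from typing import Callable, Iterable


def all_active_clean(pg_states: Iterable[str]) -> bool:
    """Return True if there is at least one PG and every PG state
    contains both 'active' and 'clean' (e.g. 'active+clean+scrubbing')."""
    states = list(pg_states)
    return bool(states) and all(
        {"active", "clean"} <= set(state.split("+")) for state in states
    )


def wait_for_clean(
    get_pg_states: Callable[[], Iterable[str]],  # hypothetical state source
    timeout: float = 300.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until all PGs are active+clean and return the elapsed seconds.

    Raises TimeoutError once `timeout` seconds pass without recovery.
    `clock` and `sleep` are injectable so the loop can be tested offline.
    """
    start = clock()
    while True:
        if all_active_clean(get_pg_states()):
            return clock() - start
        if clock() - start >= timeout:
            raise TimeoutError(f"cluster not active+clean after {timeout}s")
        sleep(interval)
```

A test would restart or re-add an OSD through whatever mechanism the harness provides, then call `wait_for_clean` with a real state source and fail if `TimeoutError` is raised.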

Related issues 1 (0 open, 1 closed)

Blocked by Ceph - Feature #1212: teuthology: ability to restart daemons while other tasks are running (Resolved, Samuel Just, 06/21/2011)
