Feature #10369: qa-suite: detect unexpected MDS failovers and daemon crashes - CephFS - Ceph

closed

Added by John Spray over 9 years ago. Updated over 3 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Jos Collin

Category:

Testing

Target version:

Ceph - v15.0.0

% Done:

100%

Source:

Development

Tags:

Backport:

Reviewed:

Affected Versions:

Component(FS):

qa-suite

Labels (FS):

Pull request ID:

28378

Description

Currently some of our tests can be run with standby MDSs, and a failover event might occur without our tests noticing.

In the workunit-type tests, this is mainly something we want to ignore, because systems are often unpredictably slow, and as long as the filesystem satisfies the workunit we don't necessarily care if a failover happened.

In the new functional tests, we do want to know if a failover occurred unexpectedly, because some of the tests poke individual MDSs in quite specific ways, and if a failover happened then we should stop the test rather than proceed into some arcane unexpected failure mode. These tests also generally expose the system to less load, so the "it was just slow" failovers shouldn't often happen, and in any other unexpected failover case we would like to stop the test so we can see what happened and why.
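One way to picture the failover check is as a comparison between the rank-to-daemon assignment recorded at the start of a test and the assignment seen later. The sketch below is only an illustration: `UnexpectedFailover`, `check_no_failover`, and the plain `{rank: daemon_name}` dictionaries are hypothetical names and shapes, not the qa-suite's actual API or the real MDS map format.

```python
class UnexpectedFailover(Exception):
    """Hypothetical error raised when an active MDS rank changed hands."""


def check_no_failover(expected, current):
    """Compare two {rank: daemon_name} snapshots of the active MDS daemons.

    `expected` is the snapshot taken when the test started, `current` is a
    fresh one. Both are simplified stand-ins (an assumption of this sketch)
    for whatever the real MDS map exposes.
    """
    changed = {
        rank: (name, current.get(rank))
        for rank, name in expected.items()
        if current.get(rank) != name
    }
    if changed:
        details = ", ".join(
            "rank {0}: {1} -> {2}".format(rank, old, new)
            for rank, (old, new) in sorted(changed.items())
        )
        raise UnexpectedFailover("unexpected MDS failover (" + details + ")")
```

A test that deliberately triggers a failover would take a new snapshot afterwards, so that only failovers the test did not ask for are reported.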

An out-of-thread ticker might be challenging, as interrupting the main thread to inject a failure is not a normal thing to do. So maybe put the check in the wait_until* helpers and as a pre-check to some Filesystem methods, so that we will always detect reasonably early when something went weird. Once we have that checker method, we should also introduce a check that all the daemons are really running, so that if e.g. an OSD crashes unexpectedly, we detect it immediately rather than waiting a long time for a timeout of some kind to end the test.
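The idea of running the check inside a wait helper, rather than from a separate thread, could look roughly like the following. This is a minimal sketch under stated assumptions: the function signature, the `check` callback, and `DaemonCheckFailed` are all hypothetical and do not reproduce the real wait_until* helpers.

```python
import time


class DaemonCheckFailed(Exception):
    """Hypothetical error for a crashed daemon or unexpected failover."""


def wait_until_true(condition, timeout, check=None, period=1.0, sleep=time.sleep):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Before every poll the optional `check` callable runs; it is expected to
    raise (for example DaemonCheckFailed) if a daemon died or an MDS failed
    over, so the test stops right away instead of waiting out the timeout.
    """
    elapsed = 0.0
    while True:
        if check is not None:
            check()  # runs in the main thread, no interruption needed
        if condition():
            return
        if elapsed >= timeout:
            raise RuntimeError("timed out after {0} seconds".format(elapsed))
        sleep(period)
        elapsed += period
```

Because the check runs in the calling thread on each polling iteration, a failure surfaces as an ordinary exception at a well-defined point in the test, at the cost of being detected only as often as the helper polls.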

Subtasks 1 (0 open — 1 closed)