Feature #10369

qa-suite: detect unexpected MDS failovers and daemon crashes

Added by John Spray over 9 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Testing
Target version:
% Done: 100%
Source: Development
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS): qa-suite
Labels (FS): qa
Pull request ID:

Description

Currently some of our tests can be run with standby MDSs, and a failover event might occur without our tests noticing.

In the workunit-type tests, this is mainly something we want to ignore, because systems are often unpredictably slow, and as long as the filesystem satisfies the workunit we don't necessarily care if a failover happened.

In the new functional tests, we do want to know if a failover occurred unexpectedly, because some of the tests poke individual MDSs in quite specific ways; if a failover happens, we should stop the test rather than proceed into some arcane, unexpected failure mode. These tests also generally expose the system to less load, so the "it was just slow" failovers shouldn't happen often, and in any other unexpected failover case we would like to stop the test to see what happened and why.

An out-of-thread ticker might be challenging, as interrupting the main thread to inject a failure is not a normal thing to do. Instead, maybe put the check in the wait_until* helpers and as a pre-check in some Filesystem methods, so that we always detect reasonably early when something has gone wrong. Once we have that checker method, we should also introduce a check that all the daemons are really running, so that if e.g. an OSD crashes unexpectedly, we detect it immediately rather than waiting a long time for a timeout of some kind to end the test.
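
A rough sketch of the in-thread check proposed above, for illustration only. It is not the code that was eventually merged, and every name in it (ClusterHealthChecker, UnexpectedClusterChange, wait_until_true, and the two callables passed to the checker) is hypothetical. It assumes the test framework can supply one callable returning the names of the currently active MDS daemons and another returning a mapping of daemon name to "is running".

```python
import time


class UnexpectedClusterChange(Exception):
    """Raised when an MDS failover or daemon crash is seen mid-test."""


class ClusterHealthChecker:
    """Hypothetical checker: compares cluster state against a baseline."""

    def __init__(self, get_active_mds, get_daemon_states):
        # Both callables are assumed hooks into the test framework:
        #   get_active_mds()    -> iterable of active MDS names
        #   get_daemon_states() -> dict of daemon name -> bool (running?)
        self._get_active_mds = get_active_mds
        self._get_daemon_states = get_daemon_states
        self._expected_active = frozenset(get_active_mds())

    def expect_failover(self):
        """Re-baseline after a failover the test triggered on purpose."""
        self._expected_active = frozenset(self._get_active_mds())

    def check(self):
        """Raise if any daemon is down or the active MDS set changed."""
        dead = sorted(
            name
            for name, running in self._get_daemon_states().items()
            if not running
        )
        if dead:
            raise UnexpectedClusterChange(
                "daemons not running: " + ", ".join(dead)
            )
        active = frozenset(self._get_active_mds())
        if active != self._expected_active:
            raise UnexpectedClusterChange(
                "unexpected MDS failover: active set changed from "
                f"{sorted(self._expected_active)} to {sorted(active)}"
            )


def wait_until_true(condition, timeout, checker, period=1.0, sleep=time.sleep):
    """Poll condition(), running the health check before every poll.

    The check runs in the calling thread, so no background ticker has
    to interrupt the test to report a failure.
    """
    elapsed = 0.0
    while True:
        checker.check()
        if condition():
            return
        if elapsed >= timeout:
            raise RuntimeError(f"timed out after {elapsed} seconds")
        sleep(period)
        elapsed += period
```

A test that deliberately fails an MDS would call expect_failover() afterwards so that later checks compare against the new active set. The same check() call could serve as the pre-check at the top of the Filesystem methods mentioned above.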


Subtasks

Bug #41133: qa/tasks: update thrasher design Closed Jos Collin


Related issues

Related to CephFS - Bug #41398: qa: KeyError: 'cluster' in ceph.stop Resolved
Duplicated by CephFS - Bug #11314: qa: MDS crashed and the runs hung without ever timing out Duplicate
Duplicated by CephFS - Bug #12821: mds_thrasher: handle MDSes failing on startup Duplicate 08/28/2015

History

#1 Updated by John Spray over 9 years ago

  • Tracker changed from Fix to Feature

#2 Updated by Greg Farnum over 8 years ago

  • Priority changed from Normal to High

We just keep re-creating this feature: #12821

#3 Updated by John Spray over 8 years ago

  • Category set to Testing

#4 Updated by Patrick Donnelly almost 6 years ago

  • Assignee set to Patrick Donnelly
  • Target version set to v14.0.0
  • Source changed from other to Development
  • Component(FS) qa-suite added
  • Labels (FS) qa added

#5 Updated by Patrick Donnelly about 5 years ago

  • Target version changed from v14.0.0 to v15.0.0

#6 Updated by Patrick Donnelly almost 5 years ago

  • Assignee changed from Patrick Donnelly to Jos Collin
  • Start date deleted (12/18/2014)

#7 Updated by Patrick Donnelly almost 5 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 28378

#8 Updated by Patrick Donnelly almost 5 years ago

  • Related to deleted (Bug #11314: qa: MDS crashed and the runs hung without ever timing out)

#9 Updated by Patrick Donnelly almost 5 years ago

  • Duplicated by Bug #11314: qa: MDS crashed and the runs hung without ever timing out added

#10 Updated by Patrick Donnelly almost 5 years ago

  • Duplicated by Bug #12821: mds_thrasher: handle MDSes failing on startup added

#11 Updated by Patrick Donnelly over 4 years ago

  • Status changed from Fix Under Review to Resolved

#12 Updated by Patrick Donnelly over 4 years ago

  • Related to Bug #41398: qa: KeyError: 'cluster' in ceph.stop added
