Feature #10369

closed

qa-suite: detect unexpected MDS failovers and daemon crashes

Added by John Spray over 9 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Testing
Target version:
% Done: 100%
Source: Development
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS): qa-suite
Labels (FS): qa
Pull request ID:

Description

Currently some of our tests can be run with standby MDSs, and a failover event might occur without our tests noticing.

In the workunit-type tests, this is mainly something we want to ignore, because systems are often unpredictably slow, and as long as the filesystem satisfies the workunit we don't necessarily care if a failover happened.

In the new functional tests, we do want to know if a failover occurred unexpectedly, because some of the tests are poking individual MDSs in quite specific ways, and if a failover happened then we should stop the test rather than proceeding into some arcane unexpected failure mode. These tests also generally expose the system to less load, so the "it was just slow" failovers shouldn't often happen, and in any other unexpected failover case we would like to stop the test to see what happened and why.

An out-of-thread ticker might be challenging, as interrupting the main thread to inject a failure is not a normal thing to do. So maybe put the check in the wait_until* helpers and as a pre-check to some Filesystem methods, so that we will always detect reasonably early when something went weird. Once we have that checker method, we should also introduce a check that all the daemons are really running, so that if e.g. an OSD crashes unexpectedly, we detect it immediately rather than waiting a long time for a timeout of some kind to end the test.
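
A rough sketch of that idea, to make the shape of the proposal concrete. Everything here is hypothetical and is not the qa-suite's actual code: `ClusterChecker`, `UnexpectedClusterEvent`, `wait_until_checked`, and the two callables that report MDS ranks and daemon status are names invented for illustration. The checker runs in the calling thread on every polling iteration, so no out-of-thread ticker is needed.

```python
import time


class UnexpectedClusterEvent(RuntimeError):
    """The cluster changed in a way the test did not request."""


class ClusterChecker:
    """Hypothetical checker; not part of the actual qa-suite code."""

    def __init__(self, get_active_mds, get_daemon_status):
        # Assumed callables supplied by the test harness:
        #   get_active_mds()    -> dict mapping MDS rank to daemon name
        #   get_daemon_status() -> dict mapping daemon name to True if running
        self._get_active_mds = get_active_mds
        self._get_daemon_status = get_daemon_status
        # Snapshot the rank assignment the test starts with.
        self._expected_mds = dict(get_active_mds())

    def accept_current_mds(self):
        """Call after a deliberate failover to adopt the new assignment."""
        self._expected_mds = dict(self._get_active_mds())

    def check(self):
        """Raise if a failover happened or any daemon is not running."""
        current = dict(self._get_active_mds())
        if current != self._expected_mds:
            raise UnexpectedClusterEvent(
                "unexpected MDS failover: expected %r, found %r"
                % (self._expected_mds, current))
        dead = sorted(name for name, running
                      in self._get_daemon_status().items() if not running)
        if dead:
            raise UnexpectedClusterEvent(
                "daemons not running: %s" % ", ".join(dead))


def wait_until_checked(condition, timeout, checker, period=1.0,
                       sleep=time.sleep):
    """Poll condition(), running the checker before every poll.

    Returns the elapsed polling time in seconds, raises
    UnexpectedClusterEvent as soon as the checker complains, and raises
    TimeoutError if the condition is still false after `timeout` seconds.
    """
    elapsed = 0.0
    while True:
        checker.check()
        if condition():
            return elapsed
        if elapsed >= timeout:
            raise TimeoutError(
                "condition not met after %s seconds" % timeout)
        sleep(period)
        elapsed += period
```

The same `checker.check()` call could be placed at the top of selected Filesystem methods as the pre-check. A test that fails an MDS on purpose would call `accept_current_mds()` afterwards, so only failovers it did not ask for stop the run.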


Subtasks 1 (0 open, 1 closed)

Bug #41133: qa/tasks: update thrasher design (Closed, Jos Collin)

Related issues 3 (0 open, 3 closed)

Related to CephFS - Bug #41398: qa: KeyError: 'cluster' in ceph.stop (Resolved, Patrick Donnelly)
Has duplicate CephFS - Bug #11314: qa: MDS crashed and the runs hung without ever timing out (Duplicate)
Has duplicate CephFS - Bug #12821: mds_thrasher: handle MDSes failing on startup (Duplicate, 08/28/2015)

Actions #1

Updated by John Spray over 9 years ago

  • Tracker changed from Fix to Feature
Actions #2

Updated by Greg Farnum over 8 years ago

  • Priority changed from Normal to High

We just keep re-creating this feature: #12821

Actions #3

Updated by John Spray over 8 years ago

  • Category set to Testing
Actions #4

Updated by Patrick Donnelly about 6 years ago

  • Assignee set to Patrick Donnelly
  • Target version set to v14.0.0
  • Source changed from other to Development
  • Component(FS) qa-suite added
  • Labels (FS) qa added
Actions #5

Updated by Patrick Donnelly about 5 years ago

  • Target version changed from v14.0.0 to v15.0.0
Actions #6

Updated by Patrick Donnelly almost 5 years ago

  • Assignee changed from Patrick Donnelly to Jos Collin
  • Start date deleted (12/18/2014)
Actions #7

Updated by Patrick Donnelly almost 5 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 28378
Actions #8

Updated by Patrick Donnelly almost 5 years ago

  • Related to deleted (Bug #11314: qa: MDS crashed and the runs hung without ever timing out)
Actions #9

Updated by Patrick Donnelly almost 5 years ago

  • Has duplicate Bug #11314: qa: MDS crashed and the runs hung without ever timing out added
Actions #10

Updated by Patrick Donnelly almost 5 years ago

  • Has duplicate Bug #12821: mds_thrasher: handle MDSes failing on startup added
Actions #11

Updated by Patrick Donnelly over 4 years ago

  • Status changed from Fix Under Review to Resolved
Actions #12

Updated by Patrick Donnelly over 4 years ago

  • Related to Bug #41398: qa: KeyError: 'cluster' in ceph.stop added