qa: MDS crashed and the runs hung without ever timing out
The MDS crashed in many of these, but for some reason the tasks didn't notice and kept running.
Anyway, we need to make the tasks notice in cases like this so that the failures don't lock up the whole lab overnight and prevent any tests from running. :(
#1 Updated by John Spray about 4 years ago
This was mentioned in #10369:
Once we have that checker method, we should also introduce a check that all the daemons are really running, so that if e.g. an OSD crashes unexpectedly, we detect it immediately rather than waiting a long time for a timeout of some kind to end the test.
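For illustration, the kind of liveness check described above might look roughly like the sketch below. This is not teuthology's actual API: `DaemonCrashed`, `check_daemons`, and `run_with_watchdog` are hypothetical names, and the daemons and the task are assumed to be plain local `subprocess.Popen` handles rather than remote processes.

```python
import subprocess
import sys
import time


class DaemonCrashed(Exception):
    """Raised when a watched daemon process has exited unexpectedly."""


def check_daemons(daemons):
    """Raise DaemonCrashed if any process in `daemons` has exited.

    `daemons` maps a name such as "mds.a" to a subprocess.Popen handle
    (an assumption made for this sketch).
    """
    for name, proc in daemons.items():
        code = proc.poll()  # None while the process is still running
        if code is not None:
            raise DaemonCrashed(f"{name} exited with status {code}")


def run_with_watchdog(task, daemons, interval=1.0, deadline=3600.0):
    """Wait for `task`, failing fast on a daemon crash or on the deadline."""
    start = time.monotonic()
    try:
        while task.poll() is None:
            check_daemons(daemons)
            if time.monotonic() - start > deadline:
                raise TimeoutError(f"task still running after {deadline}s")
            time.sleep(interval)
    except (DaemonCrashed, TimeoutError):
        # Stop the task so it cannot hang the run, then report the failure.
        task.kill()
        task.wait()
        raise
    return task.returncode
```

With something like this in the polling loop, a crashed MDS or OSD would fail the run within one `interval`, and the hard `deadline` would cover hangs that the daemon check cannot see.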
#2 Updated by Greg Farnum about 3 years ago
- Subject changed from teuthology: MDS crashed and the runs hung without ever timing out to qa: MDS crashed and the runs hung without ever timing out
- Priority changed from High to Normal
We clearly aren't treating this as very important, and I think we've had more trouble with OSDs doing this than MDSes lately, so whatever.
#7 Updated by Patrick Donnelly about 1 month ago
- Status changed from In Progress to New
- Assignee changed from Patrick Donnelly to Jos Collin
- Priority changed from Normal to High
- Target version deleted (
- Start date deleted (
- Tags deleted (
- Backport deleted (
- Labels (FS) qa added