Bug #49371

open

Misleading alarm if all MDS daemons have failed

Added by David Piper about 3 years ago. Updated almost 2 years ago.

Status:
Triaged
Priority:
High
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
pacific,octopus
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDSMonitor
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seen on ceph v14.2.9 in a containerised cluster with 3 MDS nodes

Both standby MGR containers are manually stopped. ceph reports a sensible alarm:

With only 1 MDS remaining we have an alarm on ceph health:

health: HEALTH_WARN
insufficient standby MDS daemons available

Then I manually stop the final, active MDS daemon.

Expected:

`ceph health` should report an alarm that there are no active MDS daemons and all filesystems are degraded / inactive.

Actual:

`ceph health` continues to report "insufficient standby MDS daemons available". There are no new alarms about the total lack of active MDS daemons.

health: HEALTH_WARN
insufficient standby MDS daemons available

ceph status shows:

mds: cephfs:1 {0=albamons_sc2=up:active(laggy or crashed)}
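
Until the health check covers this case, the "laggy or crashed" marker in the status line is the only visible sign of the failure. The following is a minimal sketch of a check that looks for that marker. It assumes the `mds:` line has the exact shape shown above (as printed by v14.2.9; other releases may format it differently), and `laggy_ranks` is a hypothetical helper name, not part of Ceph.

```python
import re

# Matches entries such as "0=albamons_sc2=up:active(laggy or crashed)".
# Assumption: rank=name=state, optionally followed by the laggy marker.
_RANK_RE = re.compile(r"(\d+)=([^=\s{}]+)=([a-z:_-]+)(\(laggy or crashed\))?")


def laggy_ranks(mds_line):
    """Return (rank, daemon name, state) for each rank flagged laggy or crashed."""
    return [
        (int(rank), name, state)
        for rank, name, state, flag in _RANK_RE.findall(mds_line)
        if flag
    ]


line = "mds: cephfs:1 {0=albamons_sc2=up:active(laggy or crashed)}"
print(laggy_ranks(line))  # [(0, 'albamons_sc2', 'up:active')]
```

An empty result means no rank in the line carries the marker. It does not by itself prove the filesystem is healthy.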

If I then stop the active (and only remaining) MGR, an alarm is reported on ceph health:

health: HEALTH_WARN
no active mgr


Related issues 1 (0 open, 1 closed)

Has duplicate Ceph - Bug #49370: No alarm if all standby MDSs have failed (Duplicate)

Actions #1

Updated by David Piper about 3 years ago

Sorry - please ignore the references to MGR in the description. The issue here is just with alarms about MDS when all MDS daemons are inactive.

Actions #2

Updated by Sebastian Wagner about 3 years ago

  • Has duplicate Bug #49370: No alarm if all standby MDSs have failed added
Actions #3

Updated by Sebastian Wagner about 3 years ago

  • Project changed from Ceph to CephFS
Actions #4

Updated by Patrick Donnelly about 3 years ago

  • Status changed from New to Triaged
  • Assignee set to Patrick Donnelly
  • Priority changed from Normal to High
  • Target version set to v17.0.0
  • Source set to Community (user)
  • Backport set to pacific,octopus
  • Component(FS) MDSMonitor added

Thanks for the report. That is indeed confusing. I think we will change it so laggy/dead daemons are still removed by the mons. That would generate the appropriate health warning.

Actions #5

Updated by Patrick Donnelly almost 2 years ago

  • Target version deleted (v17.0.0)