Bug #49371

Misleading alarm if all MDS daemons have failed

Added by David Piper 2 months ago. Updated about 2 months ago.

Status: Triaged
Priority: High
Category: -
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport: pacific,octopus
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDSMonitor
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seen on Ceph v14.2.9 in a containerised cluster with 3 MDS nodes.

Both standby MDS containers are manually stopped. With only 1 MDS remaining, `ceph health` reports a sensible alarm:

health: HEALTH_WARN
insufficient standby MDS daemons available

Then I manually stop the final, active MDS daemon.

Expected:

`ceph health` should report an alarm that there are no active MDS daemons and all filesystems are degraded / inactive.

Actual:

`ceph health` continues to report "insufficient standby". There are no new alarms about the total lack of active MDS daemons.

health: HEALTH_WARN
insufficient standby MDS daemons available

ceph status shows:

mds: cephfs:1 {0=albamons_sc2=up:active(laggy or crashed)}
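
One way to catch this condition from monitoring scripts until the health check is fixed is to look for the "(laggy or crashed)" marker in that summary line. The following is a minimal sketch only: the function name `laggy_ranks` and the regular expression are hypothetical, and the line layout is assumed from the single example above rather than from any documented format.

```python
import re

# Assumed layout of one entry: rank=name=state, optionally followed by
# "(laggy or crashed)". Derived from the example line in this report only.
_ENTRY = re.compile(
    r"(?P<rank>\d+)=(?P<name>[^=,{}]+)=(?P<state>[a-z:_-]+)"
    r"(?P<laggy>\(laggy or crashed\))?"
)

def laggy_ranks(mds_line):
    """Return (rank, daemon name, state) for every entry flagged laggy or crashed."""
    return [
        (int(m.group("rank")), m.group("name"), m.group("state"))
        for m in _ENTRY.finditer(mds_line)
        if m.group("laggy")
    ]

line = "mds: cephfs:1 {0=albamons_sc2=up:active(laggy or crashed)}"
for rank, name, state in laggy_ranks(line):
    print(f"rank {rank} ({name}) reports {state} but is laggy or crashed")
```

A rank that still claims `up:active` while carrying the marker is exactly the case where the health output above stays silent.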

If I then stop the active (and only remaining) MGR, `ceph health` does report an alarm:

health: HEALTH_WARN
no active mgr


Related issues

Duplicated by Ceph - Bug #49370: No alarm if all standby MDSs have failed (Status: Duplicate)

History

#1 Updated by David Piper 2 months ago

Sorry - please ignore the references to MGR in the description. The issue here is just with alarms about MDS when all MDS daemons are inactive.

#2 Updated by Sebastian Wagner about 2 months ago

  • Duplicated by Bug #49370: No alarm if all standby MDSs have failed added

#3 Updated by Sebastian Wagner about 2 months ago

  • Project changed from Ceph to CephFS

#4 Updated by Patrick Donnelly about 2 months ago

  • Status changed from New to Triaged
  • Assignee set to Patrick Donnelly
  • Priority changed from Normal to High
  • Target version set to v17.0.0
  • Source set to Community (user)
  • Backport set to pacific,octopus
  • Component(FS) MDSMonitor added

Thanks for the report. That is indeed confusing. I think we will change it so laggy/dead daemons are still removed by the mons. That would generate the appropriate health warning.
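
To illustrate why removing laggy/dead daemons would surface a warning, here is a toy model of the idea. It is not the MDSMonitor code: the `Mds` class, the `health` function, the `drop_laggy` flag, and the "no active MDS daemons" wording are all hypothetical, and whether the standby warning would still appear after the change is an assumption of this sketch.

```python
from dataclasses import dataclass

@dataclass
class Mds:
    name: str
    state: str          # "active" or "standby" (simplified for this sketch)
    laggy: bool = False

def health(daemons, standby_wanted=1, drop_laggy=False):
    """Toy health evaluation; hypothetical, not Ceph's actual logic."""
    if drop_laggy:
        # Proposed behaviour: laggy/dead daemons are removed before checks run.
        daemons = [d for d in daemons if not d.laggy]
    warnings = []
    if not any(d.state == "active" for d in daemons):
        warnings.append("no active MDS daemons (hypothetical wording)")
    standbys = sum(d.state == "standby" for d in daemons)
    if standbys < standby_wanted:
        warnings.append("insufficient standby MDS daemons available")
    return warnings

only_mds = [Mds("albamons_sc2", "active", laggy=True)]
print(health(only_mds))                   # laggy daemon still counts as active
print(health(only_mds, drop_laggy=True))  # laggy daemon removed first
```

In this model the stopped daemon masks the outage for as long as it remains in the map; once it is dropped, the missing active daemon becomes visible to the check.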
