Bug #49371: Misleading alarm if all MDS daemons have failed - CephFS - Ceph

Bug #49371

open

Misleading alarm if all MDS daemons have failed

Added by David Piper about 3 years ago. Updated almost 2 years ago.

Status: Triaged
Priority: High
Assignee: Patrick Donnelly
Source: Community (user)
Backport: pacific,octopus
Severity: 2 - major
Component(FS): MDSMonitor

Description

Seen on Ceph v14.2.9 in a containerised cluster with 3 MDS nodes.

Both standby MGR containers are manually stopped. With only 1 MDS remaining, `ceph health` reports a sensible alarm:

health: HEALTH_WARN
            insufficient standby MDS daemons available

Then I manually stop the final, active MDS daemon.

Expected:

`ceph health` should report an alarm that there are no active MDS daemons and all filesystems are degraded / inactive.

Actual:

`ceph health` continues to report "insufficient standby". There are no new alarms about the total lack of active MDS daemons.

health: HEALTH_WARN
            insufficient standby MDS daemons available

ceph status shows:

mds: cephfs:1 {0=albamons_sc2=up:active(laggy or crashed)}
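Until the health output covers this case, the `(laggy or crashed)` marker in the status line above is the only visible hint. A minimal sketch of how a monitoring script might pick it out, assuming the plain-text `mds:` line has the shape shown in this report (the helper name `laggy_ranks` and the regular expression are illustrative, not part of Ceph):

```python
import re

# One "rank=name=state" entry from the mds summary line, optionally followed
# by the "(laggy or crashed)" marker. The pattern is an assumption based on
# the single sample line in this report.
_ENTRY = re.compile(r"(\d+)=([^=\s{},]+)=([a-z]+:[a-z_-]+)(\(laggy or crashed\))?")


def laggy_ranks(mds_line):
    """Return (rank, name, state) for every entry flagged laggy or crashed."""
    return [
        (int(rank), name, state)
        for rank, name, state, laggy in _ENTRY.findall(mds_line)
        if laggy
    ]


line = "mds: cephfs:1 {0=albamons_sc2=up:active(laggy or crashed)}"
print(laggy_ranks(line))
```

A non-empty result here, combined with no other `up:active` entries, would correspond to the "no active MDS" situation that `ceph health` currently does not flag.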

If I then stop the active (and only remaining) MGR, `ceph health` does report an alarm:

health: HEALTH_WARN
            no active mgr

Related issues 1 (0 open — 1 closed)

Updated by David Piper about 3 years ago

Sorry - please ignore the references to MGR in the description. The issue here is just with alarms about MDS when all MDS daemons are inactive.

Updated by Sebastian Wagner about 3 years ago

Has duplicate Bug #49370: No alarm if all standby MDSs have failed added

Updated by Sebastian Wagner about 3 years ago

Project changed from Ceph to CephFS

Updated by Patrick Donnelly about 3 years ago

Status changed from New to Triaged
Assignee set to Patrick Donnelly
Priority changed from Normal to High
Target version set to v17.0.0
Source set to Community (user)
Backport set to pacific,octopus
Component(FS) MDSMonitor added

Thanks for the report. That is indeed confusing. I think we will change it so laggy/dead daemons are still removed by the mons. That would generate the appropriate health warning.
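The effect of that proposed change can be illustrated with a toy model. This is a simplified sketch, not the MDSMonitor code: the `Daemon` class, the `health_checks` function and its `drop_laggy` switch are hypothetical, and only the check names `MDS_ALL_DOWN` and `MDS_INSUFFICIENT_STANDBY` are borrowed from Ceph's health check codes.

```python
from dataclasses import dataclass


@dataclass
class Daemon:
    name: str
    state: str          # e.g. "up:active" or "up:standby"
    laggy: bool = False


def health_checks(daemons, wanted_standbys=1, drop_laggy=False):
    """Toy health evaluation for a single-rank filesystem (illustrative only)."""
    if drop_laggy:
        # Proposed behaviour: laggy/dead daemons are removed from the map.
        daemons = [d for d in daemons if not d.laggy]
    actives = [d for d in daemons if d.state == "up:active"]
    standbys = [d for d in daemons if d.state == "up:standby"]
    checks = []
    if not actives:
        checks.append("MDS_ALL_DOWN")
    if len(standbys) < wanted_standbys:
        checks.append("MDS_INSUFFICIENT_STANDBY")
    return checks


# The state from the report: one active daemon that is laggy or crashed.
cluster = [Daemon("albamons_sc2", "up:active", laggy=True)]
print(health_checks(cluster))                   # laggy daemon still counted as active
print(health_checks(cluster, drop_laggy=True))  # laggy daemon removed first
```

In this model the laggy daemon keeps occupying the active slot, so only the standby warning fires; once it is dropped, the missing active daemon becomes visible as its own check.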

Updated by Patrick Donnelly almost 2 years ago

Target version deleted (v17.0.0)
