Feature #20611

MDSMonitor: do not show cluster health warnings for file system intentionally marked down

Added by Patrick Donnelly over 1 year ago. Updated 8 months ago.

Status: New
Priority: Normal
Assignee: -
Category: Administration/Usability
Target version:
Start date: 07/12/2017
Due date:
% Done: 0%
Source: Development
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS): MDSMonitor
Labels (FS): multifs
Pull request ID:

Description

Here's what you see currently:

ceph fs set cephfs_a cluster_down 1
ceph mds fail 1:1 # rank 1 of 2
ceph mds fail 1:0 # rank 0 of 2
ceph status
  cluster:
    id:     4ef94796-a652-4e0f-ad4e-8f3aaa9b9d18
    health: HEALTH_ERR
            mds ranks 0,1 have failed
            mds cluster is degraded

  services:
    mon: 3 daemons, quorum a,b,c
    mgr: x(active)
    mds: 0/2/2 up, 2 up:standby, 2 failed
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   2 pools, 16 pgs
    objects: 39 objects, 3558 bytes
    usage:   3265 MB used, 27646 MB / 30911 MB avail
    pgs:     16 active+clean

Related issues

Blocks: fs - Feature #22477: multifs: remove experimental warnings (New, 12/19/2017)

History

#1 Updated by Douglas Fuller over 1 year ago

Taking an MDS down for hardware maintenance, etc., should trigger a health warning, because such actions degrade the MDS cluster even when they are intentional.

I think we should show a warning here unless the user's clear intention was to permanently shrink the MDS cluster or remove the filesystem entirely. Specifically, we should show:

  • HEALTH_WARN if there are fewer MDSs active than max_mds for a filesystem
  • HEALTH_ERR if there are no MDSs online for a filesystem

Maybe we could add some detail to the HEALTH_WARN telling the user what to do to remove the warning (decrease max_mds or delete the filesystem).
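For illustration, both of those remedies map onto existing CLI commands. This is only a sketch: it reuses the file system name from the description, and the max_mds value is arbitrary.

ceph fs set cephfs_a max_mds 1               # shrink the MDS cluster so only one active rank is expected
ceph fs rm cephfs_a --yes-i-really-mean-it   # or remove the file system entirely (destructive; the fs must already be marked down, as in the description)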

#2 Updated by Patrick Donnelly over 1 year ago

Douglas Fuller wrote:

Taking an MDS down for hardware maintenance, etc., should trigger a health warning, because such actions degrade the MDS cluster even when they are intentional.

One ERR message saying the file system is offline should be sufficient. The message should make clear which file system(s) are offline (rather than referring to the MDS cluster).

#3 Updated by Patrick Donnelly 8 months ago

  • Category changed from multi-MDS to Administration/Usability
  • Target version set to v13.0.0
  • Release deleted (master)

#4 Updated by Douglas Fuller 8 months ago

See https://github.com/ceph/ceph/pull/16608, which implements the opposite of this behavior. Whenever a filesystem is marked down, data is inaccessible. That should be HEALTH_ERR, even if intentional.
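For context, the flag in question is the cluster_down flag shown in the description. A minimal sketch of taking the file system down and bringing it back, using the same file system name as above:

ceph fs set cephfs_a cluster_down 1   # flag the file system as intentionally down; standbys will not take over
ceph mds fail 1:0                     # then fail the active ranks (as in the description), making the data inaccessible
ceph fs set cephfs_a cluster_down 0   # clear the flag so MDSs can be assigned and the data becomes reachable again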

#5 Updated by Douglas Fuller 8 months ago

  • Status changed from New to Need Review

#6 Updated by Patrick Donnelly 8 months ago

  • Status changed from Need Review to New
  • Assignee deleted (Douglas Fuller)
  • Priority changed from High to Normal
  • Target version changed from v13.0.0 to v14.0.0
  • Parent task deleted (#20606)
  • Labels (FS) multifs added

Doug, I was just thinking about this, and a valid reason not to want a HEALTH_ERR is if you have dozens or hundreds of Ceph file systems, one for each "tenant"/user/use-case/application/whatever, but you only activate them (i.e. assign MDSs) when the corresponding application is online.

This seemed to be a direction Rook wanted to go [1], but they will not proceed with it because multifs is not yet stable.

I propose we retarget this to 14.0.0 and add it to multifs.

[1] https://github.com/rook/rook/issues/1027
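
For what it's worth, a rough sketch of that per-tenant pattern with the current CLI. The tenant and pool names here are made up, and multiple file systems still require the experimental flag that #22477 is about removing:

ceph fs flag set enable_multiple true --yes-i-really-mean-it   # allow more than one file system (still experimental)
ceph osd pool create tenant_a_meta 8
ceph osd pool create tenant_a_data 8
ceph fs new tenant_a tenant_a_meta tenant_a_data               # one file system per tenant/application
ceph fs set tenant_a cluster_down 1                            # park it while the tenant's application is offline
# later, when the application comes online:
ceph fs set tenant_a cluster_down 0                            # let MDSs pick up the ranks again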

#7 Updated by Patrick Donnelly 6 months ago
