Feature #20611


MDSMonitor: do not show cluster health warnings for file system intentionally marked down

Added by Patrick Donnelly almost 7 years ago. Updated over 5 years ago.

Status: Resolved
Priority: High
Category: Administration/Usability
Target version:
% Done: 0%
Source: Development
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS): MDSMonitor
Labels (FS): multifs
Pull request ID:

Description

Here's what you see currently:

$ ceph fs set a down true
a marked down. 
$ ceph status
  cluster:
    id:     e3d43918-f643-442b-bacc-5a1c1d9a8a7a
    health: HEALTH_ERR
            1 filesystem is offline

  services:
    mon: 3 daemons, quorum a,b,c (age 100s)
    mgr: x(active, since 96s)
    mds: a-0/0/0 up , 3 up:standby
    osd: 3 osds: 3 up (since 64s), 3 in (since 64s)

  data:
    pools:   2 pools, 16 pgs
    objects: 22 objects, 2.2 KiB
    usage:   3.2 GiB used, 27 GiB / 30 GiB avail
    pgs:     16 active+clean
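
The same information can be pulled up with the standard health and file system commands; a minimal illustration (output omitted rather than guessed, since the exact wording varies by release):

$ ceph health detail   # lists each raised health check, including the offline-filesystem one
$ ceph fs dump         # shows the file system's state and flags, e.g. that it was marked down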

Related issues: 1 (0 open, 1 closed)

Blocks CephFS - Feature #22477: multifs: remove multifs experimental warnings (Resolved, Patrick Donnelly)

Actions #1

Updated by Douglas Fuller almost 7 years ago

Taking an MDS down for hardware maintenance, etc., should trigger a health warning, because such actions degrade the MDS cluster even if done intentionally.

I think we should show a warning here unless the user's clear intention was to permanently shrink the MDS cluster or remove the filesystem entirely. I think we should show:

  • HEALTH_WARN if there are fewer MDSs active than max_mds for a filesystem
  • HEALTH_ERR if there are no MDSs online for a filesystem

Maybe we could add some detail to the HEALTH_WARN telling the user what to do to remove the warning (decrease max_mds or delete the filesystem).
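
As a concrete illustration of the two remediations mentioned above, both are ordinary CLI operations (the file system name "a" is taken from the description; output omitted):

$ ceph fs set a max_mds 1                # permanently shrink the intended size of the MDS cluster for "a"
$ ceph fs rm a --yes-i-really-mean-it    # or remove the file system entirely (it has to be taken down first)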

Actions #2

Updated by Patrick Donnelly almost 7 years ago

Douglas Fuller wrote:

Taking an MDS down for hardware maintenance, etc., should trigger a health warning, because such actions degrade the MDS cluster even if done intentionally.

One ERR message saying the file system is offline should be sufficient. The message should make clear which file system(s) are offline (rather than referring to the MDS cluster).

Actions #3

Updated by Patrick Donnelly about 6 years ago

  • Category changed from 90 to Administration/Usability
  • Target version set to v13.0.0
  • Release deleted (master)
Actions #4

Updated by Douglas Fuller about 6 years ago

See https://github.com/ceph/ceph/pull/16608, which implements the opposite of this behavior. Whenever a filesystem is marked down, data is inaccessible. That should be HEALTH_ERR, even if intentional.

Actions #5

Updated by Douglas Fuller about 6 years ago

  • Status changed from New to Fix Under Review
Actions #6

Updated by Patrick Donnelly about 6 years ago

  • Status changed from Fix Under Review to New
  • Assignee deleted (Douglas Fuller)
  • Priority changed from High to Normal
  • Target version changed from v13.0.0 to v14.0.0
  • Parent task deleted (#20606)
  • Labels (FS) multifs added

Doug, I was just thinking about this, and a valid reason to not want a HEALTH_ERR is if you have dozens or hundreds of Ceph file systems, one for each "tenant"/user/use-case/application/whatever, but only activate them (i.e. assign MDSs) when the corresponding application is online.

This seemed to be a direction Rook wanted to go in [1], but Rook will not proceed with it because multifs is not yet stable.

I propose we retarget this to 14.0.0 and add it to multifs.

[1] https://github.com/rook/rook/issues/1027
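
A sketch of that per-tenant workflow, reusing the command from the description (the tenant file system name is hypothetical; output omitted):

$ ceph fs set tenant1 down true     # tenant1's application goes offline: park its file system
$ ceph fs set tenant1 down false    # application returns: allow MDSs to be assigned again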

Actions #7

Updated by Patrick Donnelly almost 6 years ago

  • Blocks Feature #22477: multifs: remove multifs experimental warnings added
Actions #8

Updated by Patrick Donnelly over 5 years ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to Patrick Donnelly
  • Priority changed from Normal to High

Suggest we silence the health warning only when the cluster is marked down (not failed).
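
In CLI terms, the proposed distinction is roughly the following (the split in health reporting is what this ticket proposes, not necessarily what a given release does):

$ ceph fs set a down true    # file system deliberately marked down: proposed to raise no health warning
$ ceph fs fail a             # file system failed: health warning/error still reported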

Actions #9

Updated by Patrick Donnelly over 5 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 26012
Actions #10

Updated by Patrick Donnelly over 5 years ago

  • Status changed from Fix Under Review to Resolved