Feature #20606

closed

mds: improve usability of cluster rank manipulation and setting cluster up/down

Added by Patrick Donnelly almost 7 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Category: Administration/Usability
Target version:
% Done: 100%
Source: Development
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS): MDSMonitor
Labels (FS): multimds
Pull request ID:

Description

Right now the procedure for bringing down a cluster is:

ceph fs set cephfs_a cluster_down 1
ceph mds fail 1:1 # rank 1 of 2
ceph mds fail 1:0 # rank 0 of 2
ceph status
  cluster:
    id:     4ef94796-a652-4e0f-ad4e-8f3aaa9b9d18
    health: HEALTH_ERR
            mds ranks 0,1 have failed
            mds cluster is degraded

  services:
    mon: 3 daemons, quorum a,b,c
    mgr: x(active)
    mds: 0/2/2 up, 2 up:standby, 2 failed
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   2 pools, 16 pgs
    objects: 39 objects, 3558 bytes
    usage:   3265 MB used, 27646 MB / 30911 MB avail
    pgs:     16 active+clean

This leaves the journal unflushed and client sessions half-open. In addition, `ceph status` then shows alarming notices about "failed" MDSs along with unhelpful health warnings.
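
For reference, the manual sequence above generalizes to more than two ranks roughly as follows. This is only a sketch that prints the commands rather than running them. The helper name `print_manual_teardown` is hypothetical, and it assumes the number before the colon in `1:1` identifies the file system and that ranks are failed from highest to lowest, as in the example.

```shell
# Print (do not run) the manual teardown commands from the description.
# Hypothetical helper; arguments:
#   $1  file system name, e.g. cephfs_a
#   $2  number before the colon in "1:1" (assumed to identify the file system)
#   $3  number of ranks, e.g. 2
print_manual_teardown() {
    fs_name=$1
    fs_id=$2
    num_ranks=$3

    # Step 1: set the cluster_down flag on the file system.
    echo "ceph fs set $fs_name cluster_down 1"

    # Step 2: fail each rank, highest first, matching the example order.
    rank=$((num_ranks - 1))
    while [ "$rank" -ge 0 ]; do
        echo "ceph mds fail $fs_id:$rank # rank $rank of $num_ranks"
        rank=$((rank - 1))
    done
}
```

Calling `print_manual_teardown cephfs_a 1 2` prints the three commands shown in the description. The number of steps grows with the number of ranks, which is part of the usability problem.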

I recommend several changes, outlined in this issue's subtasks.


Subtasks: 5 (0 open, 5 closed)

Feature #20607: MDSMonitor: change "mds deactivate" to clearer "mds rejoin" | Rejected | Douglas Fuller | 07/12/2017

Feature #20608: MDSMonitor: rename `ceph fs set <fs_name> cluster_down` to `ceph fs set <fs_name> joinable` | Resolved | Douglas Fuller | 07/12/2017

Feature #20609: MDSMonitor: add new command `ceph fs set <fs_name> down` to bring the cluster down | Resolved | Douglas Fuller | 07/12/2017

Feature #20610: MDSMonitor: add new command to shrink the cluster in an automated way | Resolved | Douglas Fuller | 07/12/2017

Subtask #20864: kill allow_multimds | Resolved | Douglas Fuller | 07/31/2017
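
To illustrate where subtasks #20608 and #20609 point, the teardown could shrink to a single command per file system. The sketch below prints that command rather than running it. The helper names are hypothetical, and the trailing `true`/`false` values are an assumption: the subtask titles give only `ceph fs set <fs_name> down` and `ceph fs set <fs_name> joinable`.

```shell
# Print (do not run) the single-command teardown proposed in #20609.
# Hypothetical helper; the "true" value is an assumption, since the
# subtask title only names `ceph fs set <fs_name> down`.
print_proposed_teardown() {
    fs_name=$1
    echo "ceph fs set $fs_name down true"
}

# Print (do not run) the renamed flag from #20608.
# $2 is the assumed flag value (true or false); not stated in the subtask.
print_set_joinable() {
    fs_name=$1
    value=$2
    echo "ceph fs set $fs_name joinable $value"
}
```

For example, `print_proposed_teardown cephfs_a` prints one command in place of the per-rank `ceph mds fail` calls in the description.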

Actions #1

Updated by Patrick Donnelly almost 7 years ago

  • Description updated
Actions #2

Updated by Patrick Donnelly almost 7 years ago

  • Subject changed from "mds: allow cluster to be shut down gently and without warnings/errors" to "mds: improve usability of cluster rank manipulation and setting cluster up/down"
  • Release set to master
Actions #3

Updated by John Spray almost 7 years ago

My thoughts on this:

  • maybe we should prefix this class of command (manipulating the MDS ranks) with "cluster", so we'd have commands like "ceph fs cluster down", "ceph fs cluster set size", etc?
  • the 'deactivate' stuff is probably clearer if we re-cast it as operating on an FS rank rather than a daemon. So really we're saying "tear down this rank, whichever MDS daemon is holding it", rather than "MDS daemon xyz, please tear down the rank you hold". That might avoid the awkward naming of 'deactivate'.
  • I'm a bit fuzzy on the stuff here about bringing the cluster down, can't tell if it's about shrinking the cluster, or cleanly stopping daemons (to start them again later)?
Actions #4

Updated by Patrick Donnelly almost 7 years ago

John Spray wrote:

My thoughts on this:

  • maybe we should prefix this class of command (manipulating the MDS ranks) with "cluster", so we'd have commands like "ceph fs cluster down", "ceph fs cluster set size", etc?

I like it!

  • the 'deactivate' stuff is probably clearer if we re-cast it as operating on an FS rank rather than a daemon. So really we're saying "tear down this rank, whichever MDS daemon is holding it", rather than "MDS daemon xyz, please tear down the rank you hold". That might avoid the awkward naming of 'deactivate'.

I was also thinking similarly: let's move `ceph mds` commands that operate on ranks to `ceph fs`.

  • I'm a bit fuzzy on the stuff here about bringing the cluster down, can't tell if it's about shrinking the cluster, or cleanly stopping daemons (to start them again later)?

I don't think I get your question. Can you rephrase?

Actions #5

Updated by John Spray almost 7 years ago

The last point about cluster down: looking at http://tracker.ceph.com/issues/20609, I'm not sure what the higher level goal is. If we wanted to free up daemons to do other work (while making this filesystem inaccessible), then that's what the existing "cluster down" does. If we wanted to deactivate ranks, then I'm not sure why we'd ever want to deactivate the last one.

Actions #6

Updated by Patrick Donnelly over 6 years ago

John Spray wrote:

The last point about cluster down: looking at http://tracker.ceph.com/issues/20609, I'm not sure what the higher level goal is. If we wanted to free up daemons to do other work (while making this filesystem inaccessible), then that's what the existing "cluster down" does. If we wanted to deactivate ranks, then I'm not sure why we'd ever want to deactivate the last one.

The idea is to provide a mechanism for cleanly bringing the cluster down. Admittedly this is not something we expect people to be doing except in extraordinary cases or in testing. However, I thought we had an opportunity to improve this while thinking about the related issues.

Actions #7

Updated by Patrick Donnelly over 6 years ago

  • Target version set to v13.0.0
Actions #8

Updated by Douglas Fuller about 6 years ago

  • Status changed from New to Fix Under Review
Actions #9

Updated by Patrick Donnelly about 6 years ago

  • Category changed from 90 to Administration/Usability
  • Status changed from Fix Under Review to Resolved
  • Labels (FS) multimds added