Feature #61866: MDSMonitor: require --yes-i-really-mean-it when failing an MDS with MDS_HEALTH_TRIM or MDS_HEALTH_CACHE_OVERSIZED health warnings - CephFS - Ceph

Feature #61866

open

MDSMonitor: require --yes-i-really-mean-it when failing an MDS with MDS_HEALTH_TRIM or MDS_HEALTH_CACHE_OVERSIZED health warnings

Added by Patrick Donnelly 11 months ago. Updated 5 days ago.

Status: Pending Backport
Priority: Immediate
Assignee: Rishabh Dave
Category: Administration/Usability
Target version: Ceph - v19.0.0
% Done:
Source: Development
Tags: backport_processed
Backport: reef,quincy
Reviewed:
Affected Versions:
Component(FS): MDSMonitor
Labels (FS):
Pull request ID: 56066

Description

If an MDS is already falling behind on trimming its journal or has an oversized cache, restarting it may only create new problems in the form of very slow recovery. In particular, if the MDS falls far behind on journal trimming, with 1M or more segments, replay can take hours or longer.

We already track these warnings in MDSMonitor, so do a simple check to help operators and support folks avoid shooting themselves in the foot.
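
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the `HealthWarning` enum, the `FailCheck` struct, the `check_mds_fail` function, and the message wording are all hypothetical names made up for this sketch, and the real MDSMonitor logic may be structured differently.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-ins for the MDS health warnings tracked by the monitor.
// Only the split between the two gated warnings and "anything else" matters.
enum class HealthWarning { Trim, CacheOversized, SlowRequest };

// Result of the pre-flight check: whether the fail may proceed and, if not,
// an illustrative message for the operator (wording invented for this sketch).
struct FailCheck {
  bool allowed;
  std::string message;
};

// Refuse to fail an MDS that reports a trim or oversized-cache warning
// unless the operator explicitly confirmed the action.
FailCheck check_mds_fail(const std::set<HealthWarning>& warnings,
                         bool confirmed) {
  const bool slow_recovery_risk =
      warnings.count(HealthWarning::Trim) > 0 ||
      warnings.count(HealthWarning::CacheOversized) > 0;
  if (slow_recovery_risk && !confirmed) {
    return {false,
            "MDS has health warnings that may make recovery very slow; "
            "pass --yes-i-really-mean-it to proceed"};
  }
  return {true, ""};
}

// Usage example covering the cases this ticket cares about.
void example_usage() {
  // Gated warning without confirmation: rejected.
  assert(!check_mds_fail({HealthWarning::Trim}, false).allowed);
  // Same warning with confirmation: allowed.
  assert(check_mds_fail({HealthWarning::Trim}, true).allowed);
  // A warning outside the two gated ones does not block the fail.
  assert(check_mds_fail({HealthWarning::SlowRequest}, false).allowed);
}
```

Keeping the gated set to exactly two warnings keeps the confirmation prompt rare, so it only appears when the two conditions named in this ticket are present.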

Related issues 3 (3 open, 0 closed)

Related to CephFS - Bug #65841: qa: dead job from `tasks.cephfs.test_admin.TestFSFail.test_with_health_warn_oversize_cache` (Fix Under Review, Rishabh Dave)
Copied to CephFS - Backport #65927: reef: MDSMonitor: require --yes-i-really-mean-it when failing an MDS with MDS_HEALTH_TRIM or MDS_HEALTH_CACHE_OVERSIZED health warnings (New, Rishabh Dave)
Copied to CephFS - Backport #65928: quincy: MDSMonitor: require --yes-i-really-mean-it when failing an MDS with MDS_HEALTH_TRIM or MDS_HEALTH_CACHE_OVERSIZED health warnings (New, Rishabh Dave)

History

Updated by Venky Shankar 11 months ago

Category set to Administration/Usability
Assignee set to Manish Yathnalli

Updated by Venky Shankar 8 months ago

Priority changed from Urgent to Immediate

Manish, please take this one on priority.

Updated by Manish Yathnalli 8 months ago

Status changed from New to In Progress

I will take a look, Venky.

Updated by Venky Shankar 3 months ago

Assignee changed from Manish Yathnalli to Venky Shankar
Backport changed from reef,quincy,pacific to reef,quincy

Updated by Venky Shankar 2 months ago

Assignee changed from Venky Shankar to Rishabh Dave

Rishabh, please take this one.

Updated by Patrick Donnelly 2 months ago

Status changed from In Progress to Fix Under Review
Pull request ID set to 56066

Updated by Rishabh Dave about 1 month ago

Patrick, should we include other health warnings too? I didn't include it in PR because it was mentioned on this ticket. Since Venky too brought this up here, I think it's worth discussing and writing a fix for it.

Copying Venky's comment below -


What about other health warnings?

enum mds_metric_t {
  MDS_HEALTH_NULL = 0,
  MDS_HEALTH_TRIM,
  MDS_HEALTH_CLIENT_RECALL,
  MDS_HEALTH_CLIENT_LATE_RELEASE,
  MDS_HEALTH_CLIENT_RECALL_MANY,
  MDS_HEALTH_CLIENT_LATE_RELEASE_MANY,
  MDS_HEALTH_CLIENT_OLDEST_TID,
  MDS_HEALTH_CLIENT_OLDEST_TID_MANY,
  MDS_HEALTH_DAMAGE,
  MDS_HEALTH_READ_ONLY,
  MDS_HEALTH_SLOW_REQUEST,
  MDS_HEALTH_CACHE_OVERSIZED,
  MDS_HEALTH_SLOW_METADATA_IO,
  MDS_HEALTH_CLIENTS_LAGGY,
  MDS_HEALTH_CLIENTS_LAGGY_MANY,
  MDS_HEALTH_DUMMY, // not a real health warning, for testing
};

Especially MDS_HEALTH_SLOW_REQUEST, where the MDS could be running close to its limits.

Updated by Patrick Donnelly 16 days ago

Rishabh Dave wrote in #note-7:

Patrick, should we include other health warnings too? I didn't include it in PR because it was mentioned on this ticket. Since Venky too brought this up here, I think it's worth discussing and writing a fix for it.

Copying Venky's comment below -

[...]

As far as we know, the two main culprits for slow recovery are the ones included in your PR. Slow recovery is our main concern in gating MDS failover. I don't see a strong argument to include the others at this time.

Updated by Rishabh Dave 12 days ago

Status changed from Fix Under Review to Pending Backport

#10

Updated by Venky Shankar 8 days ago

Related to Bug #65841: qa: dead job from `tasks.cephfs.test_admin.TestFSFail.test_with_health_warn_oversize_cache` added

#11

Updated by Casey Bodley 5 days ago

Copied to Backport #65927: reef: MDSMonitor: require --yes-i-really-mean-it when failing an MDS with MDS_HEALTH_TRIM or MDS_HEALTH_CACHE_OVERSIZED health warnings added

#12

Updated by Casey Bodley 5 days ago

Copied to Backport #65928: quincy: MDSMonitor: require --yes-i-really-mean-it when failing an MDS with MDS_HEALTH_TRIM or MDS_HEALTH_CACHE_OVERSIZED health warnings added

#13

Updated by Casey Bodley 5 days ago

Tags set to backport_processed
