Feature #65637

open

mds: continue sending heartbeats during recovery when MDS journal is large

Added by Patrick Donnelly 10 days ago.

Status: New
Priority: Urgent
Assignee: -
Category: Administration/Usability
Target version:
% Done: 0%
Source: Development
Tags:
Backport: squid,reef
Reviewed:
Affected Versions:
Component(FS): MDS
Labels (FS):
Pull request ID:

Description

When the MDS reaches up:rejoin / up:resolve after spending a long time (hours) in up:replay, it often gets stuck in a loop somewhere with the mds_lock. This causes it to miss heartbeat resets. Consequently, the beacon thread stops sending beacons to the monitors.

Make the MDS smarter by:

- If replay took X time, lengthen the internal heartbeat grace period by some configurable factor during up:resolve/up:rejoin.
- Note in beacons a new health warning about long recovery during these states.
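
A minimal sketch of the two points above, written in Python for brevity and not taken from the MDS code. It reads "X" as a threshold on replay duration, which is only one possible interpretation of the proposal. The names `RecoveryPolicy`, `long_replay_threshold`, `grace_factor`, and the warning code `MDS_LONG_RECOVERY` are hypothetical and do not correspond to existing Ceph options or health checks.

```python
from dataclasses import dataclass
from typing import Optional

# States in which the extended grace would apply, per the proposal.
EXTENDED_STATES = {"up:resolve", "up:rejoin"}


@dataclass(frozen=True)
class RecoveryPolicy:
    base_grace: float             # normal heartbeat grace, in seconds
    long_replay_threshold: float  # the "X" above, in seconds (hypothetical)
    grace_factor: float           # configurable multiplier (hypothetical)


def is_long_recovery(policy: RecoveryPolicy, state: str,
                     replay_duration: float) -> bool:
    """True if the MDS is in resolve/rejoin after a long replay."""
    return (state in EXTENDED_STATES
            and replay_duration >= policy.long_replay_threshold)


def effective_grace(policy: RecoveryPolicy, state: str,
                    replay_duration: float) -> float:
    """Return the heartbeat grace (seconds) to apply in `state`."""
    if is_long_recovery(policy, state, replay_duration):
        return policy.base_grace * policy.grace_factor
    return policy.base_grace


def recovery_warning(policy: RecoveryPolicy, state: str,
                     replay_duration: float) -> Optional[str]:
    """Return a hypothetical health-warning code for the beacon, or None."""
    if is_long_recovery(policy, state, replay_duration):
        return "MDS_LONG_RECOVERY"  # hypothetical name, not a real health check
    return None
```

With a base grace of 15 seconds, a one-hour threshold, and a factor of 4, a two-hour replay would give a 60-second grace in up:resolve and up:rejoin, and the base 15 seconds in every other state.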


Related issues 2 (2 open, 0 closed)

Related to CephFS - Feature #61863: mds: issue a health warning with estimated time to complete replay (Fix Under Review, Venky Shankar)

Related to CephFS - Bug #65658: mds: MetricAggregator::ms_can_fast_dispatch2 acquires locks (Fix Under Review, Patrick Donnelly)

#1

Updated by Patrick Donnelly 10 days ago

  • Related to Feature #61863: mds: issue a health warning with estimated time to complete replay added

#2

Updated by Patrick Donnelly 9 days ago

  • Related to Bug #65658: mds: MetricAggregator::ms_can_fast_dispatch2 acquires locks added