Bug #47881: mon/MDSMonitor: stop all MDS processes in the cluster at the same time. Some MDS cannot enter the "failed" state - CephFS - Ceph

Bug #47881

closed

Added by wei qiaomiao over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee: wei qiaomiao
Category:
Target version: Ceph - v16.0.0
% Done:
Source: Community (dev)
Tags:
Backport: nautilus,octopus
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDSMonitor
Labels (FS):
Pull request ID: 37702
Crash signature (v1):
Crash signature (v2):

Description

Stop all MDS processes in the cluster at the same time. After all MDS processes have exited, the "ceph fs status" command still shows some MDS daemons in the "active(laggy)" or "resolve(laggy)" state.

Logs as follows:

2020-10-16 16:14:27.629 7f1f7ac52700 5 mon.host-192-168-9-2@0(leader).mds e962 preprocess_beacon mdsbeacon(48335776/host-192-168-9-4-9 down:dne seq 10044 v961) v7 from mds.? [v2:100.100.8.4:6842/2091715094,v1:100.100.8.4:6843/2091715094] compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
2020-10-16 16:14:27.629 7f1f7ac52700 10 mon.host-192-168-9-2@0(leader).mds e962 preprocess_beacon: GID exists in map: 48335776
2020-10-16 16:14:27.629 7f1f7ac52700 10 mon.host-192-168-9-2@0(leader).mds e962 mds_beacon mdsbeacon(48335776/host-192-168-9-4-9 down:dne seq 10044 v961) v7 ignoring requested state, because mds hasn't seen latest map
2020-10-16 16:14:27.629 7f1f7ac52700 5 mon.host-192-168-9-2@0(leader).mds e962 _note_beacon mdsbeacon(48335776/host-192-168-9-4-9 down:dne seq 10044 v961) v7 noting time
2020-10-16 16:14:27.629 7f1f7ac52700 2 mon.host-192-168-9-2@0(leader) e1 send_reply 0x55c91e02e410 0x55c91e524000 mdsbeacon(48335776/host-192-168-9-4-9 down:dne seq 10044 v962) v7
2020-10-16 16:14:27.629 7f1f7ac52700 15 mon.host-192-168-9-2@0(leader) e1 send_reply routing reply to v2:100.100.8.4:6842/2091715094 via v2:100.100.8.3:3300/0 for request mdsbeacon(48335776/host-192-168-9-4-9 down:dne seq 10044 v961) v7
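
One way to read the log above: the monitor is at map epoch e962 while the beacon still carries v961, which matches the "ignoring requested state, because mds hasn't seen latest map" message. The Python sketch below extracts those two numbers from a log line. The regular expression, the function names, and the reading of the trailing `v961` as the last map epoch seen by the MDS are assumptions made for illustration, not Ceph identifiers.

```python
import re

# Assumed layout, taken from the log lines above:
#   "... .mds e<mon_epoch> ... mdsbeacon(<gid>/<name> <state> seq <seq> v<seen>) ..."
BEACON_RE = re.compile(
    r"\.mds e(?P<mon_epoch>\d+) .*?"
    r"mdsbeacon\((?P<gid>\d+)/(?P<name>\S+) (?P<state>\S+) "
    r"seq (?P<seq>\d+) v(?P<seen_epoch>\d+)\)"
)


def parse_beacon(line):
    """Return the beacon fields found in one monitor log line, or None."""
    match = BEACON_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared.
    for key in ("mon_epoch", "gid", "seq", "seen_epoch"):
        fields[key] = int(fields[key])
    return fields


def beacon_is_behind(fields):
    """True when the beacon reports an older map epoch than the monitor has."""
    return fields["seen_epoch"] < fields["mon_epoch"]


if __name__ == "__main__":
    sample = (
        "2020-10-16 16:14:27.629 7f1f7ac52700 10 "
        "mon.host-192-168-9-2@0(leader).mds e962 mds_beacon "
        "mdsbeacon(48335776/host-192-168-9-4-9 down:dne seq 10044 v961) v7 "
        "ignoring requested state, because mds hasn't seen latest map"
    )
    beacon = parse_beacon(sample)
    print(beacon["state"], beacon["seen_epoch"], beacon["mon_epoch"])
    print("behind:", beacon_is_behind(beacon))
```

Lines without an `mdsbeacon(...)` entry, such as unrelated monitor messages, make `parse_beacon` return `None`, so the helper can be applied to a whole log file and the misses filtered out.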

Related issues 2 (0 open — 2 closed)

Updated by Zheng Yan over 3 years ago

This is by design. The monitor never marks a laggy MDS as failed if there is no replacement.
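
The rule described here can be written as a small decision function. This is a simplified model of the stated behaviour, not the MDSMonitor code; the function name and its parameters are hypothetical.

```python
def monitor_marks_failed(is_laggy, replacement_available):
    """Simplified model of the rule stated above (not Ceph source code).

    A laggy MDS is only marked failed when another daemon can take over.
    """
    return is_laggy and replacement_available


if __name__ == "__main__":
    # All MDS daemons stopped at once: every rank is laggy, none can take over.
    print(monitor_marks_failed(is_laggy=True, replacement_available=False))
    # One MDS stops while a standby is still running.
    print(monitor_marks_failed(is_laggy=True, replacement_available=True))
```

In the scenario from the description, no daemon is left to act as a replacement, so under this model the stopped daemons stay in a laggy state.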

Updated by wei qiaomiao over 3 years ago

Zheng Yan wrote:

This is by design. The monitor never marks a laggy MDS as failed if there is no replacement.

The pull request is: https://github.com/ceph/ceph/pull/37702

Updated by Kefu Chai over 3 years ago

Pull request ID set to 37702

Updated by Patrick Donnelly over 3 years ago

Status changed from New to Need More Info
Assignee set to wei qiaomiao

Would `ceph fs fail <fs_name>` not be the command you want?

Updated by wei qiaomiao over 3 years ago

Patrick Donnelly wrote:

Would `ceph fs fail <fs_name>` not be the command you want?

"ceph mds fail <role_or_gid>" can mark the mds failed.
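
The commands mentioned in this thread can be scripted. Below is a minimal Python sketch that builds the command lines for `ceph fs status`, `ceph fs fail <fs_name>`, and `ceph mds fail <role_or_gid>`. The helper names and the placeholder file system name "cephfs" are hypothetical, and actually running the commands assumes a `ceph` CLI with suitable credentials on the host.

```python
import shlex
import subprocess


def fs_status_cmd():
    """Command line that shows the state of each MDS."""
    return ["ceph", "fs", "status"]


def fs_fail_cmd(fs_name):
    """Command line that fails a whole file system."""
    return ["ceph", "fs", "fail", fs_name]


def mds_fail_cmd(role_or_gid):
    """Command line that marks one MDS, given by role or GID, as failed."""
    return ["ceph", "mds", "fail", str(role_or_gid)]


def run(cmd):
    """Run a command on a host with the ceph CLI and return its output."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Only print the command lines here; call run(...) on a real cluster.
    print(shlex.join(fs_status_cmd()))
    # "cephfs" is a placeholder file system name (assumption).
    print(shlex.join(fs_fail_cmd("cephfs")))
    # 48335776 is the GID that appears in the log in the description.
    print(shlex.join(mds_fail_cmd(48335776)))
```

Keeping the command builders separate from `run` makes it possible to review or log the exact command before it is executed against a cluster.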

Updated by Patrick Donnelly over 3 years ago

Status changed from Need More Info to Fix Under Review
Target version set to v16.0.0
Backport set to nautilus,octopus
Component(FS) MDSMonitor added
Component(FS) deleted (MDS)
Labels (FS) deleted (multimds)

Updated by Patrick Donnelly over 3 years ago

Subject changed from multimds: stop all MDS processes in the cluster at the same time. Some MDS cannot enter the "failed" state to mon/MDSMonitor: stop all MDS processes in the cluster at the same time. Some MDS cannot enter the "failed" state
Status changed from Fix Under Review to Pending Backport
Source set to Community (dev)

Updated by Nathan Cutler over 3 years ago

Copied to Backport #47957: nautilus: mon/MDSMonitor: stop all MDS processes in the cluster at the same time. Some MDS cannot enter the "failed" state added

Updated by Nathan Cutler over 3 years ago

Copied to Backport #47958: octopus: mon/MDSMonitor: stop all MDS processes in the cluster at the same time. Some MDS cannot enter the "failed" state added

Updated by Nathan Cutler over 3 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
