Bug #23658

MDSMonitor: crash after assigning standby-replay daemon in multifs setup

Added by Patrick Donnelly 11 months ago. Updated 5 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Correctness/Safety
Target version:
Start date: 04/11/2018
Due date:
% Done: 0%
Source: Community (dev)
Tags:
Backport: luminous,jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDSMonitor, qa-suite
Labels (FS): crash, multifs
Pull request ID:

Description

From: https://github.com/rook/rook/issues/1027

2017-09-29 21:55:06.978169 I | rook-ceph-mon0:      0> 2017-09-29 21:55:06.961413 7f55aba29700 -1 /build/ceph/src/mds/FSMap.cc: In function 'void FSMap::assign_standby_replay(mds_gid_t, fs_cluster_id_t, mds_rank_t)' thread 7f55aba29700 time 2017-09-29 21:55:06.957486
2017-09-29 21:55:06.978179 I | rook-ceph-mon0: /build/ceph/src/mds/FSMap.cc: 870: FAILED assert(mds_roles.at(standby_gid) == FS_CLUSTER_ID_NONE)
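The failed assert enforces an invariant on the gid-to-filesystem table: a daemon may only be promoted to standby-replay while its GID still maps to FS_CLUSTER_ID_NONE, i.e. while it is not yet assigned to any filesystem. Below is a minimal, self-contained C++ sketch of that invariant (toy types and a toy map, not the actual FSMap code), only to illustrate the condition the monitor tripped here:

    // Illustrative sketch only -- NOT the actual Ceph FSMap code.
    // Models the invariant behind the failed assert: a GID may only be
    // promoted to standby-replay while it is still unassigned
    // (maps to FS_CLUSTER_ID_NONE in the gid -> filesystem table).
    #include <cassert>
    #include <cstdint>
    #include <map>

    using mds_gid_t = std::uint64_t;
    using fs_cluster_id_t = std::int32_t;
    constexpr fs_cluster_id_t FS_CLUSTER_ID_NONE = -1;

    struct ToyFSMap {
      // gid -> filesystem the daemon currently belongs to
      std::map<mds_gid_t, fs_cluster_id_t> mds_roles;

      void assign_standby_replay(mds_gid_t standby_gid, fs_cluster_id_t leader_ns) {
        // The check that fires at FSMap.cc:870 in the report: the caller
        // handed in a GID that is already assigned to a filesystem.
        assert(mds_roles.at(standby_gid) == FS_CLUSTER_ID_NONE);
        mds_roles[standby_gid] = leader_ns;
      }
    };

    int main() {
      ToyFSMap fsmap;
      fsmap.mds_roles[42] = FS_CLUSTER_ID_NONE;  // unassigned standby
      fsmap.assign_standby_replay(42, 1);        // fine: becomes standby-replay for fs 1
      // Promoting the same GID again (now mapped to fs 1) would trip the
      // assert, which is the abort seen in the log above:
      // fsmap.assign_standby_replay(42, 2);
      return 0;
    }

In other words, the monitor apparently passed assign_standby_replay() a GID that was already assigned to one of the filesystems, so the mds_roles lookup no longer returned FS_CLUSTER_ID_NONE and the mon aborted.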

It would appear there is an issue with how the mon assigns the standby-replay daemon after a third filesystem is added. The configuration of the filesystems in the cluster was:

    myfs: two mds active, two mds on standby-replay
    yourfs: three mds active, three mds on standby
    jaredsfs: one mds active, one mds on standby-replay

After the first two were created, ceph status showed the following mds status:

 mds: myfs-2/2/2 up yourfs-3/3/3 up  {[myfs:0]=msdfdx=up:active,[myfs:1]=m88104=up:active,[yourfs:0]=m739m0=up:active,[yourfs:1]=mdv8k2=up:active,[yourfs:2]=m6ktsw=up:active}, 2 up:standby-replay, 3 up:standby
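For illustration only (this is not the actual MDSMonitor logic or the eventual fix; see the linked backports), the constraint implied by the assert is that standby-replay promotion must only consider daemons whose GID is still unassigned. A hedged sketch of such a candidate filter, reusing the toy types from the sketch above:

    // Illustrative sketch only -- not the actual MDSMonitor code.
    // When choosing a standby to promote to standby-replay, skip any GID
    // that is already assigned to some filesystem.
    #include <cstdint>
    #include <map>
    #include <optional>

    using mds_gid_t = std::uint64_t;
    using fs_cluster_id_t = std::int32_t;
    constexpr fs_cluster_id_t FS_CLUSTER_ID_NONE = -1;

    std::optional<mds_gid_t>
    pick_standby_replay_candidate(const std::map<mds_gid_t, fs_cluster_id_t>& mds_roles) {
      for (const auto& [gid, fscid] : mds_roles) {
        if (fscid == FS_CLUSTER_ID_NONE) {
          return gid;  // still a plain standby: safe to promote
        }
      }
      return std::nullopt;  // no unassigned standby available
    }

With several filesystems competing for standbys, skipping already-assigned GIDs is what keeps the precondition of assign_standby_replay() intact.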


Related issues

Blocks fs - Feature #22477: multifs: remove experimental warnings New 12/19/2017
Copied to fs - Backport #23833: luminous: MDSMonitor: crash after assigning standby-replay daemon in multifs setup Resolved
Copied to fs - Backport #23834: jewel: MDSMonitor: crash after assigning standby-replay daemon in multifs setup Rejected

History

#1 Updated by Patrick Donnelly 11 months ago

  • Labels (FS) crash added

#2 Updated by Patrick Donnelly 11 months ago

#3 Updated by Patrick Donnelly 11 months ago

  • Priority changed from Normal to Urgent

#4 Updated by Zheng Yan 11 months ago

  • Backport set to luminous, jewel

#5 Updated by Zheng Yan 11 months ago

  • Status changed from New to Need Review

#6 Updated by Patrick Donnelly 11 months ago

  • Assignee set to Zheng Yan
  • Target version changed from v14.0.0 to v13.0.0

#7 Updated by Patrick Donnelly 11 months ago

  • Status changed from Need Review to Pending Backport
  • Backport changed from luminous, jewel to luminous,jewel

#8 Updated by Nathan Cutler 11 months ago

  • Copied to Backport #23833: luminous: MDSMonitor: crash after assigning standby-replay daemon in multifs setup added

#9 Updated by Nathan Cutler 11 months ago

  • Copied to Backport #23834: jewel: MDSMonitor: crash after assigning standby-replay daemon in multifs setup added

#10 Updated by Travis Nielsen 11 months ago

When this issue hits, is there a way to recover? For example, is there a way to forcefully remove the multiple filesystems that are causing the crash? With the mons crashing, the cluster is just down.

#11 Updated by Patrick Donnelly 5 months ago

  • Status changed from Pending Backport to Resolved
