Bug #23658

MDSMonitor: crash after assigning standby-replay daemon in multifs setup

Added by Patrick Donnelly 11 months ago. Updated 5 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Correctness/Safety
Target version:
Start date: 04/11/2018
Due date:
% Done: 0%
Source: Community (dev)
Tags:
Backport: luminous,jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDSMonitor, qa-suite
Labels (FS): crash, multifs
Pull request ID:

Description

From: https://github.com/rook/rook/issues/1027

2017-09-29 21:55:06.978169 I | rook-ceph-mon0:      0> 2017-09-29 21:55:06.961413 7f55aba29700 -1 /build/ceph/src/mds/FSMap.cc: In function 'void FSMap::assign_standby_replay(mds_gid_t, fs_cluster_id_t, mds_rank_t)' thread 7f55aba29700 time 2017-09-29 21:55:06.957486
2017-09-29 21:55:06.978179 I | rook-ceph-mon0: /build/ceph/src/mds/FSMap.cc: 870: FAILED assert(mds_roles.at(standby_gid) == FS_CLUSTER_ID_NONE)
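The failed assert enforces an invariant on the gid-to-filesystem table: a daemon may only be promoted to standby-replay while its GID still maps to FS_CLUSTER_ID_NONE, i.e. while it is not yet assigned to any filesystem. Below is a minimal, self-contained C++ sketch of that invariant (toy types and a toy map, not the actual FSMap code), only to illustrate the condition the monitor tripped here:

    // Illustrative sketch only -- NOT the actual Ceph FSMap code.
    // Models the invariant behind the failed assert: a GID may only be
    // promoted to standby-replay while it is still unassigned
    // (maps to FS_CLUSTER_ID_NONE in the gid -> filesystem table).
    #include <cassert>
    #include <cstdint>
    #include <map>

    using mds_gid_t = std::uint64_t;
    using fs_cluster_id_t = std::int32_t;
    constexpr fs_cluster_id_t FS_CLUSTER_ID_NONE = -1;

    struct ToyFSMap {
      // gid -> filesystem the daemon currently belongs to
      std::map<mds_gid_t, fs_cluster_id_t> mds_roles;

      void assign_standby_replay(mds_gid_t standby_gid, fs_cluster_id_t leader_ns) {
        // The check that fires at FSMap.cc:870 in the report: the caller
        // handed in a GID that is already assigned to a filesystem.
        assert(mds_roles.at(standby_gid) == FS_CLUSTER_ID_NONE);
        mds_roles[standby_gid] = leader_ns;
      }
    };

    int main() {
      ToyFSMap fsmap;
      fsmap.mds_roles[42] = FS_CLUSTER_ID_NONE;  // unassigned standby
      fsmap.assign_standby_replay(42, 1);        // fine: becomes standby-replay for fs 1
      // Promoting the same GID again (now mapped to fs 1) would trip the
      // assert, which is the abort seen in the log above:
      // fsmap.assign_standby_replay(42, 2);
      return 0;
    }

In other words, the monitor apparently passed assign_standby_replay() a GID that was already assigned to one of the filesystems, so the mds_roles lookup no longer returned FS_CLUSTER_ID_NONE and the mon aborted.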

It would appear there is an issue with how the mon assigns the standby-replay daemon after a third filesystem is added. The configuration of the filesystems in the cluster was:

    myfs: two mds active, two mds on standby-replay
    yourfs: three mds active, three mds on standby
    jaredsfs: one mds active, one mds on standby-replay

After the first two were created, ceph status showed the following mds status:

 mds: myfs-2/2/2 up yourfs-3/3/3 up  {[myfs:0]=msdfdx=up:active,[myfs:1]=m88104=up:active,[yourfs:0]=m739m0=up:active,[yourfs:1]=mdv8k2=up:active,[yourfs:2]=m6ktsw=up:active}, 2 up:standby-replay, 3 up:standby
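For illustration only (this is not the actual MDSMonitor logic or the eventual fix; see the linked backports), the constraint implied by the assert is that standby-replay promotion must only consider daemons whose GID is still unassigned. A hedged sketch of such a candidate filter, reusing the toy types from the sketch above:

    // Illustrative sketch only -- not the actual MDSMonitor code.
    // When choosing a standby to promote to standby-replay, skip any GID
    // that is already assigned to some filesystem.
    #include <cstdint>
    #include <map>
    #include <optional>

    using mds_gid_t = std::uint64_t;
    using fs_cluster_id_t = std::int32_t;
    constexpr fs_cluster_id_t FS_CLUSTER_ID_NONE = -1;

    std::optional<mds_gid_t>
    pick_standby_replay_candidate(const std::map<mds_gid_t, fs_cluster_id_t>& mds_roles) {
      for (const auto& [gid, fscid] : mds_roles) {
        if (fscid == FS_CLUSTER_ID_NONE) {
          return gid;  // still a plain standby: safe to promote
        }
      }
      return std::nullopt;  // no unassigned standby available
    }

With several filesystems competing for standbys, skipping already-assigned GIDs is what keeps the precondition of assign_standby_replay() intact.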


Related issues

Blocks fs - Feature #22477: multifs: remove experimental warnings New 12/19/2017
Copied to fs - Backport #23833: luminous: MDSMonitor: crash after assigning standby-replay daemon in multifs setup Resolved
Copied to fs - Backport #23834: jewel: MDSMonitor: crash after assigning standby-replay daemon in multifs setup Rejected

History

#1 Updated by Patrick Donnelly 11 months ago

  • Labels (FS) crash added

#2 Updated by Patrick Donnelly 11 months ago

#3 Updated by Patrick Donnelly 11 months ago

  • Priority changed from Normal to Urgent

#4 Updated by Zheng Yan 11 months ago

  • Backport set to luminous, jewel

#5 Updated by Zheng Yan 11 months ago

  • Status changed from New to Need Review

#6 Updated by Patrick Donnelly 11 months ago

  • Assignee set to Zheng Yan
  • Target version changed from v14.0.0 to v13.0.0

#7 Updated by Patrick Donnelly 11 months ago

  • Status changed from Need Review to Pending Backport
  • Backport changed from luminous, jewel to luminous,jewel

#8 Updated by Nathan Cutler 11 months ago

  • Copied to Backport #23833: luminous: MDSMonitor: crash after assigning standby-replay daemon in multifs setup added

#9 Updated by Nathan Cutler 11 months ago

  • Copied to Backport #23834: jewel: MDSMonitor: crash after assigning standby-replay daemon in multifs setup added

#10 Updated by Travis Nielsen 11 months ago

When this issue hits, is there a way to recover? For example, is there a way to forcefully remove the multiple filesystems that are causing the crash? With the mons crashing, the cluster is just down.

#11 Updated by Patrick Donnelly 5 months ago

  • Status changed from Pending Backport to Resolved
