Bug #4637

closed

mds: standby takeover stuck in rejoin

Added by Sam Lang about 11 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On current master, with one active MDS and one standby, if the active MDS fails, the standby gets stuck in rejoin while trying to become active. This is reproducible with vstart.sh -s: kill the active MDS, and the MDS that takes over gets stuck in rejoin.

Actions #1

Updated by Sam Lang about 11 years ago

  • Status changed from New to Fix Under Review

Pushed a fix to wip-4637.

Actions #2

Updated by Greg Farnum about 11 years ago

Can you try this patch instead, and see if that works? (If it does I'll want a review from Sage or Yan; it looks okay to me but there's a lot happening here so I may be missing something.)

diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index 1fa0303..3b3b2d6 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1551,6 +1551,10 @@ void MDS::handle_mds_recovery(int who)

 void MDS::handle_mds_failure(int who)
 {
+  if (who == whoami) {
+    dout(5) << "handle_mds_failure for myself; not doing anything" << dendl;
+    return;
+  }
   dout(5) << "handle_mds_failure mds." << who << dendl;

   mdcache->handle_mds_failure(who);
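
For readers without the Ceph tree at hand, the guard in the patch above can be illustrated in isolation. This is only a minimal sketch under stated assumptions: `ToyMDS` and `failed_peers` are hypothetical names invented for illustration and are not part of Ceph, and the real `MDS::handle_mds_failure` does considerably more work after the check.

```cpp
#include <set>

// Hypothetical, heavily simplified stand-in for an MDS daemon.
// This is NOT Ceph's MDS class; it only models the early-return guard.
struct ToyMDS {
  int whoami;                  // rank held by this daemon
  std::set<int> failed_peers;  // ranks whose failure has been processed

  explicit ToyMDS(int rank) : whoami(rank) {}

  // Returns true if the failure was processed, false if it was ignored.
  bool handle_mds_failure(int who) {
    if (who == whoami) {
      // The report names our own rank: skip all failure handling,
      // mirroring the "not doing anything" branch in the patch.
      return false;
    }
    // Assumption: stands in for the cache and other failure handling
    // that the real function performs for a failed peer.
    failed_peers.insert(who);
    return true;
  }
};
```

With `ToyMDS mds(0);`, a call to `mds.handle_mds_failure(0)` is ignored and leaves `failed_peers` empty, while `mds.handle_mds_failure(1)` records rank 1 as failed.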

Actions #3

Updated by Sam Lang about 11 years ago

  • Assignee changed from Sam Lang to Greg Farnum

Actions #4

Updated by Greg Farnum about 11 years ago

Pushed that to wip-no-fail-whoami-4637. Sage, Yan, care to check it out? :)

Actions #5

Updated by Zheng Yan about 11 years ago

Greg's fix looks good, sorry for the bug.

Actions #6

Updated by Greg Farnum about 11 years ago

  • Status changed from Fix Under Review to Resolved

Thanks. Don't you ever sleep? :)

Merged into master in commit:0d6ddd926432821842a7e40fdb78d793ab0737bb

Actions #7

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added