Bug #4637

mds: standby takeover stuck in rejoin

Added by Sam Lang about 8 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

With current master, one active mds, and one standby, the standby gets stuck in rejoin while trying to go active after the active mds fails. To reproduce: start a cluster with vstart.sh -s, then kill the active mds; the mds that takes over gets stuck in rejoin.

Associated revisions

Revision 0d6ddd92 (diff)
Added by Greg Farnum about 8 years ago

mds: do not go through handle_mds_failure for oneself

A standby MDS can attempt the handle_mds_failure paths for itself, if
it sees the transition from up to down. This leads it to insert itself
into the resolve_gather set, which is bad. So check if the failed MDS
is the same as whoami, and abort if so. This fixes #4637.

Signed-off-by: Greg Farnum <>
Reviewed-by: Yan, Zheng <>
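
As a rough illustration of why the self-check in this commit matters, the toy model below (not Ceph code) treats resolve_gather as a set of ranks that must drain before recovery can proceed. ToyMDS, guard_self, handle_peer_resolve, gather_done, and self_failure_demo are hypothetical names, and the rule that a daemon never receives a resolve from itself is a simplifying assumption made for the sketch.

```cpp
#include <cassert>
#include <set>

// Toy stand-in for an MDS daemon. Hypothetical; not the real Ceph MDS class.
struct ToyMDS {
  int whoami;                    // this daemon's own rank
  bool guard_self;               // true = behave like the patched code
  std::set<int> resolve_gather;  // ranks we still expect to hear from

  // Same shape as the patched check: ignore a failure report about ourselves.
  void handle_mds_failure(int who) {
    if (guard_self && who == whoami) {
      return;  // not doing anything for myself
    }
    resolve_gather.insert(who);
  }

  // Assumption for this sketch: only other ranks can clear a gather entry.
  void handle_peer_resolve(int from) {
    if (from != whoami) {
      resolve_gather.erase(from);
    }
  }

  bool gather_done() const { return resolve_gather.empty(); }
};

// Shows both behaviours side by side for rank 0 seeing its own failure.
void self_failure_demo() {
  ToyMDS unpatched{0, false, {}};
  unpatched.handle_mds_failure(0);
  assert(!unpatched.gather_done());  // waits on itself, so it never drains

  ToyMDS patched{0, true, {}};
  patched.handle_mds_failure(0);
  assert(patched.gather_done());     // self-failure ignored, nothing to wait on
}
```

In this model the unguarded daemon ends up with its own rank in the set and nothing can ever remove it, which is one plausible way to picture the stuck takeover; the real recovery path involves far more state than a single set.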

History

#1 Updated by Sam Lang about 8 years ago

  • Status changed from New to Fix Under Review

Pushed a fix to wip-4637.

#2 Updated by Greg Farnum about 8 years ago

Can you try this patch instead, and see if that works? (If it does I'll want a review from Sage or Yan; it looks okay to me but there's a lot happening here so I may be missing something.)

diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index 1fa0303..3b3b2d6 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1551,6 +1551,10 @@ void MDS::handle_mds_recovery(int who)

 void MDS::handle_mds_failure(int who)
 {
+  if (who == whoami) {
+    dout(5) << "handle_mds_failure for myself; not doing anything" << dendl;
+    return;
+  }
   dout(5) << "handle_mds_failure mds." << who << dendl;

   mdcache->handle_mds_failure(who);

#3 Updated by Sam Lang about 8 years ago

  • Assignee changed from Sam Lang to Greg Farnum

#4 Updated by Greg Farnum about 8 years ago

Pushed that to wip-no-fail-whoami-4637. Sage, Yan, care to check it out? :)

#5 Updated by Zheng Yan about 8 years ago

Greg's fix looks good, sorry for the bug.

#6 Updated by Greg Farnum about 8 years ago

  • Status changed from Fix Under Review to Resolved

Thanks. Don't you ever sleep? :)

Merged into master in commit:0d6ddd926432821842a7e40fdb78d793ab0737bb

#7 Updated by Greg Farnum over 4 years ago

  • Component(FS) MDS added
