Bug #4637
closed
mds: standby takeover stuck in rejoin
Added by Sam Lang about 11 years ago.
Updated almost 8 years ago.
Description
With current master and one active mds plus one standby, if the active mds fails, the standby gets stuck in rejoin while trying to go active. To reproduce: start a cluster with vstart.sh -s, then kill the active mds; the takeover mds gets stuck in rejoin.
- Status changed from New to Fix Under Review
Pushed a fix to wip-4637.
Can you try this patch instead, and see if that works? (If it does I'll want a review from Sage or Yan; it looks okay to me but there's a lot happening here so I may be missing something.)
diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index 1fa0303..3b3b2d6 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1551,6 +1551,10 @@ void MDS::handle_mds_recovery(int who)
 void MDS::handle_mds_failure(int who)
 {
+  if (who == whoami) {
+    dout(5) << "handle_mds_failure for myself; not doing anything" << dendl;
+    return;
+  }
   dout(5) << "handle_mds_failure mds." << who << dendl;
   mdcache->handle_mds_failure(who);
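For readers skimming the thread, the guard in the patch above can be sketched in isolation. This is only an illustration, not Ceph code: `ToyMds` and `failed_peers` are hypothetical names, and the reading that the self-failure event arrives because the standby now holds the failed rank is an assumption, not something stated in this ticket. Only `whoami` and `handle_mds_failure` mirror identifiers from the diff.

```cpp
#include <cassert>
#include <set>

// Toy stand-in for an MDS daemon. Hypothetical: everything here except the
// names whoami and handle_mds_failure is invented for illustration.
struct ToyMds {
  int whoami;                  // rank this daemon currently holds
  std::set<int> failed_peers;  // ranks for which peer-failure cleanup ran

  explicit ToyMds(int rank) : whoami(rank) {
    assert(rank >= 0);  // a rank is a non-negative index in this sketch
  }

  // Returns true if peer-failure cleanup ran, false if the event was ignored.
  bool handle_mds_failure(int who) {
    if (who == whoami) {
      // Failure reported for our own rank (assumed: we are the replacement
      // for the daemon that died). Treating ourselves as a failed peer
      // would be wrong, so do nothing.
      return false;
    }
    // Normal path: record that we cleaned up after a failed peer.
    failed_peers.insert(who);
    return true;
  }
};
```

In this sketch, a `ToyMds` constructed with rank 0 ignores `handle_mds_failure(0)` and leaves `failed_peers` empty, while `handle_mds_failure(1)` runs the cleanup path and records rank 1.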
- Assignee changed from Sam Lang to Greg Farnum
Pushed that to wip-no-fail-whoami-4637. Sage, Yan, care to check it out? :)
Greg's fix looks good, sorry for the bug.
- Status changed from Fix Under Review to Resolved
Thanks. Don't you ever sleep? :)
Merged into master in commit:0d6ddd926432821842a7e40fdb78d793ab0737bb