Bug #4637
closed
mds: standby takeover stuck in rejoin
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
With current master, with one active mds and one standby, if the active fails, the standby gets stuck in rejoin while trying to go active. To reproduce: start a cluster with vstart.sh -s, then kill the active mds. The mds that takes over gets stuck in rejoin.
Updated by Sam Lang about 11 years ago
- Status changed from New to Fix Under Review
Pushed a fix to wip-4637.
Updated by Greg Farnum about 11 years ago
Can you try this patch instead, and see if that works? (If it does I'll want a review from Sage or Yan; it looks okay to me but there's a lot happening here so I may be missing something.)
diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index 1fa0303..3b3b2d6 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1551,6 +1551,10 @@ void MDS::handle_mds_recovery(int who)
 
 void MDS::handle_mds_failure(int who)
 {
+  if (who == whoami) {
+    dout(5) << "handle_mds_failure for myself; not doing anything" << dendl;
+    return;
+  }
   dout(5) << "handle_mds_failure mds." << who << dendl;
 
   mdcache->handle_mds_failure(who);
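For anyone reading along, here is a rough standalone sketch of the guard the patch adds. It is not Ceph code: ToyMds, failed_peers, and the bool return value are hypothetical names made up for illustration, and only the `who == whoami` check and the two log messages come from the diff above.

```cpp
#include <iostream>
#include <set>

// Hypothetical stand-in for an MDS daemon; this is NOT the real Ceph MDS class.
struct ToyMds {
  int whoami;                  // rank this daemon currently holds
  std::set<int> failed_peers;  // ranks for which peer-failure handling ran (illustrative)

  explicit ToyMds(int rank) : whoami(rank) {}

  // Returns true if failure handling ran, false if the event was ignored.
  // The return value is an addition for this sketch; the patched function is void.
  bool handle_mds_failure(int who) {
    if (who == whoami) {
      // Same guard as the patch: a failure reported for our own rank is skipped,
      // so no peer-failure handling runs against ourselves.
      std::cout << "handle_mds_failure for myself; not doing anything\n";
      return false;
    }
    std::cout << "handle_mds_failure mds." << who << "\n";
    failed_peers.insert(who);  // stands in for mdcache->handle_mds_failure(who)
    return true;
  }
};
```

In this toy version, a daemon holding rank 0 ignores a failure event for rank 0 but still processes one for rank 1, which is the behavior the patch appears to be after for the takeover case.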
Updated by Sam Lang about 11 years ago
- Assignee changed from Sam Lang to Greg Farnum
Updated by Greg Farnum about 11 years ago
Pushed that to wip-no-fail-whoami-4637. Sage, Yan, care to check it out? :)
Updated by Zheng Yan about 11 years ago
Greg's fix looks good, sorry for the bug.
Updated by Greg Farnum about 11 years ago
- Status changed from Fix Under Review to Resolved
Thanks. Don't you ever sleep? :)
Merged into master in commit:0d6ddd926432821842a7e40fdb78d793ab0737bb