Bug #19437

fs: The mount point breaks off when an MDS switch happens.

Added by Ivan Guan about 2 months ago. Updated about 1 month ago.

Status: Pending Backport
Priority: High
Assignee:
Category: -
Target version: -
Start date: 03/30/2017
Due date:
% Done: 0%
Source:
Tags:
Backport: jewel, kraken
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release: jewel
Component(FS):
Needs Doc: No

Description

My Ceph version is Jewel and my cluster has two nodes. I start two MDS daemons, one active and the other in hot-standby mode, and mount with ceph-fuse (libcephfs.so).
When the active MDS goes down, the hot-standby becomes active as expected, but the mount point strangely stops working.

Analysis:
We all know that a client creates a session for communicating with the server, and we can list it with "ceph daemon mds.x session ls --cluster CLUSTER_NAME".
The hot-standby MDS also replays the session into its in-memory sessionmap, but I find that the session is gone once the hot-standby MDS has taken over the service,
and "ceph daemon mds.x session ls --cluster CLUSTER_NAME" can no longer list it either. So I guess something killed it. I therefore added some code and logging to the function
find_idle_sessions to verify my guess.

void Server::find_idle_sessions() {
  ...
  while (1) {
    Session *session = mds->sessionmap.get_oldest_session(Session::STATE_STALE);
    if (!session)
      break;
    if (session->is_importing()) {
      dout(10) << "stopping at importing session " << session->info.inst << dendl;
      break;
    }
    assert(session->is_stale());
    if (session->last_cap_renew >= cutoff) {
      dout(20) << "oldest stale session is " << session->info.inst << " and sufficiently new ("
               << session->last_cap_renew << ")" << dendl;
      break;
    }

    utime_t age = now;
    age -= session->last_cap_renew;
    mds->clog->info() << "closing stale session " << session->info.inst
                      << " after " << age << "\n";

    dout(10) << "autoclosing stale session " << session->info.inst << " last " << session->last_cap_renew << dendl;
    kill_session(session, NULL);
  }
}
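
To make the loop easier to reason about outside the MDS, here is a minimal standalone sketch of the same walk over stale sessions. It is not the Ceph implementation: StaleSession and sessions_to_autoclose are hypothetical names, times are plain seconds, and the cutoff is assumed to be "now minus mds_session_autoclose", since that part of the function is elided above.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Hypothetical stand-in for the MDS Session; only the fields the loop reads.
struct StaleSession {
  std::string client;     // e.g. "client.590301"
  double last_cap_renew;  // seconds on an arbitrary clock
  bool importing;
};

// Walks stale sessions oldest-first, like the loop above, and returns the
// clients it would close. `stale` must already be ordered oldest-first.
// Assumption: cutoff = now - autoclose_sec (the original computes it in the
// elided part of the function).
std::vector<std::string> sessions_to_autoclose(std::deque<StaleSession> stale,
                                               double now,
                                               double autoclose_sec) {
  assert(autoclose_sec > 0);
  const double cutoff = now - autoclose_sec;
  std::vector<std::string> closed;
  while (!stale.empty()) {
    const StaleSession &s = stale.front();
    if (s.importing)
      break;  // stop at an importing session
    if (s.last_cap_renew >= cutoff)
      break;  // the oldest stale session is sufficiently new
    closed.push_back(s.client);  // stands in for kill_session()
    stale.pop_front();
  }
  return closed;
}
```

With now = 376 and autoclose_sec = 300, a session whose last_cap_renew is 0 falls before the cutoff of 76 and is closed, while one renewed at 200 stops the loop.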

Experiment/reproduce
1. Start two MDS daemons:
590119: 192.168.10.9:6802/8738 'xt1' mds.0.25429 up:active seq 20
590120: 192.168.10.10:6802/10010 'xt2' mds.0.0 up:standby-replay seq 1

2. Mount a directory with ceph-fuse (ceph-fuse -m 192.168.10.9 /mnt/seven/ --cluster xtao).

3. Run an I/O test (dd/fio ...) for about 6 minutes, because mds_session_autoclose is 300s.

4. Kill the active MDS; in my cluster the active MDS is mds.xt1.

5. The mount point is still unavailable, even though the standby MDS has become active and taken over the service.

Let's look at my logs:

xtao-mds.xt2.log
2017-03-30 22:36:45.812142 7f99147d9700 1 mds.0.journal ESession::replay after get_or_add_session last_cap_renew is 2017-03-30 22:36:45.812140 session is client.590301 192.168.10.9:0/626254714 session addr 0x7f992c86b180

// The standby MDS replays the session from the MDS journal and sets the time for last_cap_renew.

After 6 minutes we kill the active MDS, and the standby MDS becomes active.

2017-03-30 22:43:01.955607 7f9918be5700 0 log_channel(cluster) log [INF] : closing stale session client.590301 192.168.10.9:0/626254714 after 376.142601
// This log shows that last_cap_renew of client.590301 is now about 376s in the past, but mds_session_autoclose is 300s, so the server kills the session.
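
The 376s figure can be checked by hand from the two log timestamps above (22:36:45.812140 for the replay and 22:43:01.955607 for the close message). The helper names below are hypothetical and only do the arithmetic; the comparison mirrors the "last_cap_renew >= cutoff" test in find_idle_sessions, assuming the cutoff is the current time minus mds_session_autoclose.

```cpp
#include <cassert>

// Seconds since midnight for an HH:MM:SS.ffffff time of day.
double seconds_of_day(int hours, int minutes, double seconds) {
  return hours * 3600.0 + minutes * 60.0 + seconds;
}

// Age of a session at check time. Both arguments are seconds since midnight
// on the same day, and the check must not precede the renewal.
double session_age(double renewed_at, double checked_at) {
  assert(checked_at >= renewed_at);
  return checked_at - renewed_at;
}

// A session is closed when it was last renewed before the cutoff, i.e. when
// its age is greater than the autoclose interval.
bool exceeds_autoclose(double age_sec, double autoclose_sec) {
  return age_sec > autoclose_sec;
}
```

Calling session_age(seconds_of_day(22, 36, 45.812140), seconds_of_day(22, 43, 1.955607)) gives about 376.14, in line with the logged 376.142601, and exceeds_autoclose of that value against 300 is true.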

Looking at the code, the designer probably intended the client's renew_caps to update last_cap_renew, but if find_idle_sessions runs before the client's renew_caps arrives,
the session is killed and the mount point stops working.
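
The ordering problem described above can be illustrated with a small model. This is only a sketch under stated assumptions: Event, EventKind and session_survives are hypothetical names, time 0 is the moment the standby replayed the session, and the idle check is reduced to "age greater than mds_session_autoclose".

```cpp
#include <cassert>
#include <vector>

enum class EventKind { RenewCaps, IdleCheck };

struct Event {
  EventKind kind;
  double at;  // seconds since the session was replayed on the standby
};

// Processes events in order on the newly active MDS and returns true if the
// session is still alive afterwards.
bool session_survives(const std::vector<Event> &events, double autoclose_sec) {
  assert(autoclose_sec > 0);
  double last_cap_renew = 0.0;  // set once, at journal replay
  for (const Event &e : events) {
    if (e.kind == EventKind::RenewCaps) {
      last_cap_renew = e.at;  // a renewal from the client refreshes the session
    } else if (e.at - last_cap_renew > autoclose_sec) {
      return false;  // the idle check ran before any renewal: session killed
    }
  }
  return true;
}
```

With a 300s autoclose, an IdleCheck at 376s followed by a RenewCaps at 377s kills the session, while the same two events in the opposite order leave it alive, which is the race reported here.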


Related issues

Copied to Backport #19666: jewel: fs: The mount point breaks off when an MDS switch happens. In Progress
Copied to Backport #19667: kraken: fs: The mount point breaks off when an MDS switch happens. New

History

#1 Updated by Nathan Cutler about 2 months ago

  • Tracker changed from Tasks to Support
  • Project changed from Stable releases to fs

#2 Updated by John Spray about 2 months ago

  • Tracker changed from Support to Bug
  • Status changed from New to Need Review
  • Priority changed from Urgent to High
  • Regression set to No
  • Severity set to 3 - minor

#3 Updated by Greg Farnum about 2 months ago

  • Assignee changed from Greg Farnum to John Spray

#4 Updated by John Spray about 1 month ago

  • Status changed from Need Review to Pending Backport
  • Backport set to jewel, kraken

#5 Updated by Nathan Cutler about 1 month ago

  • Copied to Backport #19666: jewel: fs: The mount point breaks off when an MDS switch happens. added

#6 Updated by Nathan Cutler about 1 month ago

  • Copied to Backport #19667: kraken: fs: The mount point breaks off when an MDS switch happens. added
