Bug #19437: fs: The mount point breaks off when an MDS switch happens. - CephFS - Ceph

Bug #19437

closed

fs: The mount point breaks off when an MDS switch happens.

Added by Ivan Guan about 7 years ago. Updated over 6 years ago.

Status: Resolved
Priority: High
Assignee: John Spray
Category:
Target version:
% Done:
Source:
Tags:
Backport: jewel, kraken
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions: Ceph - v10.2.2
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

My Ceph version is Jewel and my cluster has two nodes. I start two MDS daemons: one active and the other in hot-standby mode. The client mounts with ceph-fuse (libcephfs.so).
When the active MDS goes down, the hot standby becomes active as expected, but the mount point unexpectedly stops working.

Analysis:
As we know, the client creates a session for communicating with the server, and we can list it with "ceph daemon mds.x session ls --cluster CLUSTER_NAME".
The hot-standby MDS also replays the session into its in-memory sessionmap, but I found that the session was gone once the hot-standby MDS took over the service,
and "ceph daemon mds.x session ls --cluster CLUSTER_NAME" could no longer list it. So I guessed that something had killed it, and I added some code and logging to the function
find_idle_sessions to verify my guess.

void Server::find_idle_sessions() {
...
  while (1) {
    Session *session = mds->sessionmap.get_oldest_session(Session::STATE_STALE);
    if (!session)
      break;
    if (session->is_importing()) {
      dout(10) << "stopping at importing session " << session->info.inst << dendl;
      break;
    }
    assert(session->is_stale());
    // A session renewed at or after the cutoff is still fresh enough to keep.
    if (session->last_cap_renew >= cutoff) {
      dout(20) << "oldest stale session is " << session->info.inst << " and sufficiently new ("
               << session->last_cap_renew << ")" << dendl;
      break;
    }

    // age = now - last_cap_renew: how long ago the caps were last renewed.
    utime_t age = now;
    age -= session->last_cap_renew;
    mds->clog->info() << "closing stale session " << session->info.inst
                      << " after " << age << "\n";
    dout(10) << "autoclosing stale session " << session->info.inst << " last " << session->last_cap_renew << dendl;
    kill_session(session, NULL);
  }
}

Experiment/reproduce
1. Start two MDS daemons:
590119: 192.168.10.9:6802/8738 'xt1' mds.0.25429 up:active seq 20
590120: 192.168.10.10:6802/10010 'xt2' mds.0.0 up:standby-replay seq 1

2. Mount a directory with ceph-fuse (ceph-fuse -m 192.168.10.9 /mnt/seven/ --cluster xtao).

3. Run an I/O test (dd, fio, ...) for about 6 minutes, because mds_session_autoclose is 300s.

4. Kill the active MDS; in my cluster the active MDS is mds.xt1.

5. The mount point is still unavailable, even though the standby MDS has become active and taken over the service.

Here are my logs:

xtao-mds.xt2.log
2017-03-30 22:36:45.812142 7f99147d9700 1 mds.0.journal ESession::replay after get_or_add_session last_cap_renew is 2017-03-30 22:36:45.812140 session is client.590301 192.168.10.9:0/626254714 session addr 0x7f992c86b180

//The standby MDS replays the session from the MDS journal and sets the time for last_cap_renew.

After 6 minutes, we kill the active MDS, and the standby MDS becomes active.

2017-03-30 22:43:01.955607 7f9918be5700 0 log_channel(cluster) log [INF] : closing stale session client.590301 192.168.10.9:0/626254714 after 376.142601
//This log indicates that last_cap_renew for client.590301 is about 376s in the past, but mds_session_autoclose is 300s, so the server kills the session.

Looking at the code, the designer may have intended the client's renew_caps to update last_cap_renew, but if find_idle_sessions runs before the client's renew_caps arrives,
the session is killed and the mount point stops working.

Related issues 2 (0 open — 2 closed)