CephFS - Bug #18757: Jewel ceph-fuse does not recover after lost connection to MDS

Updated by John Spray (jcspray@gmail.com), 2017-02-01T19:59:49Z (https://tracker.ceph.com/issues/18757?journal_id=85395)
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Rejected</i></li></ul><p>Clients which lose connectivity to the MDS are evicted after a timeout given by the "mds session timeout" setting. Evicting a client may cause the client to get stuck, as it can't necessarily get back into a healthy state.</p>
<p>Presumably you are reproducing this artificially because you encountered it somewhere in practice; the solution in practice is to either increase that interval, or fix whatever is causing the loss of contact in the first place.</p>
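For instance, the timeout can be raised in ceph.conf (a hedged sketch: the default for "mds session timeout" in Jewel-era releases is 60 seconds, and 120 here is an arbitrary example value, not a recommendation):

```ini
[mds]
# How long (in seconds) the MDS waits without hearing from a client
# before marking its session stale and eventually evicting it.
# Default is 60; raising it tolerates longer network blips.
mds session timeout = 120
```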
<p>There are changes coming that will make this functionality a bit friendlier: the client will cope more gracefully with being evicted, and the MDS will only evict clients when they are holding up other clients.</p>

Updated by Henrik Korku (cbugs@kirneh.eu), 2017-02-01T22:30:11Z (https://tracker.ceph.com/issues/18757?journal_id=85467)
<ul></ul><p>Yes, I am reproducing it artificially: a short network blip caused permanent mount-point hangs on multiple servers, and the periodic mount-point stats from our monitoring system then sent the load average of those servers through the roof.</p>
<p>If ceph-fuse cannot handle a session reset and the only way to recover is to manually SIGKILL ceph-fuse and remount, why doesn't ceph-fuse just kill itself? In my opinion, returning an error to the user would be much friendlier than hanging permanently.</p>

Updated by John Spray (jcspray@gmail.com), 2017-02-01T23:14:02Z (https://tracker.ceph.com/issues/18757?journal_id=85496)
<ul><li><strong>Status</strong> changed from <i>Rejected</i> to <i>New</i></li></ul><p>Hmm, now that I have actually read the log (like a reasonable person :-)), it is a little strange that the server is sending the client client_session(stale) messages (not close messages) but then apparently dropping the client's getattrs.</p>
<p>Could you upload the MDS log from the same period, and mark the times at which the connection was interrupted and then restored?</p>

Updated by Henrik Korku (cbugs@kirneh.eu), 2017-02-02T08:25:47Z (https://tracker.ceph.com/issues/18757?journal_id=85512)
<ul><li><strong>File</strong> <a href="/attachments/download/2685/ceph-client.cephfs.log">ceph-client.cephfs.log</a> <a class="icon-only icon-magnifier" title="View" href="/attachments/2685/ceph-client.cephfs.log">View</a> added</li><li><strong>File</strong> <a href="/attachments/download/2686/ceph-mds.henrik-eu2.log.gz">ceph-mds.henrik-eu2.log.gz</a> added</li></ul><p>Attaching client and MDS logs. This time I was mounting from another server and firewalling both input and output to make sure everything was isolated. The final outcome didn't change.</p>

Updated by Henrik Korku (cbugs@kirneh.eu), 2017-02-02T09:53:45Z (https://tracker.ceph.com/issues/18757?journal_id=85529)
<ul></ul><p>Forgot to add a timeline:<br />08:08:29 mounted<br />08:09:54 iptables up<br />08:10:50 ls stuck<br />08:16:02 iptables down<br />08:17:56 tried "ls -f"</p>

Updated by Henrik Korku (cbugs@kirneh.eu), 2017-02-03T12:30:44Z (https://tracker.ceph.com/issues/18757?journal_id=85594)
<ul></ul><p>Comparing the logs I noticed that the MDS clock is ~30s behind the client's; ntpd was dead on one of the test servers... I will try to compensate for that in the notes below.</p>
<p>~08:10:56 the MDS decides the client is stale:<br /><pre>
2017-02-02 08:10:26.538843 7f3b0042f700 10 mds.0.server new stale session client.4151 10.194.0.100:0/4006638222 last 2017-02-02 08:09:21.880119
</pre></p>
<p>~08:14:56 the client session is closed:<br /><pre>
2017-02-02 08:14:26.545466 7f3b0042f700 0 log_channel(cluster) log [INF] : closing stale session client.4151 10.194.0.100:0/4006638222 after 304.665332
2017-02-02 08:14:26.545476 7f3b0042f700 10 mds.0.server autoclosing stale session client.4151 10.194.0.100:0/4006638222 last 2017-02-02 08:09:21.880119
</pre></p>
<p>~08:16:10 a new session is created and RESETSESSION is sent:<br /><pre>
2017-02-02 08:15:41.931194 7f3afdb29700 10 mds.client.cephfs new session 0x55c844270680 for client.4151 10.194.0.100:0/4006638222 con 0x55c8445b8180
<..>
2017-02-02 08:15:41.931248 7f3afdb29700 0 -- 10.194.0.189:6816/31695 >> 10.194.0.100:0/4006638222 pipe(0x55c84441e800 sd=18 :6816 s=0 pgs=0 cs=0 l=0 c=0x55c8445b8180).accept we reset (peer sent cseq 2), sending RESETSESSION
</pre></p>
<p>The client gets the session reset and marks its session as stale:<br /><pre>
2017-02-02 08:16:10.810857 7f8cc4ff9700 0 client.4151 ms_handle_remote_reset on 10.194.0.189:6816/31695
2017-02-02 08:16:10.810869 7f8cc4ff9700 1 client.4151 reset from mds we were open; mark session as stale
</pre></p>
<p>08:16:31 the client requests cap renewal, which gets ignored because its session is closed. This happens multiple times.<br />Client:<br /><pre>
2017-02-02 08:16:31.016380 7f8ccc5ef700 10 -- 10.194.0.100:0/4006638222 >> 10.194.0.189:6816/31695 pipe(0x5616c998ca10 sd=0 :33785 s=2 pgs=7 cs=1 l=0 c=0x5616c998ad60).reader got ack seq 742216328 >= 742216328 on 0x7f8c9c018040 client_session(request_renewcaps seq 26) v1
</pre><br />MDS:<br /><pre>
2017-02-02 08:16:01.932965 7f3b02d35700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 26) v1 from client.4151
2017-02-02 08:16:01.932969 7f3b02d35700 10 mds.0.server ignoring renewcaps on non open|stale session (closed)
</pre></p>
<p>Looking at src/client/Client.cc, I think the "MetaSession::STATE_OPEN" case of the state switch in ms_handle_remote_reset() should reopen the session instead of marking it stale. I could try making a diff that moves the case to sit together with MetaSession::STATE_OPENING, as it looks like it should do the same thing. What do you think? Would a session close/open have other side effects for a fuse client in the open state?</p>

Updated by Henrik Korku (cbugs@kirneh.eu), 2017-02-19T09:50:39Z (https://tracker.ceph.com/issues/18757?journal_id=86318)
<ul></ul><p>I created <a class="external" href="https://github.com/ceph/ceph/pull/13522">https://github.com/ceph/ceph/pull/13522</a></p>
<p>This resolves the hang and allows working with the mountpoint in this test case. I am just not sure whether this is the proper way to do a reconnect, or whether there are any additional side effects.</p>

Updated by Zheng Yan (ukernel@gmail.com), 2017-02-23T09:42:28Z (https://tracker.ceph.com/issues/18757?journal_id=86618)
<ul></ul><p>You can use 'ceph daemon client.xxx kick_stale_sessions' to recover from this issue. Maybe we should add a config option to decide whether automatic reconnect is desired.</p>

Updated by Henrik Korku (cbugs@kirneh.eu), 2017-02-27T07:56:39Z (https://tracker.ceph.com/issues/18757?journal_id=86797)
<ul></ul><p>I updated the PR to do _closed_mds_session(s).</p>
<p>As for the config option, I would expect the client to reconnect automagically after a connection loss (it is supposed to be self-healing, after all).</p>
<p>On the other hand, I can add a config option too if that is considered a better way to handle this change.</p>

Updated by John Spray (jcspray@gmail.com), 2017-04-15T18:45:27Z (https://tracker.ceph.com/issues/18757?journal_id=89474)
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Pending Backport</i></li><li><strong>Backport</strong> set to <i>jewel, kraken</i></li></ul><p>Let's backport this for the benefit of people running CephFS today.</p>

Updated by Nathan Cutler (ncutler@suse.cz), 2017-04-18T19:38:45Z (https://tracker.ceph.com/issues/18757?journal_id=89647)
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/19677">Backport #19677</a>: jewel: Jewel ceph-fuse does not recover after lost connection to MDS</i> added</li></ul>

Updated by Nathan Cutler (ncutler@suse.cz), 2017-04-18T19:38:47Z (https://tracker.ceph.com/issues/18757?journal_id=89649)
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/19678">Backport #19678</a>: kraken: Jewel ceph-fuse does not recover after lost connection to MDS</i> added</li></ul>

Updated by Nathan Cutler (ncutler@suse.cz), 2017-07-19T21:00:41Z (https://tracker.ceph.com/issues/18757?journal_id=95473)
<ul><li><strong>Status</strong> changed from <i>Pending Backport</i> to <i>Resolved</i></li></ul>