Bug #40340

closed

kernel client stuck at opening forever after network outage.

Added by Xiaoxi Chen almost 5 years ago. Updated over 3 years ago.

Status: Can't reproduce
Priority: Normal
Assignee:
Category: fs/ceph
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

We don't retry (in the ceph code) sending the message built by create_session_open_msg. So once a network outage happens and the MDS evicts the client, the client tries to reconnect and goes through

static int __open_session(struct ceph_mds_client *mdsc,
              struct ceph_mds_session *session)

However, if the TCP connection cannot be established after tcp_syn_retries attempts, the connection attempt is abandoned and the session's s_state is left at CEPH_MDS_SESSION_OPENING forever. We don't retry after that.
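For a rough sense of how long the TCP layer keeps trying before it gives up, here is a small sketch. It assumes the usual Linux behaviour of an initial SYN timeout of 1 second that doubles after each attempt, a default tcp_syn_retries of 6, and a 120 second cap on the timeout. The helper name syn_giveup_seconds is hypothetical and is not part of the kernel.

```c
/* Hypothetical helper, not a kernel function: estimates how long a connect
 * attempt keeps trying before giving up, assuming the SYN timeout doubles
 * after each attempt and is capped at max_rto_secs. */
static unsigned long syn_giveup_seconds(unsigned int syn_retries,
                                        unsigned long initial_rto_secs,
                                        unsigned long max_rto_secs)
{
    unsigned long total = 0;
    unsigned long rto = initial_rto_secs;
    unsigned int attempt;

    /* One initial SYN plus syn_retries retransmissions, each followed by a wait. */
    for (attempt = 0; attempt <= syn_retries; attempt++) {
        total += rto;
        rto = (rto * 2 > max_rto_secs) ? max_rto_secs : rto * 2;
    }
    return total;
}
```

Under these assumptions, syn_giveup_seconds(6, 1, 120) evaluates to 127, so an outage lasting longer than about two minutes is enough to leave the session stuck as described.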

The only way out is to remount, or to fail over the MDS.

Could we check in __do_request whether the session has been stuck in opening for long enough, e.g. mdsc->mdsmap->m_session_timeout >> 2, and if so call __open_session again?

current code

if (session->s_state == CEPH_MDS_SESSION_NEW ||
    session->s_state == CEPH_MDS_SESSION_CLOSING)
        __open_session(mdsc, session);

new code

/* retry the open if it has been pending for a quarter of the session timeout */
renew_interval = mdsc->mdsmap->m_session_timeout >> 2;
if (session->s_state == CEPH_MDS_SESSION_NEW ||
    session->s_state == CEPH_MDS_SESSION_CLOSING ||
    (session->s_state == CEPH_MDS_SESSION_OPENING &&
     time_after_eq(jiffies, session->s_renew_requested + HZ * renew_interval)))
        __open_session(mdsc, session);
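The proposed condition can be tried outside the kernel with a minimal userspace sketch. The types and names below (struct session, SESSION_*, ticks_after_eq, should_reopen) are hypothetical stand-ins, not the real ceph definitions. The tick comparison only imitates the wraparound-safe idea behind time_after_eq, and the timeout is assumed to be in seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel types; not the real definitions. */
enum session_state {
    SESSION_NEW,
    SESSION_OPENING,
    SESSION_OPEN,
    SESSION_CLOSING,
};

struct session {
    enum session_state s_state;
    unsigned long s_renew_requested; /* tick count when the open was last sent */
};

/* Wraparound-safe "a >= b" for tick counters, in the spirit of time_after_eq().
 * Like the kernel macro, it relies on the signed conversion wrapping. */
static bool ticks_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/* Returns true when the caller should (re)send the session open request. */
static bool should_reopen(const struct session *s, unsigned long now,
                          unsigned long session_timeout_secs, unsigned long hz)
{
    /* A quarter of the session timeout, as proposed in the report. */
    unsigned long renew_interval = session_timeout_secs >> 2;

    assert(s != NULL);
    if (s->s_state == SESSION_NEW || s->s_state == SESSION_CLOSING)
        return true;
    /* Explicit grouping: the time check applies only to the OPENING state. */
    return s->s_state == SESSION_OPENING &&
           ticks_after_eq(now, s->s_renew_requested + hz * renew_interval);
}
```

With a 60 second timeout and 100 ticks per second, a session that entered opening at tick 1000 becomes eligible for a retry at tick 2500, and not before.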
