Bug #40340
closed
kernel client stuck at opening forever after network outage
% Done: 0%
Regression: No
Severity: 3 - minor
Description
The ceph code does not retry sending the session-open message (create_session_open_msg). So once a network outage happens and the MDS evicts the client, the client tries to reconnect and goes through:
    static int __open_session(struct ceph_mds_client *mdsc,
                              struct ceph_mds_session *session)
However, if the TCP connection cannot be established after tcp_syn_retries attempts, the connection attempt is given up, leaving the session state at CEPH_MDS_SESSION_OPENING forever. We don't retry anymore.
The only way out is to remount, or to fail over the MDS.
Can we check in __do_request whether the session has been stuck in opening for long enough, e.g. mdsc->mdsmap->m_session_timeout >> 2, and if so call __open_session again?
Current code:
    if (session->s_state == CEPH_MDS_SESSION_NEW ||
        session->s_state == CEPH_MDS_SESSION_CLOSING)
            __open_session(mdsc, session);
Proposed code (parentheses added around the && clause to make the precedence explicit; the behavior is unchanged):
    renew_interval = mdsc->mdsmap->m_session_timeout >> 2;
    if (session->s_state == CEPH_MDS_SESSION_NEW ||
        session->s_state == CEPH_MDS_SESSION_CLOSING ||
        (session->s_state == CEPH_MDS_SESSION_OPENING &&
         time_after_eq(jiffies, HZ * renew_interval + session->s_renew_requested)))
            __open_session(mdsc, session);
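The retry condition can be tried out in user space with a small sketch. This is not kernel code. The enum, the HZ value, should_reopen() and time_after_eq_ul() are hypothetical stand-ins, and it assumes s_renew_requested holds the jiffies value recorded when the open was last attempted.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel tick rate; the real HZ is set at build time. */
#define HZ 250UL

/* Hypothetical stand-ins for the CEPH_MDS_SESSION_* states used in the report. */
enum session_state {
    SESSION_NEW,
    SESSION_OPENING,
    SESSION_OPEN,
    SESSION_CLOSING
};

/* Wraparound-safe "a is at or after b", in the spirit of the kernel's time_after_eq(). */
static bool time_after_eq_ul(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/*
 * Decide whether the session open should be (re)issued.
 * now and renew_requested are tick counts; session_timeout_secs is in seconds.
 */
static bool should_reopen(enum session_state state, unsigned long now,
                          unsigned long renew_requested,
                          unsigned long session_timeout_secs)
{
    /* A quarter of the session timeout, as proposed in the report. */
    unsigned long renew_interval = session_timeout_secs >> 2;

    if (state == SESSION_NEW || state == SESSION_CLOSING)
        return true;

    /* Retry a stuck OPENING session only once the interval has elapsed. */
    return state == SESSION_OPENING &&
           time_after_eq_ul(now, renew_requested + HZ * renew_interval);
}
```

With a session timeout of 60 seconds (an illustrative value), the interval is 15 seconds, so an OPENING session is retried only once at least 15 * HZ ticks have passed since the last attempt. A session that is already open is never reopened by this check.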