Bug #65647

open

Evicted kernel client may get stuck after reconnect

Added by Mykola Golub 25 days ago. Updated 3 days ago.

Status: Fix Under Review
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source:
Tags:
Backport: reef,squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Our customer was seeing sporadic "client isn't responding to mclientcaps(revoke)" issues, so they configured automatic eviction without blocklisting:

mds  advanced  mds_cap_revoke_eviction_timeout     600.000000
mds  advanced  mds_session_blocklist_on_evict      false
mds  advanced  mds_session_blocklist_on_timeout    false

This works most of the time: after eviction the client opens a new connection and the k8s pods can bind to the volume successfully. But sometimes the client gets stuck: the pods fail to bind to the volume until the host is rebooted, although on the mds side the session is shown in the "open" state and looks merely idle. Manually evicting such a session (ceph tell mds.0 session kill id) helps, and the pods can bind to the volume again.

In the mds log, these sporadic cases always come with a "denied reconnect attempt" message, as below:

Apr 08 14:35:03 ceph04 ceph-mds[48012]: mds.0.server evict_cap_revoke_non_responders: evicting cap revoke non-responder client id 19707773
Apr 08 14:35:03 ceph04 ceph-mds[48012]: mds.0.63169 Evicting client session 19707773 (v1:192.168.100.45:0/4254479370)
Apr 08 14:35:03 ceph04 ceph-mds[48012]: log_channel(cluster) log [INF] : Evicting client session 19707773 (v1:192.168.100.45:0/4254479370)
Apr 08 14:35:10 ceph04 ceph-mds[48012]: mds.0.server no longer in reconnect state, ignoring reconnect, sending close
Apr 08 14:35:10 ceph04 ceph-mds[48012]: log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.19707773 v1:192.168.100.45:0/4254479370 after 1.68543e+07 (allowed interval 45)

Note that these are just a few lines from the log, for illustration. The full log is attached.
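
To look for the same pattern in other logs, a small Python sketch like the following may help. It scans log lines for a cap-revoke eviction followed later by a "denied reconnect attempt" from the same client id. The function name and the regular expressions are illustrative only; they are derived from the message wording in the excerpt above and may need adjusting for other Ceph versions.

```python
import re

# Patterns taken from the wording of the log excerpt in this report.
EVICT_RE = re.compile(r"evicting cap revoke non-responder client id (\d+)")
DENIED_RE = re.compile(
    r"denied reconnect attempt \(mds is ([^)]+)\) from client\.(\d+)"
)


def denied_after_evict(lines):
    """Return (client_id, mds_state) for each denied reconnect that
    follows a cap-revoke eviction of the same client id."""
    evicted = set()
    hits = []
    for line in lines:
        m = EVICT_RE.search(line)
        if m:
            evicted.add(m.group(1))
            continue
        m = DENIED_RE.search(line)
        if m and m.group(2) in evicted:
            hits.append((m.group(2), m.group(1)))
    return hits
```

Usage: `denied_after_evict(open("mds.log"))`, where `mds.log` is a placeholder path for the saved mds journal output.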

Normally, when a client reconnects successfully, we see only "no longer in reconnect state, ignoring reconnect, sending close" messages, which according to the code are logged when the client sends a "reconnect" message while the session is in the closed state.

But the "denied reconnect attempt" message seems to be possible only when the session is in the open state. My interpretation (which may be incorrect) is the following. Normally (when it works) the sequence looks like this:

1) on eviction, the mds closes the session
2) the client notices this and sends a "reconnect" message
3) the mds sees the session in the closed state and "ignores reconnect, sending close"
4) the client sends "open" and the session becomes "open" again

In the problematic case, it looks to me as if after (4) the mds receives the "reconnect" message from (2) again (a resent duplicate?), and this is what makes the client get stuck.
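
The interpretation above can be written down as a toy state machine. This Python sketch models only the hypothesis described in this report, not the actual mds implementation; the event names, the `replay` function, and the two-state session are simplifications invented for illustration.

```python
from enum import Enum


class SessionState(Enum):
    OPEN = "open"
    CLOSED = "closed"


def replay(events):
    """Replay a sequence of hypothetical events against one session.

    Returns the final session state and a note per event describing
    what the mds is assumed (per the interpretation above) to do.
    """
    state = SessionState.OPEN
    notes = []
    for event in events:
        if event == "evict":
            state = SessionState.CLOSED
            notes.append("session closed")
        elif event == "reconnect":
            if state is SessionState.CLOSED:
                notes.append("ignoring reconnect, sending close")
            else:
                # Hypothesized bad case: reconnect hits an open session.
                notes.append("denied reconnect attempt")
        elif event == "open":
            state = SessionState.OPEN
            notes.append("session open")
        else:
            raise ValueError(f"unknown event: {event}")
    return state, notes


# Steps 1-4 of the normal sequence, then the suspected duplicate.
normal = ["evict", "reconnect", "open"]
problematic = normal + ["reconnect"]
```

In this model, `replay(problematic)` ends with the session still in the open state on the mds side, which matches the observation that the stuck session looks open and idle. What happens on the client after the denied reconnect is outside the model.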

The client kernel version:

"kernel_version": "5.15.0-92-generic",

The mds version:

"ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)"
