Support #58635

open

MDS blacklists clients causing hang at mountpoint

Added by siva p about 1 year ago. Updated about 1 year ago.

Status: New
Priority: Normal
Assignee:
Category: fs/ceph
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:

Description

Hello,
We are running the Ceph server on RHEL with kernel 4.18.0-348.el8.x86_64 and
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable).

There are around 5 clients connected, which run kernel 5.10.
I observe an intermittent mount issue: the mount hangs for more than 10 minutes and then connects back. During this time I noted that the particular client has been blacklisted by the server.
I would like to know what could cause the server to blacklist a particular client, and whether there is a recovery timeout that makes it hang for more than 10 minutes.
Also, is there a way to reduce that timeout?
Please let me know if any further info is needed.
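
For reference, a minimal sketch of the settings that seem relevant here, assuming a default Pacific configuration and the Pacific-era "blocklist" option names; the filesystem name "cephfs" is only a placeholder:

  ceph fs get cephfs | grep session                      # session_timeout / session_autoclose for this filesystem
  ceph config get mon mon_osd_blocklist_default_expire   # how long an evicted client stays blocklisted (default 1 hour)
  ceph osd blocklist ls                                   # client addresses currently blocklisted
  ceph osd blocklist rm <addr>                            # manually remove a client from the blocklist

On recent kernels the mount option recover_session=clean lets the kernel client rebuild its session automatically after being blocklisted, which may shorten the observed hang; this is a guess based on the symptoms, not a confirmed diagnosis.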

Logs from the hung client below:
[12277.120315] ceph: mds0 caps stale
[12513.987362] ceph: mds0 hung
[13157.516794] libceph: mds0 (1)[fd0d:ed01:fffc:34f2:10:41:208:131]:6801 socket closed (con state OPEN)
[13157.996991] libceph: mds0 (1)[fd0d:ed01:fffc:34f2:10:41:208:131]:6801 connection reset
[13158.000466] libceph: reset on mds0
[13158.000654] ceph: mds0 closed our session
[13158.000664] ceph: mds0 reconnect start
[13158.004275] ceph: mds0 reconnect denied
[13158.005383] libceph: mds0 (1)[fd0d:ed01:fffc:34f2:10:41:208:131]:6801 socket closed (con state NEGOTIATING)
[13159.021372] ceph: mds0 rejected session
[13164.174089] libceph: mon0 (1)[fd0d:ed01:fffc:34f2:10:41:208:131]:6789 session established
[13164.174281] libceph: mon0 (1)[fd0d:ed01:fffc:34f2:10:41:208:131]:6789 socket closed (con state OPEN)
[13164.174448] libceph: mon0 (1)[fd0d:ed01:fffc:34f2:10:41:208:131]:6789 session lost, hunting for new mon
[13164.176660] libceph: mon2 (1)[fd0d:ed01:fffc:34f2:10:41:208:133]:6789 session established
[13164.177965] libceph: client229404 fsid 97a30083-08fd-4069-aa8c-47508a4e9c01

Actions #1

Updated by Ilya Dryomov about 1 year ago

  • Assignee set to Xiubo Li
Actions #2

Updated by Xiubo Li about 1 year ago

BTW, do you have the MDS-side logs? Usually, if the client cannot respond to the MDS for some reason, such as a caps revocation being stuck for a long time, the MDS will blocklist the client.

We need to know what happened on both the client and MDS sides, and why the client was added to the blocklist.
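
As a rough starting point (not an official triage procedure), the MDS session and cap state can be inspected with something like the commands below; "mds.0" assumes rank-0 addressing:

  ceph health detail                  # warnings such as MDS_CLIENT_LATE_RELEASE point at stuck cap revocations
  ceph tell mds.0 session ls          # per-client session state and cap counts
  ceph tell mds.0 dump_ops_in_flight  # requests that may be blocked behind an unresponsive client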

Actions #3

Updated by siva p about 1 year ago

2023-02-24T22:54:56.528+0000 7fa6567ba700 0 log_channel(cluster) log [WRN] : evicting unresponsive client zcu102-zynqmp:gateway (231426), after 302.21 seconds
2023-02-24T22:54:56.528+0000 7fa6567ba700 1 mds.0.3493 Evicting (and blocklisting) client session 231426 (v1:[fd0d:ed01:fffc:34f2:a5a5:11f:fe19:a5a5]:0/2933979325)
2023-02-24T22:54:56.528+0000 7fa6567ba700 0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 231426 (v1:[fd0d:ed01:fffc:34f2:a5a5:11f:fe19:a5a5]:0/2933979325)
2023-02-24T23:05:38.095+0000 7fa65afc3700 0 --1- [v2:[fd0d:ed01:fffc:34f2:10:41:208:131]:6800/4186446736,v1:[fd0d:ed01:fffc:34f2:10:41:208:131]:6801/4186446736] >> v1:[fd0d:ed01:fffc:34f2:a5a5:11f:fe19:a5a5]:0/2933979325 conn(0x55b719b16400 0x55b70d3fc800 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION
2023-02-24T23:05:38.101+0000 7fa657fbd700 0 mds.0.server ignoring msg from not-open sessionclient_reconnect(1 caps 2 realms ) v3
2023-02-24T23:05:38.101+0000 7fa65afc3700 0 --1- [v2:[fd0d:ed01:fffc:34f2:10:41:208:131]:6800/4186446736,v1:[fd0d:ed01:fffc:34f2:10:41:208:131]:6801/4186446736] >> v1:[fd0d:ed01:fffc:34f2:a5a5:11f:fe19:a5a5]:0/2933979325 conn(0x55b71ff52400 0x55b719b9c800 :6801 s=OPENED pgs=12 cs=1 l=0).fault server, going to standby
2023-02-24T23:05:38.102+0000 7fa65a7c2700 0 --1- [v2:[fd0d:ed01:fffc:34f2:10:41:208:131]:6800/4186446736,v1:[fd0d:ed01:fffc:34f2:10:41:208:131]:6801/4186446736] >> v1:[fd0d:ed01:fffc:34f2:a5a5:11f:fe19:a5a5]:0/2933979325 conn(0x55b717c2d800 0x55b718b9d000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-02-24T23:15:16.554+0000 7fa6567ba700 0 log_channel(cluster) log [WRN] : evicting unresponsive client j3sgi3vmobm01:gateway (228504), after 309.287 seconds
2023-02-24T23:15:16.554+0000 7fa6567ba700 1 mds.0.3493 Evicting (and blocklisting) client session 228504 ([fd0d:ed01:fffc:34f2:10:41:208:241]:0/1618805431)
2023-02-24T23:15:16.554+0000 7fa6567ba700 0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 228504 ([fd0d:ed01:fffc:34f2:10:41:208:241]:0/1618805431)

Actions #4

Updated by siva p about 1 year ago

Sorry, it took a while to replicate the issue. This is what I see in the MDS log when this happened.

Actions #5

Updated by siva p about 1 year ago

Also, in the client log I provided before, I want to add 2 log lines which I did not include (the first two lines below):

[12239.232075] libceph: mon1 (1)[fd0d:ed01:fffc:34f2:10:41:208:132]:6789 session lost, hunting for new mon
[12239.233892] libceph: mon0 (1)[fd0d:ed01:fffc:34f2:10:41:208:131]:6789 session established

[12277.120315] ceph: mds0 caps stale
[12513.987362] ceph: mds0 hung

I wonder why the session was lost in this case and why the client tries to establish a new session. As far as I know, the Ceph storage never went down, so the client should not have lost the connection.

Actions #6

Updated by Xiubo Li about 1 year ago

siva p wrote:

2023-02-24T22:54:56.528+0000 7fa6567ba700 0 log_channel(cluster) log [WRN] : evicting unresponsive client zcu102-zynqmp:gateway (231426), after 302.21 seconds
2023-02-24T22:54:56.528+0000 7fa6567ba700 1 mds.0.3493 Evicting (and blocklisting) client session 231426 (v1:[fd0d:ed01:fffc:34f2:a5a5:11f:fe19:a5a5]:0/2933979325)
2023-02-24T22:54:56.528+0000 7fa6567ba700 0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 231426 (v1:[fd0d:ed01:fffc:34f2:a5a5:11f:fe19:a5a5]:0/2933979325)
2023-02-24T23:05:38.095+0000 7fa65afc3700 0 --1- [v2:[fd0d:ed01:fffc:34f2:10:41:208:131]:6800/4186446736,v1:[fd0d:ed01:fffc:34f2:10:41:208:131]:6801/4186446736] >> v1:[fd0d:ed01:fffc:34f2:a5a5:11f:fe19:a5a5]:0/2933979325 conn(0x55b719b16400 0x55b70d3fc800 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION
2023-02-24T23:05:38.101+0000 7fa657fbd700 0 mds.0.server ignoring msg from not-open sessionclient_reconnect(1 caps 2 realms ) v3
2023-02-24T23:05:38.101+0000 7fa65afc3700 0 --1- [v2:[fd0d:ed01:fffc:34f2:10:41:208:131]:6800/4186446736,v1:[fd0d:ed01:fffc:34f2:10:41:208:131]:6801/4186446736] >> v1:[fd0d:ed01:fffc:34f2:a5a5:11f:fe19:a5a5]:0/2933979325 conn(0x55b71ff52400 0x55b719b9c800 :6801 s=OPENED pgs=12 cs=1 l=0).fault server, going to standby
2023-02-24T23:05:38.102+0000 7fa65a7c2700 0 --1- [v2:[fd0d:ed01:fffc:34f2:10:41:208:131]:6800/4186446736,v1:[fd0d:ed01:fffc:34f2:10:41:208:131]:6801/4186446736] >> v1:[fd0d:ed01:fffc:34f2:a5a5:11f:fe19:a5a5]:0/2933979325 conn(0x55b717c2d800 0x55b718b9d000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2023-02-24T23:15:16.554+0000 7fa6567ba700 0 log_channel(cluster) log [WRN] : evicting unresponsive client j3sgi3vmobm01:gateway (228504), after 309.287 seconds
2023-02-24T23:15:16.554+0000 7fa6567ba700 1 mds.0.3493 Evicting (and blocklisting) client session 228504 ([fd0d:ed01:fffc:34f2:10:41:208:241]:0/1618805431)
2023-02-24T23:15:16.554+0000 7fa6567ba700 0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 228504 ([fd0d:ed01:fffc:34f2:10:41:208:241]:0/1618805431)

These logs just say the client was unresponsive, but we still can't tell the reason from them. Please upload the full MDS logs.

Do you have the dmesg logs from the kclient? Are there any other errors or warnings?
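
If it helps, one possible way to capture more detail for the next occurrence (assuming centralized config is in use; adjust via ceph.conf otherwise):

  ceph config set mds debug_mds 20     # verbose MDS logging
  ceph config set mds debug_ms 1       # messenger-level logging on the MDS
  dmesg -T | grep -iE 'ceph|libceph'   # kclient kernel messages on the affected host

Debug level 20 is very verbose, so it is worth lowering it again once the hang has been reproduced.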

Actions #7

Updated by Xiubo Li about 1 year ago

siva p wrote:

Also, in the client log I provided before, I want to add 2 log lines which I did not include (the first two lines below):

[12239.232075] libceph: mon1 (1)[fd0d:ed01:fffc:34f2:10:41:208:132]:6789 session lost, hunting for new mon
[12239.233892] libceph: mon0 (1)[fd0d:ed01:fffc:34f2:10:41:208:131]:6789 session established

[12277.120315] ceph: mds0 caps stale
[12513.987362] ceph: mds0 hung

I wonder why the session was lost in this case and why the client tries to establish a new session. As far as I know, the Ceph storage never went down, so the client should not have lost the connection.

That usually means mon1 closed the session for some reason, such as the monitor going down or the network being disconnected.
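
A quick way to check that theory (generic commands, not specific to this cluster) is to look at the monitor state and logs around the timestamps in question:

  ceph -s              # overall cluster and quorum state
  ceph quorum_status   # which monitors are currently in quorum
  ceph mon stat        # monitor ranks and addresses

and then to check the mon1 log (typically /var/log/ceph/ceph-mon.<id>.log) and the client-side network for drops at the same time.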
