Bug #51410
kclient: fails to finish reconnect during MDS thrashing (testing branch)
Status: New
Priority: High
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS, kceph
Labels (FS): qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2021-06-26T03:53:10.735 INFO:tasks.mds_thrash.fs.[cephfs]:kill mds.a (rank=4)
...
2021-06-26T03:53:29.940 INFO:tasks.mds_thrash.fs.[cephfs]:mds map: {
...
2021-06-26T03:55:00.942 INFO:tasks.workunit.client.0.smithi075.stdout:Wrote -1 instead of 4096 bytes.
2021-06-26T03:55:00.942 INFO:tasks.workunit.client.0.smithi075.stdout:Probably out of disk space
2021-06-26T03:55:00.943 INFO:tasks.workunit.client.0.smithi075.stderr:write: Input/output error
From: /ceph/teuthology-archive/pdonnell-2021-06-26_00:57:00-fs-wip-pdonnell-testing-20210625.225421-distro-basic-smithi/6193988/teuthology.log
mds.e took over for rank 4. We see this in the logs:
2021-06-26T03:53:34.065+0000 7f8a8801b700 1 --1- [v2:172.21.15.75:6834/4253739342,v1:172.21.15.75:6836/4253739342] >> conn(0x557fd4f9e000 0x557fd4e25800 :6836 s=ACCEPTING pgs=0 cs=0 l=0).send_server_banner sd=33 legacy v1:172.21.15.75:6836/4253739342 socket_addr v1:172.21.15.75:6836/4253739342 target_addr v1:192.168.0.1:49634/0
2021-06-26T03:53:34.065+0000 7f8a8801b700 10 mds.e parse_caps: parsing auth_cap_str='allow'
2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e existing session 0x557fd4fc8500 for client.4764 v1:192.168.0.1:0/499264774 existing con 0, new/authorizing con 0x557fd4f9e000
2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e parse_caps: parsing auth_cap_str='allow'
2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e ms_handle_accept v1:192.168.0.1:0/499264774 con 0x557fd4f9e000 session 0x557fd4fc8500
2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e session connection 0 -> 0x557fd4f9e000
2021-06-26T03:53:34.065+0000 7f8a85015700 1 -- [v2:172.21.15.75:6834/4253739342,v1:172.21.15.75:6836/4253739342] <== client.4764 v1:192.168.0.1:0/499264774 1 ==== client_session(request_open) v4 ==== 169+0+0 (unknown 1896856091 0 0) 0x557fd4128a80 con 0x557fd4f9e000
...
2021-06-26T03:53:38.005+0000 7f8a85015700 1 -- [v2:172.21.15.75:6834/4253739342,v1:172.21.15.75:6836/4253739342] <== client.4764 v1:192.168.0.1:0/499264774 2 ==== client_caps(update ino 0x2000000022c 5 seq 0 tid 1806 caps=pAsLsXsFscb dirty=Fw wanted=- follows 1 mseq 1 size 104857600/0 mtime 2021-06-26T03:53:02.937619+0000 xattrs(v=18446744071724651093 l=0)) v10 ==== 236+0+0 (unknown 128463977 0 0) 0x557fd4f99200 con 0x557fd4f9e000
...
2021-06-26T03:55:00.100+0000 7f8a83812700 7 mds.4.server reconnect timed out, 1 clients have not reconnected in time
2021-06-26T03:55:00.100+0000 7f8a83812700 1 mds.4.server reconnect gives up on client.4764 v1:192.168.0.1:0/499264774
2021-06-26T03:55:00.100+0000 7f8a83812700 0 log_channel(cluster) log [WRN] : evicting unresponsive client (4764), after waiting 65.4709 seconds during MDS startup
2021-06-26T03:55:00.100+0000 7f8a83812700 1 mds.4.29 Evicting (and blocklisting) client session 4764 (v1:192.168.0.1:0/499264774)
2021-06-26T03:55:00.100+0000 7f8a83812700 0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 4764 (v1:192.168.0.1:0/499264774)
2021-06-26T03:55:00.100+0000 7f8a83812700 4 mds.4.29 Preparing blocklist command... (wait=0)
2021-06-26T03:55:00.100+0000 7f8a83812700 4 mds.4.29 Sending mon blocklist command: {"prefix":"osd blocklist", "blocklistop":"add","addr":"v1:192.168.0.1:0/499264774"}
From: /ceph/teuthology-archive/pdonnell-2021-06-26_00:57:00-fs-wip-pdonnell-testing-20210625.225421-distro-basic-smithi/6193988/remote/smithi075/log/ceph-mds.e.log.gz
It is not clear which step of the reconnect the client failed to complete.
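One way to narrow this down might be to list every message the MDS received from client.4764 between the takeover and the eviction, and see which message types are present or missing. The following is only a sketch: the helper names (open_log, incoming_messages, summarize) are hypothetical, and the regular expression assumes that receive lines look like the "<== client.N ... ==== msg_type(...)" lines quoted above.

```python
import gzip
import re
from collections import Counter

# Assumed shape of a messenger receive line, based on the excerpt in this report:
#   <timestamp> <thread> <level> -- [addrs] <== client.N <addr> <seq> ==== <msg_type>(<detail> ...
INCOMING = re.compile(
    r"^(?P<ts>\S+) .*? <== (?P<client>client\.\d+) \S+ (?P<seq>\d+)"
    r" ==== (?P<msg>\w+)\((?P<detail>[^\s)]*)"
)


def open_log(path):
    """Open a plain or gzip-compressed log file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")


def incoming_messages(lines, client):
    """Yield (timestamp, seq, message type, detail) for messages received from `client`."""
    for line in lines:
        m = INCOMING.match(line)
        if m and m.group("client") == client:
            yield m.group("ts"), int(m.group("seq")), m.group("msg"), m.group("detail")


def summarize(lines, client):
    """Count received messages from `client` by message type."""
    return Counter(msg for _, _, msg, _ in incoming_messages(lines, client))


if __name__ == "__main__":
    # Hypothetical local copy of the MDS log referenced in this report.
    with open_log("ceph-mds.e.log.gz") as f:
        for ts, seq, msg, detail in incoming_messages(f, "client.4764"):
            print(ts, seq, msg, detail)
```

Run against the excerpt above, this would list only the client_session(request_open) and client_caps(update) messages; whether other messages from the client appear in the elided parts of the log is exactly what the full output would show.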