Project

General

Profile

Actions

Bug #51410

open

kclient: fails to finish reconnect during MDS thrashing (testing branch)

Added by Patrick Donnelly almost 3 years ago. Updated almost 2 years ago.

Status:
New
Priority:
High
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS, kceph
Labels (FS):
qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-06-26T03:53:10.735 INFO:tasks.mds_thrash.fs.[cephfs]:kill mds.a (rank=4)
...
2021-06-26T03:53:29.940 INFO:tasks.mds_thrash.fs.[cephfs]:mds map: {
...
2021-06-26T03:55:00.942 INFO:tasks.workunit.client.0.smithi075.stdout:Wrote -1 instead of 4096 bytes.
2021-06-26T03:55:00.942 INFO:tasks.workunit.client.0.smithi075.stdout:Probably out of disk space
2021-06-26T03:55:00.943 INFO:tasks.workunit.client.0.smithi075.stderr:write: Input/output error

From: /ceph/teuthology-archive/pdonnell-2021-06-26_00:57:00-fs-wip-pdonnell-testing-20210625.225421-distro-basic-smithi/6193988/teuthology.log

mds.e took over for rank 4. We see this in the logs:

2021-06-26T03:53:34.065+0000 7f8a8801b700  1 --1- [v2:172.21.15.75:6834/4253739342,v1:172.21.15.75:6836/4253739342] >>  conn(0x557fd4f9e000 0x557fd4e25800 :6836 s=ACCEPTING pgs=0 cs=0 l=0).send_server_banner sd=33 legacy v1:172.21.15.75:6836/4253739342 socket_addr v1:172.21.15.75:6836/4253739342 target_addr v1:192.168.0.1:49634/0
2021-06-26T03:53:34.065+0000 7f8a8801b700 10 mds.e parse_caps: parsing auth_cap_str='allow'
2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e  existing session 0x557fd4fc8500 for client.4764 v1:192.168.0.1:0/499264774 existing con 0, new/authorizing con 0x557fd4f9e000
2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e parse_caps: parsing auth_cap_str='allow'
2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e ms_handle_accept v1:192.168.0.1:0/499264774 con 0x557fd4f9e000 session 0x557fd4fc8500
2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e  session connection 0 -> 0x557fd4f9e000
2021-06-26T03:53:34.065+0000 7f8a85015700  1 -- [v2:172.21.15.75:6834/4253739342,v1:172.21.15.75:6836/4253739342] <== client.4764 v1:192.168.0.1:0/499264774 1 ==== client_session(request_open) v4 ==== 169+0+0 (unknown 1896856091 0 0) 0x557fd4128a80 con 0x557fd4f9e000
...
2021-06-26T03:53:38.005+0000 7f8a85015700  1 -- [v2:172.21.15.75:6834/4253739342,v1:172.21.15.75:6836/4253739342] <== client.4764 v1:192.168.0.1:0/499264774 2 ==== client_caps(update ino 0x2000000022c 5 seq 0 tid 1806 caps=pAsLsXsFscb dirty=Fw wanted=- follows 1 mseq 1 size 104857600/0 mtime 2021-06-26T03:53:02.937619+0000 xattrs(v=18446744071724651093 l=0)) v10 ==== 236+0+0 (unknown 128463977 0 0) 0x557fd4f99200 con 0x557fd4f9e000
...
2021-06-26T03:55:00.100+0000 7f8a83812700  7 mds.4.server reconnect timed out, 1 clients have not reconnected in time
2021-06-26T03:55:00.100+0000 7f8a83812700  1 mds.4.server reconnect gives up on client.4764 v1:192.168.0.1:0/499264774
2021-06-26T03:55:00.100+0000 7f8a83812700  0 log_channel(cluster) log [WRN] : evicting unresponsive client  (4764), after waiting 65.4709 seconds during MDS startup
2021-06-26T03:55:00.100+0000 7f8a83812700  1 mds.4.29 Evicting (and blocklisting) client session 4764 (v1:192.168.0.1:0/499264774)
2021-06-26T03:55:00.100+0000 7f8a83812700  0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 4764 (v1:192.168.0.1:0/499264774)
2021-06-26T03:55:00.100+0000 7f8a83812700  4 mds.4.29 Preparing blocklist command... (wait=0)
2021-06-26T03:55:00.100+0000 7f8a83812700  4 mds.4.29 Sending mon blocklist command: {"prefix":"osd blocklist", "blocklistop":"add","addr":"v1:192.168.0.1:0/499264774"}

From: /ceph/teuthology-archive/pdonnell-2021-06-26_00:57:00-fs-wip-pdonnell-testing-20210625.225421-distro-basic-smithi/6193988/remote/smithi075/log/ceph-mds.e.log.gz

Not clear what the client failed to finish doing to complete reconnect.


Related issues 3 (2 open1 closed)

Related to CephFS - Feature #51416: kclient: add debugging for mds failover eventsFix Under ReviewPatrick Donnelly

Actions
Related to CephFS - Bug #54461: ffsb.sh test failureResolvedXiubo Li

Actions
Related to Linux kernel client - Bug #57656: [testing] dbench: write failed on handle 10009 (Resource temporarily unavailable)Need More InfoXiubo Li

Actions
Actions

Also available in: Atom PDF