Bug #51410
openkclient: fails to finish reconnect during MDS thrashing (testing branch)
0%
Description
2021-06-26T03:53:10.735 INFO:tasks.mds_thrash.fs.[cephfs]:kill mds.a (rank=4) ... 2021-06-26T03:53:29.940 INFO:tasks.mds_thrash.fs.[cephfs]:mds map: { ... 2021-06-26T03:55:00.942 INFO:tasks.workunit.client.0.smithi075.stdout:Wrote -1 instead of 4096 bytes. 2021-06-26T03:55:00.942 INFO:tasks.workunit.client.0.smithi075.stdout:Probably out of disk space 2021-06-26T03:55:00.943 INFO:tasks.workunit.client.0.smithi075.stderr:write: Input/output error
From: /ceph/teuthology-archive/pdonnell-2021-06-26_00:57:00-fs-wip-pdonnell-testing-20210625.225421-distro-basic-smithi/6193988/teuthology.log
mds.e took over for rank 4. We see this in the logs:
2021-06-26T03:53:34.065+0000 7f8a8801b700 1 --1- [v2:172.21.15.75:6834/4253739342,v1:172.21.15.75:6836/4253739342] >> conn(0x557fd4f9e000 0x557fd4e25800 :6836 s=ACCEPTING pgs=0 cs=0 l=0).send_server_banner sd=33 legacy v1:172.21.15.75:6836/4253739342 socket_addr v1:172.21.15.75:6836/4253739342 target_addr v1:192.168.0.1:49634/0 2021-06-26T03:53:34.065+0000 7f8a8801b700 10 mds.e parse_caps: parsing auth_cap_str='allow' 2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e existing session 0x557fd4fc8500 for client.4764 v1:192.168.0.1:0/499264774 existing con 0, new/authorizing con 0x557fd4f9e000 2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e parse_caps: parsing auth_cap_str='allow' 2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e ms_handle_accept v1:192.168.0.1:0/499264774 con 0x557fd4f9e000 session 0x557fd4fc8500 2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e session connection 0 -> 0x557fd4f9e000 2021-06-26T03:53:34.065+0000 7f8a85015700 1 -- [v2:172.21.15.75:6834/4253739342,v1:172.21.15.75:6836/4253739342] <== client.4764 v1:192.168.0.1:0/499264774 1 ==== client_session(request_open) v4 ==== 169+0+0 (unknown 1896856091 0 0) 0x557fd4128a80 con 0x557fd4f9e000 ... 2021-06-26T03:53:38.005+0000 7f8a85015700 1 -- [v2:172.21.15.75:6834/4253739342,v1:172.21.15.75:6836/4253739342] <== client.4764 v1:192.168.0.1:0/499264774 2 ==== client_caps(update ino 0x2000000022c 5 seq 0 tid 1806 caps=pAsLsXsFscb dirty=Fw wanted=- follows 1 mseq 1 size 104857600/0 mtime 2021-06-26T03:53:02.937619+0000 xattrs(v=18446744071724651093 l=0)) v10 ==== 236+0+0 (unknown 128463977 0 0) 0x557fd4f99200 con 0x557fd4f9e000 ... 2021-06-26T03:55:00.100+0000 7f8a83812700 7 mds.4.server reconnect timed out, 1 clients have not reconnected in time 2021-06-26T03:55:00.100+0000 7f8a83812700 1 mds.4.server reconnect gives up on client.4764 v1:192.168.0.1:0/499264774 2021-06-26T03:55:00.100+0000 7f8a83812700 0 log_channel(cluster) log [WRN] : evicting unresponsive client (4764), after waiting 65.4709 seconds during MDS startup 2021-06-26T03:55:00.100+0000 7f8a83812700 1 mds.4.29 Evicting (and blocklisting) client session 4764 (v1:192.168.0.1:0/499264774) 2021-06-26T03:55:00.100+0000 7f8a83812700 0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 4764 (v1:192.168.0.1:0/499264774) 2021-06-26T03:55:00.100+0000 7f8a83812700 4 mds.4.29 Preparing blocklist command... (wait=0) 2021-06-26T03:55:00.100+0000 7f8a83812700 4 mds.4.29 Sending mon blocklist command: {"prefix":"osd blocklist", "blocklistop":"add","addr":"v1:192.168.0.1:0/499264774"}
From: /ceph/teuthology-archive/pdonnell-2021-06-26_00:57:00-fs-wip-pdonnell-testing-20210625.225421-distro-basic-smithi/6193988/remote/smithi075/log/ceph-mds.e.log.gz
Not clear what the client failed to finish doing to complete reconnect.
Updated by Patrick Donnelly almost 3 years ago
Some interesting messages in the kernel log: /ceph/teuthology-archive/pdonnell-2021-06-26_00:57:00-fs-wip-pdonnell-testing-20210625.225421-distro-basic-smithi/6193988/remote/smithi075/syslog/kern.log.gz
Updated by Patrick Donnelly almost 3 years ago
2021-06-26T03:54:10.008307+00:00 smithi075 kernel: [ 852.268592] ceph: mds4 caps renewed 2021-06-26T03:55:00.678530+00:00 smithi075 kernel: [ 902.593908] libceph: mds4 (1)172.21.15.75:6836 socket closed (con state OPEN)
Does "mds4 caps renewed" indicate the client should have "finished" reconnecting?
Updated by Jeff Layton almost 3 years ago
It comes from this swath of code, that gets called when the caps have been renewed:
was_stale = is_renew && time_after_eq(jiffies, session->s_cap_ttl); session->s_cap_ttl = session->s_renew_requested + mdsc->mdsmap->m_session_timeout*HZ; if (was_stale) { if (time_before(jiffies, session->s_cap_ttl)) { pr_info("mds%d caps renewed\n", session->s_mds); wake = 1; } else { pr_info("mds%d caps still stale\n", session->s_mds); } } dout("renewed_caps mds%d ttl now %lu, was %s, now %s\n", session->s_mds, session->s_cap_ttl, was_stale ? "stale" : "fresh", time_before(jiffies, session->s_cap_ttl) ? "stale" : "fresh");So, basically we got a CEPH_SESSION_RENEWCAPS message from the MDS at that point, and the cap renewal was successful after they had previously gone stale.
Updated by Patrick Donnelly almost 3 years ago
- Related to Feature #51416: kclient: add debugging for mds failover events added
Updated by Xiubo Li about 2 years ago
- Related to Bug #54461: ffsb.sh test failure added
Updated by Xiubo Li over 1 year ago
- Related to Bug #57656: [testing] dbench: write failed on handle 10009 (Resource temporarily unavailable) added