
Bug #51410

kclient: fails to finish reconnect during MDS thrashing (testing branch)

Added by Patrick Donnelly over 1 year ago. Updated 7 months ago.

Status:
New
Priority:
High
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS, kceph
Labels (FS):
qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-06-26T03:53:10.735 INFO:tasks.mds_thrash.fs.[cephfs]:kill mds.a (rank=4)
...
2021-06-26T03:53:29.940 INFO:tasks.mds_thrash.fs.[cephfs]:mds map: {
...
2021-06-26T03:55:00.942 INFO:tasks.workunit.client.0.smithi075.stdout:Wrote -1 instead of 4096 bytes.
2021-06-26T03:55:00.942 INFO:tasks.workunit.client.0.smithi075.stdout:Probably out of disk space
2021-06-26T03:55:00.943 INFO:tasks.workunit.client.0.smithi075.stderr:write: Input/output error

From: /ceph/teuthology-archive/pdonnell-2021-06-26_00:57:00-fs-wip-pdonnell-testing-20210625.225421-distro-basic-smithi/6193988/teuthology.log

mds.e took over for rank 4. We see this in the logs:

2021-06-26T03:53:34.065+0000 7f8a8801b700  1 --1- [v2:172.21.15.75:6834/4253739342,v1:172.21.15.75:6836/4253739342] >>  conn(0x557fd4f9e000 0x557fd4e25800 :6836 s=ACCEPTING pgs=0 cs=0 l=0).send_server_banner sd=33 legacy v1:172.21.15.75:6836/4253739342 socket_addr v1:172.21.15.75:6836/4253739342 target_addr v1:192.168.0.1:49634/0
2021-06-26T03:53:34.065+0000 7f8a8801b700 10 mds.e parse_caps: parsing auth_cap_str='allow'
2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e  existing session 0x557fd4fc8500 for client.4764 v1:192.168.0.1:0/499264774 existing con 0, new/authorizing con 0x557fd4f9e000
2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e parse_caps: parsing auth_cap_str='allow'
2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e ms_handle_accept v1:192.168.0.1:0/499264774 con 0x557fd4f9e000 session 0x557fd4fc8500
2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e  session connection 0 -> 0x557fd4f9e000
2021-06-26T03:53:34.065+0000 7f8a85015700  1 -- [v2:172.21.15.75:6834/4253739342,v1:172.21.15.75:6836/4253739342] <== client.4764 v1:192.168.0.1:0/499264774 1 ==== client_session(request_open) v4 ==== 169+0+0 (unknown 1896856091 0 0) 0x557fd4128a80 con 0x557fd4f9e000
...
2021-06-26T03:53:38.005+0000 7f8a85015700  1 -- [v2:172.21.15.75:6834/4253739342,v1:172.21.15.75:6836/4253739342] <== client.4764 v1:192.168.0.1:0/499264774 2 ==== client_caps(update ino 0x2000000022c 5 seq 0 tid 1806 caps=pAsLsXsFscb dirty=Fw wanted=- follows 1 mseq 1 size 104857600/0 mtime 2021-06-26T03:53:02.937619+0000 xattrs(v=18446744071724651093 l=0)) v10 ==== 236+0+0 (unknown 128463977 0 0) 0x557fd4f99200 con 0x557fd4f9e000
...
2021-06-26T03:55:00.100+0000 7f8a83812700  7 mds.4.server reconnect timed out, 1 clients have not reconnected in time
2021-06-26T03:55:00.100+0000 7f8a83812700  1 mds.4.server reconnect gives up on client.4764 v1:192.168.0.1:0/499264774
2021-06-26T03:55:00.100+0000 7f8a83812700  0 log_channel(cluster) log [WRN] : evicting unresponsive client  (4764), after waiting 65.4709 seconds during MDS startup
2021-06-26T03:55:00.100+0000 7f8a83812700  1 mds.4.29 Evicting (and blocklisting) client session 4764 (v1:192.168.0.1:0/499264774)
2021-06-26T03:55:00.100+0000 7f8a83812700  0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 4764 (v1:192.168.0.1:0/499264774)
2021-06-26T03:55:00.100+0000 7f8a83812700  4 mds.4.29 Preparing blocklist command... (wait=0)
2021-06-26T03:55:00.100+0000 7f8a83812700  4 mds.4.29 Sending mon blocklist command: {"prefix":"osd blocklist", "blocklistop":"add","addr":"v1:192.168.0.1:0/499264774"}

From: /ceph/teuthology-archive/pdonnell-2021-06-26_00:57:00-fs-wip-pdonnell-testing-20210625.225421-distro-basic-smithi/6193988/remote/smithi075/log/ceph-mds.e.log.gz

It is not clear what the client failed to do to complete the reconnect.


Related issues

Related to CephFS - Feature #51416: kclient: add debugging for mds failover events Fix Under Review
Related to CephFS - Bug #54461: ffsb.sh test failure Pending Backport
Related to Linux kernel client - Bug #57656: [testing] dbench: write failed on handle 10009 (Resource temporarily unavailable) Need More Info

History

#1 Updated by Patrick Donnelly over 1 year ago

Some interesting messages in the kernel log: /ceph/teuthology-archive/pdonnell-2021-06-26_00:57:00-fs-wip-pdonnell-testing-20210625.225421-distro-basic-smithi/6193988/remote/smithi075/syslog/kern.log.gz

#2 Updated by Patrick Donnelly over 1 year ago

2021-06-26T03:54:10.008307+00:00 smithi075 kernel: [  852.268592] ceph: mds4 caps renewed
2021-06-26T03:55:00.678530+00:00 smithi075 kernel: [  902.593908] libceph: mds4 (1)172.21.15.75:6836 socket closed (con state OPEN)

Does "mds4 caps renewed" indicate the client should have "finished" reconnecting?

#3 Updated by Jeff Layton over 1 year ago

It comes from this section of code in renewed_caps(), which runs when the caps have been renewed:

        was_stale = is_renew && time_after_eq(jiffies, session->s_cap_ttl);

        session->s_cap_ttl = session->s_renew_requested +
                mdsc->mdsmap->m_session_timeout*HZ;

        if (was_stale) {
                if (time_before(jiffies, session->s_cap_ttl)) {
                        pr_info("mds%d caps renewed\n", session->s_mds);
                        wake = 1;
                } else {
                        pr_info("mds%d caps still stale\n", session->s_mds);
                }
        }
        dout("renewed_caps mds%d ttl now %lu, was %s, now %s\n",
             session->s_mds, session->s_cap_ttl, was_stale ? "stale" : "fresh",
             time_before(jiffies, session->s_cap_ttl) ? "stale" : "fresh");

So, basically, we got a CEPH_SESSION_RENEWCAPS message from the MDS at that point, and the cap renewal succeeded after the caps had previously gone stale.

#4 Updated by Patrick Donnelly over 1 year ago

  • Related to Feature #51416: kclient: add debugging for mds failover events added

#5 Updated by Xiubo Li 11 months ago

  • Related to Bug #54461: ffsb.sh test failure added

#6 Updated by Patrick Donnelly 7 months ago

  • Target version deleted (v17.0.0)

#7 Updated by Xiubo Li 4 months ago

  • Related to Bug #57656: [testing] dbench: write failed on handle 10009 (Resource temporarily unavailable) added
