
Bug #51410

kclient: fails to finish reconnect during MDS thrashing (testing branch)

Added by Patrick Donnelly over 1 year ago. Updated 7 months ago.

Status:
New
Priority:
High
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS, kceph
Labels (FS):
qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-06-26T03:53:10.735 INFO:tasks.mds_thrash.fs.[cephfs]:kill mds.a (rank=4)
...
2021-06-26T03:53:29.940 INFO:tasks.mds_thrash.fs.[cephfs]:mds map: {
...
2021-06-26T03:55:00.942 INFO:tasks.workunit.client.0.smithi075.stdout:Wrote -1 instead of 4096 bytes.
2021-06-26T03:55:00.942 INFO:tasks.workunit.client.0.smithi075.stdout:Probably out of disk space
2021-06-26T03:55:00.943 INFO:tasks.workunit.client.0.smithi075.stderr:write: Input/output error

From: /ceph/teuthology-archive/pdonnell-2021-06-26_00:57:00-fs-wip-pdonnell-testing-20210625.225421-distro-basic-smithi/6193988/teuthology.log

mds.e took over for rank 4. We see this in the logs:

2021-06-26T03:53:34.065+0000 7f8a8801b700  1 --1- [v2:172.21.15.75:6834/4253739342,v1:172.21.15.75:6836/4253739342] >>  conn(0x557fd4f9e000 0x557fd4e25800 :6836 s=ACCEPTING pgs=0 cs=0 l=0).send_server_banner sd=33 legacy v1:172.21.15.75:6836/4253739342 socket_addr v1:172.21.15.75:6836/4253739342 target_addr v1:192.168.0.1:49634/0
2021-06-26T03:53:34.065+0000 7f8a8801b700 10 mds.e parse_caps: parsing auth_cap_str='allow'
2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e  existing session 0x557fd4fc8500 for client.4764 v1:192.168.0.1:0/499264774 existing con 0, new/authorizing con 0x557fd4f9e000
2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e parse_caps: parsing auth_cap_str='allow'
2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e ms_handle_accept v1:192.168.0.1:0/499264774 con 0x557fd4f9e000 session 0x557fd4fc8500
2021-06-26T03:53:34.065+0000 7f8a85015700 10 mds.e  session connection 0 -> 0x557fd4f9e000
2021-06-26T03:53:34.065+0000 7f8a85015700  1 -- [v2:172.21.15.75:6834/4253739342,v1:172.21.15.75:6836/4253739342] <== client.4764 v1:192.168.0.1:0/499264774 1 ==== client_session(request_open) v4 ==== 169+0+0 (unknown 1896856091 0 0) 0x557fd4128a80 con 0x557fd4f9e000
...
2021-06-26T03:53:38.005+0000 7f8a85015700  1 -- [v2:172.21.15.75:6834/4253739342,v1:172.21.15.75:6836/4253739342] <== client.4764 v1:192.168.0.1:0/499264774 2 ==== client_caps(update ino 0x2000000022c 5 seq 0 tid 1806 caps=pAsLsXsFscb dirty=Fw wanted=- follows 1 mseq 1 size 104857600/0 mtime 2021-06-26T03:53:02.937619+0000 xattrs(v=18446744071724651093 l=0)) v10 ==== 236+0+0 (unknown 128463977 0 0) 0x557fd4f99200 con 0x557fd4f9e000
...
2021-06-26T03:55:00.100+0000 7f8a83812700  7 mds.4.server reconnect timed out, 1 clients have not reconnected in time
2021-06-26T03:55:00.100+0000 7f8a83812700  1 mds.4.server reconnect gives up on client.4764 v1:192.168.0.1:0/499264774
2021-06-26T03:55:00.100+0000 7f8a83812700  0 log_channel(cluster) log [WRN] : evicting unresponsive client  (4764), after waiting 65.4709 seconds during MDS startup
2021-06-26T03:55:00.100+0000 7f8a83812700  1 mds.4.29 Evicting (and blocklisting) client session 4764 (v1:192.168.0.1:0/499264774)
2021-06-26T03:55:00.100+0000 7f8a83812700  0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 4764 (v1:192.168.0.1:0/499264774)
2021-06-26T03:55:00.100+0000 7f8a83812700  4 mds.4.29 Preparing blocklist command... (wait=0)
2021-06-26T03:55:00.100+0000 7f8a83812700  4 mds.4.29 Sending mon blocklist command: {"prefix":"osd blocklist", "blocklistop":"add","addr":"v1:192.168.0.1:0/499264774"}

From: /ceph/teuthology-archive/pdonnell-2021-06-26_00:57:00-fs-wip-pdonnell-testing-20210625.225421-distro-basic-smithi/6193988/remote/smithi075/log/ceph-mds.e.log.gz

It is not clear what the client failed to do to complete the reconnect.


Related issues

Related to CephFS - Feature #51416: kclient: add debugging for mds failover events Fix Under Review
Related to CephFS - Bug #54461: ffsb.sh test failure Pending Backport
Related to Linux kernel client - Bug #57656: [testing] dbench: write failed on handle 10009 (Resource temporarily unavailable) Need More Info

History

#1 Updated by Patrick Donnelly over 1 year ago

Some interesting messages in the kernel log: /ceph/teuthology-archive/pdonnell-2021-06-26_00:57:00-fs-wip-pdonnell-testing-20210625.225421-distro-basic-smithi/6193988/remote/smithi075/syslog/kern.log.gz

#2 Updated by Patrick Donnelly over 1 year ago

2021-06-26T03:54:10.008307+00:00 smithi075 kernel: [  852.268592] ceph: mds4 caps renewed
2021-06-26T03:55:00.678530+00:00 smithi075 kernel: [  902.593908] libceph: mds4 (1)172.21.15.75:6836 socket closed (con state OPEN)

Does "mds4 caps renewed" indicate the client should have "finished" reconnecting?

#3 Updated by Jeff Layton over 1 year ago

It comes from this section of code in renewed_caps(), which runs when the caps have been renewed:

        was_stale = is_renew && time_after_eq(jiffies, session->s_cap_ttl);

        session->s_cap_ttl = session->s_renew_requested +
                mdsc->mdsmap->m_session_timeout*HZ;

        if (was_stale) {
                if (time_before(jiffies, session->s_cap_ttl)) {
                        pr_info("mds%d caps renewed\n", session->s_mds);
                        wake = 1;
                } else {
                        pr_info("mds%d caps still stale\n", session->s_mds);
                }
        }
        dout("renewed_caps mds%d ttl now %lu, was %s, now %s\n",
             session->s_mds, session->s_cap_ttl, was_stale ? "stale" : "fresh",
             time_before(jiffies, session->s_cap_ttl) ? "stale" : "fresh");

So, basically, we got a CEPH_SESSION_RENEWCAPS message from the MDS at that point, and the cap renewal succeeded after the caps had previously gone stale.

#4 Updated by Patrick Donnelly over 1 year ago

  • Related to Feature #51416: kclient: add debugging for mds failover events added

#5 Updated by Xiubo Li 11 months ago

  • Related to Bug #54461: ffsb.sh test failure added

#6 Updated by Patrick Donnelly 7 months ago

  • Target version deleted (v17.0.0)

#7 Updated by Xiubo Li 4 months ago

  • Related to Bug #57656: [testing] dbench: write failed on handle 10009 (Resource temporarily unavailable) added
