Project

General

Profile

Bug #46282

qa: multiclient connection interruptions by stopping one client

Added by Patrick Donnelly 9 days ago. Updated 1 day ago.

Status:
Fix Under Review
Priority:
Urgent
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
qa-suite
Labels (FS):
Pull request ID:
Crash signature:

Description

2020-06-30T01:01:08.478 INFO:tasks.cephfs_test_runner:======================================================================
2020-06-30T01:01:08.478 INFO:tasks.cephfs_test_runner:FAIL: test_reconnect_timeout (tasks.cephfs.test_client_recovery.TestClientRecovery)
2020-06-30T01:01:08.479 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-06-30T01:01:08.479 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2020-06-30T01:01:08.479 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20200629.214056/qa/tasks/cephfs/test_client_recovery.py", line 167, in test_reconnect_timeout
2020-06-30T01:01:08.480 INFO:tasks.cephfs_test_runner:    self.assert_session_count(1)
2020-06-30T01:01:08.480 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20200629.214056/qa/tasks/cephfs/cephfs_test_case.py", line 202, in assert_session_count
2020-06-30T01:01:08.480 INFO:tasks.cephfs_test_runner:    expected, alive_count
2020-06-30T01:01:08.480 INFO:tasks.cephfs_test_runner:AssertionError: 1 != 0 : Expected 1 sessions, found 0

From: /ceph/teuthology-archive/pdonnell-2020-06-29_23:19:06-fs-wip-pdonnell-testing-20200629.214056-distro-basic-smithi/5190446/teuthology.log

Test is functioning properly but we have, in the client which should survive:

2020-06-30T01:00:12.709+0000 7f9ec2ffd700  1 -- 192.168.0.2:0/2664284312 <== mon.0 v2:172.21.15.63:3300/0 7 ==== mdsmap(e 103) v1 ==== 688+0+0 (secure 0 0 0) 0x7f9ebc013730 con 0x556991ab2d80
2020-06-30T01:00:12.709+0000 7f9ec2ffd700  1 client.10650 handle_mds_map epoch 103
2020-06-30T01:00:12.709+0000 7f9ec2ffd700  1 client.10650 mds incarnation changed from 0 to 103
2020-06-30T01:00:12.709+0000 7f9ec2ffd700  1 -- 192.168.0.2:0/2664284312 >> [v2:172.21.15.63:6832/2439848416,v1:172.21.15.63:6833/2439848416] conn(0x556991ac59e0 msgr2=0x556991ac8ae0 unknown :-1 s=STATE_CLOSED l=0).mark_down
2020-06-30T01:00:12.709+0000 7f9ec2ffd700  1 --2- 192.168.0.2:0/2664284312 >> [v2:172.21.15.63:6832/2439848416,v1:172.21.15.63:6833/2439848416] conn(0x556991ac59e0 0x556991ac8ae0 unknown :-1 s=CLOSED pgs=11 cs=3 l=0 rev1=1 rx=0 tx=0).stop
2020-06-30T01:00:12.709+0000 7f9ec2ffd700 20 client.10650 trim_cache_for_reconnect mds.0
2020-06-30T01:00:12.709+0000 7f9ec2ffd700 20 client.10650 trim_cache_for_reconnect mds.0 trimmed 0 dentries
2020-06-30T01:00:12.713+0000 7f9eb27fc700  8 client.10650 _ll_getattr 0x1.head
2020-06-30T01:00:12.714+0000 7f9eb27fc700 10 client.10650 _getattr mask pAsLsXsFs issued=0
2020-06-30T01:00:12.714+0000 7f9eb27fc700 15 inode.get on 0x7f9eb4006ac0 0x1.head now 5
2020-06-30T01:00:12.714+0000 7f9eb27fc700 20 client.10650 choose_target_mds starting with req->inode 0x1.head(faked_ino=0 ref=5 ll_ref=7 cap_refs={} open={} mode=41777 size=0/0 nlink=1 btime=2020-06-30T00:59:56.944301+0000 mtime=2020-06-30T00:59:56.944301+0000 ctime=2020-06-30T01:00:09.928577+0000 caps=pLsXsFs(0=pLsXsFs) COMPLETE has_dir_layout 0x7f9eb4006ac0)
2020-06-30T01:00:12.714+0000 7f9eb27fc700 20 client.10650 choose_target_mds 0x1.head(faked_ino=0 ref=5 ll_ref=7 cap_refs={} open={} mode=41777 size=0/0 nlink=1 btime=2020-06-30T00:59:56.944301+0000 mtime=2020-06-30T00:59:56.944301+0000 ctime=2020-06-30T01:00:09.928577+0000 caps=pLsXsFs(0=pLsXsFs) COMPLETE has_dir_layout 0x7f9eb4006ac0) is_hash=0 hash=0
2020-06-30T01:00:12.714+0000 7f9eb27fc700 10 client.10650 choose_target_mds from caps on inode 0x1.head(faked_ino=0 ref=5 ll_ref=7 cap_refs={} open={} mode=41777 size=0/0 nlink=1 btime=2020-06-30T00:59:56.944301+0000 mtime=2020-06-30T00:59:56.944301+0000 ctime=2020-06-30T01:00:09.928577+0000 caps=pLsXsFs(0=pLsXsFs) COMPLETE has_dir_layout 0x7f9eb4006ac0)
2020-06-30T01:00:12.714+0000 7f9eb27fc700 20 client.10650 mds is 0
2020-06-30T01:00:12.714+0000 7f9eb27fc700 10 client.10650  target mds.0 not active, waiting for new mdsmap
2020-06-30T01:00:13.279+0000 7f9ec97fa700 20 client.10650 trim_cache size 0 max 16384
2020-06-30T01:00:14.279+0000 7f9ec97fa700 20 client.10650 trim_cache size 0 max 16384
2020-06-30T01:00:15.279+0000 7f9ec97fa700 20 client.10650 trim_cache size 0 max 16384
2020-06-30T01:00:16.279+0000 7f9ec97fa700 20 client.10650 trim_cache size 0 max 16384
2020-06-30T01:00:17.280+0000 7f9ec97fa700 20 client.10650 trim_cache size 0 max 16384
2020-06-30T01:00:18.280+0000 7f9ec97fa700 20 client.10650 trim_cache size 0 max 16384
2020-06-30T01:00:19.280+0000 7f9ec97fa700 20 client.10650 trim_cache size 0 max 16384
...
2020-06-30T01:01:04.754+0000 7f9ec2ffd700 10 client.10650 ms_handle_connect on v2:172.21.15.63:3300/0
2020-06-30T01:01:04.755+0000 7f9ec2ffd700  1 -- 192.168.0.2:0/2664284312 <== mon.0 v2:172.21.15.63:3300/0 1 ==== mon_map magic: 0 v1 ==== 383+0+0 (secure 0 0 0) 0x7f9ec401d700 con 0x7f9ea80022e0
2020-06-30T01:01:04.755+0000 7f9ec2ffd700  1 -- 192.168.0.2:0/2664284312 <== mon.0 v2:172.21.15.63:3300/0 2 ==== config(0 keys) v1 ==== 4+0+0 (secure 0 0 0) 0x7f9ec401d860 con 0x7f9ea80022e0
2020-06-30T01:01:04.755+0000 7f9ec2ffd700  1 -- 192.168.0.2:0/2664284312 <== mon.0 v2:172.21.15.63:3300/0 3 ==== mdsmap(e 106) v1 ==== 688+0+0 (secure 0 0 0) 0x7f9ec4025040 con 0x7f9ea80022e0

From: /ceph/teuthology-archive/pdonnell-2020-06-29_23:19:06-fs-wip-pdonnell-testing-20200629.214056-distro-basic-smithi/5190446/remote/smithi063/log/ceph-client.1.95095.log.gz

Last message the mon sent that was received:

2020-06-30T01:00:12.709+0000 7f737067c700  1 -- [v2:172.21.15.63:3300/0,v1:172.21.15.63:6789/0] --> 192.168.0.2:0/2664284312 -- mdsmap(e 103) v1 -- 0x55c755520000 con 0x55c755950000

From: /ceph/teuthology-archive/pdonnell-2020-06-29_23:19:06-fs-wip-pdonnell-testing-20200629.214056-distro-basic-smithi/5190446/remote/smithi063/log/ceph-mon.a.log.gz

The timing of this approximately coincides with tearing down the network namespace of the other mount:

2020-06-30T01:00:11.267 INFO:tasks.cephfs.mount:Cleaning the 'ceph-ns--home-ubuntu-cephtest-mnt.0' netns for '/home/ubuntu/cephtest/mnt.0'
2020-06-30T01:00:11.268 INFO:tasks.cephfs.mount:Removing the netns 'ceph-ns--home-ubuntu-cephtest-mnt.0'
2020-06-30T01:00:11.268 INFO:teuthology.orchestra.run:Running command with timeout 300
2020-06-30T01:00:11.269 INFO:teuthology.orchestra.run.smithi063:> sudo bash -c 'ip link set brx.1 down'
2020-06-30T01:00:11.347 INFO:teuthology.orchestra.run:Running command with timeout 300
2020-06-30T01:00:11.348 INFO:teuthology.orchestra.run.smithi063:> sudo bash -c 'ip link delete brx.1'
2020-06-30T01:00:11.450 INFO:teuthology.orchestra.run:Running command with timeout 300
2020-06-30T01:00:11.451 INFO:teuthology.orchestra.run.smithi063:> sudo bash -c 'ip netns delete ceph-ns--home-ubuntu-cephtest-mnt.0'
2020-06-30T01:00:11.512 INFO:teuthology.orchestra.run:Running command with timeout 300
2020-06-30T01:00:11.513 INFO:teuthology.orchestra.run.smithi063:> ip addr
2020-06-30T01:00:11.528 INFO:teuthology.orchestra.run.smithi063.stdout:1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
2020-06-30T01:00:11.528 INFO:teuthology.orchestra.run.smithi063.stdout:    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2020-06-30T01:00:11.529 INFO:teuthology.orchestra.run.smithi063.stdout:    inet 127.0.0.1/8 scope host lo
2020-06-30T01:00:11.529 INFO:teuthology.orchestra.run.smithi063.stdout:       valid_lft forever preferred_lft forever
2020-06-30T01:00:11.529 INFO:teuthology.orchestra.run.smithi063.stdout:    inet6 ::1/128 scope host
2020-06-30T01:00:11.529 INFO:teuthology.orchestra.run.smithi063.stdout:       valid_lft forever preferred_lft forever
2020-06-30T01:00:11.530 INFO:teuthology.orchestra.run.smithi063.stdout:2: eno1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
2020-06-30T01:00:11.530 INFO:teuthology.orchestra.run.smithi063.stdout:    link/ether 0c:c4:7a:a8:b4:76 brd ff:ff:ff:ff:ff:ff
2020-06-30T01:00:11.530 INFO:teuthology.orchestra.run.smithi063.stdout:3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
2020-06-30T01:00:11.530 INFO:teuthology.orchestra.run.smithi063.stdout:    link/ether 0c:c4:7a:a8:b4:77 brd ff:ff:ff:ff:ff:ff
2020-06-30T01:00:11.530 INFO:teuthology.orchestra.run.smithi063.stdout:4: enp3s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
2020-06-30T01:00:11.530 INFO:teuthology.orchestra.run.smithi063.stdout:    link/ether 0c:c4:7a:b6:f8:d6 brd ff:ff:ff:ff:ff:ff
2020-06-30T01:00:11.531 INFO:teuthology.orchestra.run.smithi063.stdout:5: enp3s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
2020-06-30T01:00:11.531 INFO:teuthology.orchestra.run.smithi063.stdout:    link/ether 0c:c4:7a:b6:f8:d7 brd ff:ff:ff:ff:ff:ff
2020-06-30T01:00:11.531 INFO:teuthology.orchestra.run.smithi063.stdout:    inet 172.21.15.63/20 brd 172.21.15.255 scope global dynamic noprefixroute enp3s0f1
2020-06-30T01:00:11.531 INFO:teuthology.orchestra.run.smithi063.stdout:       valid_lft 41721sec preferred_lft 41721sec
2020-06-30T01:00:11.531 INFO:teuthology.orchestra.run.smithi063.stdout:    inet6 fe80::ec4:7aff:feb6:f8d7/64 scope link
2020-06-30T01:00:11.532 INFO:teuthology.orchestra.run.smithi063.stdout:       valid_lft forever preferred_lft forever
2020-06-30T01:00:11.532 INFO:teuthology.orchestra.run.smithi063.stdout:50: ceph-brx: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
2020-06-30T01:00:11.532 INFO:teuthology.orchestra.run.smithi063.stdout:    link/ether e6:2f:8d:99:5b:09 brd ff:ff:ff:ff:ff:ff
2020-06-30T01:00:11.532 INFO:teuthology.orchestra.run.smithi063.stdout:    inet 192.168.255.254/16 brd 192.168.255.255 scope global ceph-brx
2020-06-30T01:00:11.532 INFO:teuthology.orchestra.run.smithi063.stdout:       valid_lft forever preferred_lft forever
2020-06-30T01:00:11.533 INFO:teuthology.orchestra.run.smithi063.stdout:    inet6 fe80::4801:d3ff:fe0f:dee9/64 scope link
2020-06-30T01:00:11.533 INFO:teuthology.orchestra.run.smithi063.stdout:       valid_lft forever preferred_lft forever
2020-06-30T01:00:11.533 INFO:teuthology.orchestra.run.smithi063.stdout:52: brx.2@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ceph-brx state UP group default qlen 1000
2020-06-30T01:00:11.533 INFO:teuthology.orchestra.run.smithi063.stdout:    link/ether e6:2f:8d:99:5b:09 brd ff:ff:ff:ff:ff:ff link-netns ceph-ns--home-ubuntu-cephtest-mnt.1
2020-06-30T01:00:11.533 INFO:teuthology.orchestra.run.smithi063.stdout:    inet6 fe80::e42f:8dff:fe99:5b09/64 scope link
2020-06-30T01:00:11.534 INFO:teuthology.orchestra.run.smithi063.stdout:       valid_lft forever preferred_lft forever
HERE
2020-06-30T01:00:11.534 INFO:teuthology.orchestra.run:Running command with timeout 300
2020-06-30T01:00:11.534 INFO:teuthology.orchestra.run.smithi063:> sudo bash -c 'ip link show'
2020-06-30T01:00:11.599 INFO:teuthology.orchestra.run.smithi063.stdout:1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
2020-06-30T01:00:11.600 INFO:teuthology.orchestra.run.smithi063.stdout:    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2020-06-30T01:00:11.600 INFO:teuthology.orchestra.run.smithi063.stdout:2: eno1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
2020-06-30T01:00:11.600 INFO:teuthology.orchestra.run.smithi063.stdout:    link/ether 0c:c4:7a:a8:b4:76 brd ff:ff:ff:ff:ff:ff
2020-06-30T01:00:11.600 INFO:teuthology.orchestra.run.smithi063.stdout:3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
2020-06-30T01:00:11.601 INFO:teuthology.orchestra.run.smithi063.stdout:    link/ether 0c:c4:7a:a8:b4:77 brd ff:ff:ff:ff:ff:ff
2020-06-30T01:00:11.601 INFO:teuthology.orchestra.run.smithi063.stdout:4: enp3s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
2020-06-30T01:00:11.601 INFO:teuthology.orchestra.run.smithi063.stdout:    link/ether 0c:c4:7a:b6:f8:d6 brd ff:ff:ff:ff:ff:ff
2020-06-30T01:00:11.601 INFO:teuthology.orchestra.run.smithi063.stdout:5: enp3s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
2020-06-30T01:00:11.601 INFO:teuthology.orchestra.run.smithi063.stdout:    link/ether 0c:c4:7a:b6:f8:d7 brd ff:ff:ff:ff:ff:ff
2020-06-30T01:00:11.602 INFO:teuthology.orchestra.run.smithi063.stdout:50: ceph-brx: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
2020-06-30T01:00:11.602 INFO:teuthology.orchestra.run.smithi063.stdout:    link/ether e6:2f:8d:99:5b:09 brd ff:ff:ff:ff:ff:ff
2020-06-30T01:00:11.602 INFO:teuthology.orchestra.run.smithi063.stdout:52: brx.2@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ceph-brx state UP mode DEFAULT group default qlen 1000
2020-06-30T01:00:11.602 INFO:teuthology.orchestra.run.smithi063.stdout:    link/ether e6:2f:8d:99:5b:09 brd ff:ff:ff:ff:ff:ff link-netns ceph-ns--home-ubuntu-cephtest-mnt.1
2020-06-30T01:00:11.603 INFO:tasks.ceph.mds.a:Restarting daemon

We've seen similar failures in other suites and tests, like: https://github.com/ceph/ceph/pull/35496#pullrequestreview-437849297

My current theory is that there is some unexpected interaction between the network bridges during tear down but I don't see how.

History

#1 Updated by Xiubo Li 7 days ago

  • Status changed from New to In Progress

#2 Updated by Xiubo Li 7 days ago

Patrick Donnelly wrote:

[...]

From: /ceph/teuthology-archive/pdonnell-2020-06-29_23:19:06-fs-wip-pdonnell-testing-20200629.214056-distro-basic-smithi/5190446/teuthology.log

Test is functioning properly but we have, in the client which should survive:

[...]

From: /ceph/teuthology-archive/pdonnell-2020-06-29_23:19:06-fs-wip-pdonnell-testing-20200629.214056-distro-basic-smithi/5190446/remote/smithi063/log/ceph-client.1.95095.log.gz

Last message the mon sent that was received:

This mdsmap should be "up:replay" state, because it didn't call the "send_reconnect(session);" and no any log about this, please see:

2020-06-30T01:00:11.279+0000 7f9ec97fa700 20 client.10650 trim_cache size 0 max 16384
2020-06-30T01:00:12.279+0000 7f9ec97fa700 20 client.10650 trim_cache size 0 max 16384
2020-06-30T01:00:12.709+0000 7f9ec2ffd700  1 -- 192.168.0.2:0/2664284312 <== mon.0 v2:172.21.15.63:3300/0 7 ==== mdsmap(e 103) v1 ==== 688+0+0 (secure 0 0 0) 0x7f9ebc013730 con 0x556991ab2d80
2020-06-30T01:00:12.709+0000 7f9ec2ffd700  1 client.10650 handle_mds_map epoch 103
2020-06-30T01:00:12.709+0000 7f9ec2ffd700  1 client.10650 mds incarnation changed from 0 to 103
2020-06-30T01:00:12.709+0000 7f9ec2ffd700  1 -- 192.168.0.2:0/2664284312 >> [v2:172.21.15.63:6832/2439848416,v1:172.21.15.63:6833/2439848416] conn(0x556991ac59e0 msgr2=0x556991ac8ae0 unknown :-1 s=STATE_CLOSED l=0).mark_down
2020-06-30T01:00:12.709+0000 7f9ec2ffd700  1 --2- 192.168.0.2:0/2664284312 >> [v2:172.21.15.63:6832/2439848416,v1:172.21.15.63:6833/2439848416] conn(0x556991ac59e0 0x556991ac8ae0 unknown :-1 s=CLOSED pgs=11 cs=3 l=0 rev1=1 rx=0 tx=0).stop
2020-06-30T01:00:12.709+0000 7f9ec2ffd700 20 client.10650 trim_cache_for_reconnect mds.0
2020-06-30T01:00:12.709+0000 7f9ec2ffd700 20 client.10650 trim_cache_for_reconnect mds.0 trimmed 0 dentries
2020-06-30T01:00:12.713+0000 7f9eb27fc700  8 client.10650 _ll_getattr 0x1.head

And the next mdsmap followed was "up:active" directly:

2020-06-30T01:01:04.755+0000 7f9ec2ffd700  1 client.10650 handle_mds_map epoch 106
2020-06-30T01:01:04.755+0000 7f9ec2ffd700  1 client.10650 we may miss the MDSMap::RECONNECT, close mds session ...
2020-06-30T01:01:04.755+0000 7f9ec2ffd700  5 client.10650 _closed_mds_session mds.0 seq 0

And it says "we may miss the MDSMap::RECONNECT" and then closed the session.

That means the client.10650 didn't do the session reconnect at all. This should be the root cause of this issue.

I will check why the mon didn't send the up:reconnect mdsmap to client.10650.

Thanks

#3 Updated by Xiubo Li 3 days ago

Locally with the local qa test I can see the same connection refuse logs, after the client gets the up:replay mdsmap:

2020-06-30T01:00:12.709+0000 7f9ec2ffd700  1 client.10650 handle_mds_map epoch 103
2020-06-30T01:00:12.709+0000 7f9ec2ffd700  1 client.10650 mds incarnation changed from 0 to 103
2020-06-30T01:00:12.709+0000 7f9ec2ffd700  1 -- 192.168.0.2:0/2664284312 >> [v2:172.21.15.63:6832/2439848416,v1:172.21.15.63:6833/2439848416] conn(0x556991ac59e0 msgr2=0x556991ac8ae0 unknown :-1 s=STATE_CLOSED l=0).mark_down
2020-06-30T01:00:12.709+0000 7f9ec2ffd700  1 --2- 192.168.0.2:0/2664284312 >> [v2:172.21.15.63:6832/2439848416,v1:172.21.15.63:6833/2439848416] conn(0x556991ac59e0 0x556991ac8ae0 unknown :-1 s=CLOSED pgs=11 cs=3 l=0 rev1=1 rx=0 tx=0).stop
2020-06-30T01:00:12.709+0000 7f9ec2ffd700 20 client.10650 trim_cache_for_reconnect mds.0
2020-06-30T01:00:12.709+0000 7f9ec2ffd700 20 client.10650 trim_cache_for_reconnect mds.0 trimmed 0 dentries
2020-06-30T01:00:12.713+0000 7f9eb27fc700  8 client.10650 _ll_getattr 0x1.head
2020-06-30T01:00:12.714+0000 7f9eb27fc700 10 client.10650 _getattr mask pAsLsXsFs issued=0
2020-06-30T01:00:12.714+0000 7f9eb27fc700 15 inode.get on 0x7f9eb4006ac0 0x1.head now 5
2020-06-30T01:00:12.714+0000 7f9eb27fc700 20 client.10650 choose_target_mds starting with req->inode 0x1.head(faked_ino=0 ref=5 ll_ref=7 cap_refs={} open={} mode=41777 size=0/0 nlink=1 btime=2020-06-30T00:59:56.944301+0000 mtime=2020-06-30T00:59:56.944301+0000 ctime=2020-06-30T01:00:09.928577+0000 caps=pLsXsFs(0=pLsXsFs) COMPLETE has_dir_layout 0x7f9eb4006ac0)
2020-06-30T01:00:12.714+0000 7f9eb27fc700 20 client.10650 choose_target_mds 0x1.head(faked_ino=0 ref=5 ll_ref=7 cap_refs={} open={} mode=41777 size=0/0 nlink=1 btime=2020-06-30T00:59:56.944301+0000 mtime=2020-06-30T00:59:56.944301+0000 ctime=2020-06-30T01:00:09.928577+0000 caps=pLsXsFs(0=pLsXsFs) COMPLETE has_dir_layout 0x7f9eb4006ac0) is_hash=0 hash=0
2020-06-30T01:00:12.714+0000 7f9eb27fc700 10 client.10650 choose_target_mds from caps on inode 0x1.head(faked_ino=0 ref=5 ll_ref=7 cap_refs={} open={} mode=41777 size=0/0 nlink=1 btime=2020-06-30T00:59:56.944301+0000 mtime=2020-06-30T00:59:56.944301+0000 ctime=2020-06-30T01:00:09.928577+0000 caps=pLsXsFs(0=pLsXsFs) COMPLETE has_dir_layout 0x7f9eb4006ac0)
2020-06-30T01:00:12.714+0000 7f9eb27fc700 20 client.10650 mds is 0
2020-06-30T01:00:12.714+0000 7f9eb27fc700 10 client.10650  target mds.0 not active, waiting for new mdsmap

...

2020-06-30T01:00:44.781+0000 7f9ec1ffb700  1 -- 192.168.0.2:0/2664284312 >> [v2:172.21.15.146:3301/0,v1:172.21.15.146:6790/0] conn(0x7f9ea8007690 msgr2=0x7f9ea8009b50 unknown :-1 s=STATE_CONNECTING_RE l=1).mark_down
2020-06-30T01:00:44.781+0000 7f9ec1ffb700  1 --2- 192.168.0.2:0/2664284312 >> [v2:172.21.15.146:3301/0,v1:172.21.15.146:6790/0] conn(0x7f9ea8007690 0x7f9ea8009b50 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0).stop
2020-06-30T01:00:44.781+0000 7f9ec1ffb700  1 -- 192.168.0.2:0/2664284312 >> [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] conn(0x7f9ea8004a70 msgr2=0x7f9ea8006f10 unknown :-1 s=STATE_CONNECTING_RE l=1).mark_down
2020-06-30T01:00:44.781+0000 7f9ec1ffb700  1 --2- 192.168.0.2:0/2664284312 >> [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] conn(0x7f9ea8004a70 0x7f9ea8006f10 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0).stop
2020-06-30T01:00:44.781+0000 7f9ec1ffb700  1 -- 192.168.0.2:0/2664284312 >> [v2:172.21.15.63:3300/0,v1:172.21.15.63:6789/0] conn(0x7f9ea8001f00 msgr2=0x7f9ea8004350 unknown :-1 s=STATE_CONNECTING_RE l=1).mark_down
2020-06-30T01:00:44.781+0000 7f9ec1ffb700  1 --2- 192.168.0.2:0/2664284312 >> [v2:172.21.15.63:3300/0,v1:172.21.15.63:6789/0] conn(0x7f9ea8001f00 0x7f9ea8004350 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0).stop
2020-06-30T01:00:44.781+0000 7f9ed1c59700  1 -- 192.168.0.2:0/2664284312 reap_dead start
2020-06-30T01:00:44.781+0000 7f9ec1ffb700  1 --2- 192.168.0.2:0/2664284312 >> [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] conn(0x7f9ea800ae90 0x7f9ea800d2e0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0).connect
2020-06-30T01:00:44.781+0000 7f9ed1c59700  1 -- 192.168.0.2:0/2664284312 reap_dead start
2020-06-30T01:00:44.781+0000 7f9ec1ffb700  1 --2- 192.168.0.2:0/2664284312 >> [v2:172.21.15.63:3300/0,v1:172.21.15.63:6789/0] conn(0x7f9ea8004f00 0x7f9ea8007b40 unknown :-1 s=NONE pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0).connect
2020-06-30T01:00:44.781+0000 7f9ec1ffb700  1 --2- 192.168.0.2:0/2664284312 >> [v2:172.21.15.146:3301/0,v1:172.21.15.146:6790/0] conn(0x7f9ea8008030 0x7f9ea80042f0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0).connect
2020-06-30T01:00:44.781+0000 7f9ed1c59700  1 -- 192.168.0.2:0/2664284312 reap_dead start
2020-06-30T01:00:44.781+0000 7f9ec1ffb700  1 -- 192.168.0.2:0/2664284312 --> [v2:172.21.15.63:3300/0,v1:172.21.15.63:6789/0] -- mon_getmap magic: 0 v1 -- 0x7f9ea800a350 con 0x7f9ea8004f00
2020-06-30T01:00:44.781+0000 7f9ec1ffb700  1 -- 192.168.0.2:0/2664284312 --> [v2:172.21.15.146:3300/0,v1:172.21.15.146:6789/0] -- mon_getmap magic: 0 v1 -- 0x7f9ea800a4c0 con 0x7f9ea800ae90
2020-06-30T01:00:44.781+0000 7f9ec1ffb700  1 -- 192.168.0.2:0/2664284312 --> [v2:172.21.15.146:3301/0,v1:172.21.15.146:6790/0] -- mon_getmap magic: 0 v1 -- 0x7f9ea800a680 con 0x7f9ea8008030

...

In some case the above connection refuse last in very short time, but in some cases if it last too long, the client will lost the new mdsmap messages, and then the reconnecting will be skippt.

Currently we will delete the brx.N bridge interface directly, this won't delete the container veth pair:

diff --git a/qa/tasks/cephfs/mount.py b/qa/tasks/cephfs/mount.py
index 2cfa1cc3f7..6d55c4995d 100644
--- a/qa/tasks/cephfs/mount.py
+++ b/qa/tasks/cephfs/mount.py
@@ -235,8 +235,9 @@ class CephFSMount(object):
         args = ["sudo", "bash", "-c",
                 "ip link set brx.{0} down".format(self.nsid)]
         self.client_remote.run(args=args, timeout=(5*60))
+
         args = ["sudo", "bash", "-c",
-                "ip link delete brx.{0}".format(self.nsid)]
+                "ip link delete dev brx.{0} master ceph-brx".format(self.nsid)]
         self.client_remote.run(args=args, timeout=(5*60))

         args = ["sudo", "bash", "-c",

With the above fix, there hasn't any connection refuse reproduced, but I still couldn't sure why ?

#4 Updated by Xiubo Li 1 day ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 35944

Also available in: Atom PDF