Bug #48439
fsstress failure with mds thrashing: "mds.0.6 Evicting (and blocklisting) client session 4564 (v1:172.21.15.47:0/603539598)"
Description
2020-12-02T12:04:39.361+0000 7f965bac6700 7 mds.0.server reconnect timed out, 1 clients have not reconnected in time
2020-12-02T12:04:39.361+0000 7f965bac6700 1 mds.0.server reconnect gives up on client.4564 v1:172.21.15.47:0/603539598
2020-12-02T12:04:39.361+0000 7f965bac6700 0 log_channel(cluster) log [WRN] : evicting unresponsive client smithi047: (4564), after waiting 46.0999 seconds during MDS startup
From: /ceph/teuthology-archive/pdonnell-2020-12-02_07:09:18-fs-wip-pdonnell-testing-20201202.050726-distro-basic-smithi/5674936/remote/smithi083/log/ceph-mds.b.log.gz
(and others from that run, on both stock RHEL 8.3 and testing kernels.)
relevant lines from kernel log:
2020-12-02T12:03:53.267177+00:00 smithi047 kernel: ceph: mds0 reconnect start
2020-12-02T12:03:53.293238+00:00 smithi047 kernel: libceph: mds0 (1)172.21.15.83:6835 socket error on write
2020-12-02T12:04:42.388134+00:00 smithi047 kernel: ceph: mds0 recovery completed
From: /ceph/teuthology-archive/pdonnell-2020-12-02_07:09:18-fs-wip-pdonnell-testing-20201202.050726-distro-basic-smithi/5674936/remote/smithi047/syslog/kern.log.gz
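To make the timing in the kernel log easier to compare with the MDS side, here is a minimal Python sketch that parses the three kernel log lines quoted above and computes the elapsed time between events. The helper names (`parse_events`, `seconds_between`) are hypothetical and not part of any Ceph tooling; the log text is copied from this report.

```python
from datetime import datetime

# Kernel log lines copied from this report (smithi047 kern.log).
KERNEL_LOG = """\
2020-12-02T12:03:53.267177+00:00 smithi047 kernel: ceph: mds0 reconnect start
2020-12-02T12:03:53.293238+00:00 smithi047 kernel: libceph: mds0 (1)172.21.15.83:6835 socket error on write
2020-12-02T12:04:42.388134+00:00 smithi047 kernel: ceph: mds0 recovery completed
"""


def parse_events(log_text):
    """Return (timestamp, message) tuples, one per non-empty log line."""
    events = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        # Format assumed here: "<ISO timestamp> <host> <rest of message>"
        stamp, _host, message = line.split(" ", 2)
        events.append((datetime.fromisoformat(stamp), message))
    return events


def seconds_between(events, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the first containing end_marker."""
    start = next(ts for ts, msg in events if start_marker in msg)
    end = next(ts for ts, msg in events if end_marker in msg)
    return (end - start).total_seconds()


events = parse_events(KERNEL_LOG)
reconnect_to_error = seconds_between(events, "reconnect start", "socket error on write")
reconnect_to_recovery = seconds_between(events, "reconnect start", "recovery completed")

print(f"reconnect start -> socket error:       {reconnect_to_error:.6f} s")
print(f"reconnect start -> recovery completed: {reconnect_to_recovery:.6f} s")
```

With these lines, the socket error follows the reconnect start by about 0.026 s, and the client reports recovery completed about 49.1 s after the reconnect start. The MDS log above shows the eviction at 12:04:39.361 after waiting 46.0999 seconds, so the eviction precedes the client's "recovery completed" message by roughly 3 s.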
Related issues
History
#1 Updated by Patrick Donnelly about 2 months ago
relevant ECONNRESET:
2020-12-02T12:03:53.283+0000 7f965face700 1 -- [v2:172.21.15.83:6834/3199959640,v1:172.21.15.83:6835/3199959640] >> v1:172.21.15.47:0/603539598 conn(0x559ae240fc00 legacy=0x559ae21eb800 unknown :6835 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2020-12-02T12:03:53.283+0000 7f965face700 1 --1- [v2:172.21.15.83:6834/3199959640,v1:172.21.15.83:6835/3199959640] >> v1:172.21.15.47:0/603539598 conn(0x559ae240fc00 0x559ae21eb800 :6835 s=OPENED pgs=22 cs=1 l=0).handle_message read tag failed
#2 Updated by Jeff Layton about 2 months ago
I wonder if this is the same problem as https://tracker.ceph.com/issues/47563? What kernel was the client running?
#3 Updated by Jeff Layton about 2 months ago
Answering my own question, looks like: 4.18.0-240.1.1.el8_3.x86_64. I'd be interested to see if this is still a problem with recent testing kernels.
#4 Updated by Patrick Donnelly about 2 months ago
- Description updated
#5 Updated by Patrick Donnelly about 2 months ago
Jeff Layton wrote:
Answering my own question, looks like: 4.18.0-240.1.1.el8_3.x86_64. I'd be interested to see if this is still a problem with recent testing kernels.
We talked on IRC, but for posterity: this reproduces somewhat reliably only with the stock RHEL 8.3 kernel.
#6 Updated by Patrick Donnelly about 2 months ago
- Related to Bug #47563: qa: kernel client closes session improperly causing eviction due to timeout added
#7 Updated by Patrick Donnelly 7 days ago
- Target version changed from v16.0.0 to v17.0.0
- Backport set to pacific,octopus,nautilus