Bug #48439: fsstress failure with mds thrashing: "mds.0.6 Evicting (and blocklisting) client session 4564 (v1:172.21.15.47:0/603539598)" - CephFS - Ceph

closed

Added by Patrick Donnelly over 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee: Jeff Layton
Target version: Ceph - v17.0.0
Source: Q/A
Backport: pacific,octopus,nautilus
Severity: 3 - minor
Component(FS): kceph
Labels (FS): qa-failure

Description

2020-12-02T12:04:39.361+0000 7f965bac6700  7 mds.0.server reconnect timed out, 1 clients have not reconnected in time
2020-12-02T12:04:39.361+0000 7f965bac6700  1 mds.0.server reconnect gives up on client.4564 v1:172.21.15.47:0/603539598
2020-12-02T12:04:39.361+0000 7f965bac6700  0 log_channel(cluster) log [WRN] : evicting unresponsive client smithi047: (4564), after waiting 46.0999 seconds during MDS startup

From: /ceph/teuthology-archive/pdonnell-2020-12-02_07:09:18-fs-wip-pdonnell-testing-20201202.050726-distro-basic-smithi/5674936/remote/smithi083/log/ceph-mds.b.log.gz
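When comparing several failed runs, the eviction warning above can be pulled apart mechanically. The following is a minimal sketch, assuming the message keeps the shape seen in the quoted cluster log line; the regex and the helper name `parse_eviction` are illustrative and are not part of Ceph or teuthology.

```python
import re

# Pattern mirrors the [WRN] message quoted above. The field layout is an
# assumption based on that one line, not on a documented log format.
EVICT_RE = re.compile(
    r"evicting unresponsive client (?P<host>\S+): \((?P<client_id>\d+)\), "
    r"after waiting (?P<waited>[\d.]+) seconds during MDS startup"
)


def parse_eviction(line):
    """Return (host, client_id, seconds_waited), or None if the line does not match."""
    m = EVICT_RE.search(line)
    if m is None:
        return None
    return m.group("host"), int(m.group("client_id")), float(m.group("waited"))


# The MDS log line from the description, split only for readability.
line = (
    "2020-12-02T12:04:39.361+0000 7f965bac6700  0 log_channel(cluster) log [WRN] : "
    "evicting unresponsive client smithi047: (4564), after waiting 46.0999 seconds "
    "during MDS startup"
)
print(parse_eviction(line))
```

Matching on the message text rather than on column positions keeps the helper usable for lines that carry a different prefix, such as the same warning relayed through teuthology.log.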

(And others from that run, on stock RHEL 8.3 kernels.)

relevant lines from kernel log:

2020-12-02T12:03:53.267177+00:00 smithi047 kernel: ceph: mds0 reconnect start
2020-12-02T12:03:53.293238+00:00 smithi047 kernel: libceph: mds0 (1)172.21.15.83:6835 socket error on write
2020-12-02T12:04:42.388134+00:00 smithi047 kernel: ceph: mds0 recovery completed

From: /ceph/teuthology-archive/pdonnell-2020-12-02_07:09:18-fs-wip-pdonnell-testing-20201202.050726-distro-basic-smithi/5674936/remote/smithi047/syslog/kern.log.gz
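To line up the client and MDS views of the reconnect window, the timestamps quoted above can be subtracted directly. This is a rough sketch that assumes the clocks on smithi047 and smithi083 are synchronized closely enough for a cross-host comparison; the variable and helper names are illustrative.

```python
from datetime import datetime

# strptime's %z accepts both "+0000" (MDS log) and "+00:00" (kernel log),
# and %f accepts the 3-digit and 6-digit fractional seconds seen above.
FMT = "%Y-%m-%dT%H:%M:%S.%f%z"


def ts(stamp):
    """Parse one of the log timestamps quoted in this report."""
    return datetime.strptime(stamp, FMT)


def elapsed(start, end):
    """Seconds from start to end."""
    return (end - start).total_seconds()


# Kernel log on the client (smithi047).
client_reconnect_start = ts("2020-12-02T12:03:53.267177+00:00")
client_socket_error = ts("2020-12-02T12:03:53.293238+00:00")
client_recovery_done = ts("2020-12-02T12:04:42.388134+00:00")
# MDS log (ceph-mds.b.log): reconnect gives up on client.4564.
mds_gives_up = ts("2020-12-02T12:04:39.361+0000")

print("socket error after reconnect start: %.6f s"
      % elapsed(client_reconnect_start, client_socket_error))
print("MDS gives up after reconnect start: %.6f s"
      % elapsed(client_reconnect_start, mds_gives_up))
print("client reports recovery after MDS gave up: %.6f s"
      % elapsed(mds_gives_up, client_recovery_done))
```

The middle figure comes out close to the 46.0999 seconds in the eviction warning, which fits the client starting its reconnect at roughly the time the MDS began waiting, hitting the socket error almost immediately, and not completing the reconnect before the MDS gave up.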

Related issues 1 (0 open — 1 closed)

Updated by Patrick Donnelly over 3 years ago

relevant ECONNRESET:

2020-12-02T12:03:53.283+0000 7f965face700  1 -- [v2:172.21.15.83:6834/3199959640,v1:172.21.15.83:6835/3199959640] >> v1:172.21.15.47:0/603539598 conn(0x559ae240fc00 legacy=0x559ae21eb800 unknown :6835 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2020-12-02T12:03:53.283+0000 7f965face700  1 --1- [v2:172.21.15.83:6834/3199959640,v1:172.21.15.83:6835/3199959640] >> v1:172.21.15.47:0/603539598 conn(0x559ae240fc00 0x559ae21eb800 :6835 s=OPENED pgs=22 cs=1 l=0).handle_message read tag failed

Updated by Jeff Layton over 3 years ago

I wonder if this is the same problem as https://tracker.ceph.com/issues/47563? What kernel was the client running?

Updated by Jeff Layton over 3 years ago

Answering my own question, looks like: 4.18.0-240.1.1.el8_3.x86_64. I'd be interested to see if this is still a problem with recent testing kernels.

Updated by Patrick Donnelly over 3 years ago

Description updated

Updated by Patrick Donnelly over 3 years ago

Jeff Layton wrote:

Answering my own question, looks like: 4.18.0-240.1.1.el8_3.x86_64. I'd be interested to see if this is still a problem with recent testing kernels.

We talked on IRC, but for posterity: it happens somewhat reliably, and only with the stock RHEL 8.3 kernel.

Updated by Patrick Donnelly over 3 years ago

Added relation: Related to Bug #47563 (qa: kernel client closes session improperly causing eviction due to timeout)

Updated by Patrick Donnelly over 3 years ago

Target version changed from v16.0.0 to v17.0.0
Backport set to pacific,octopus,nautilus

Updated by Deepika Upadhyay about 3 years ago

@Patrick I am seeing this issue on Ubuntu 18.04; it doesn't seem to be related to the PR under test.
I'm not sure it's the same issue, since you pointed out it might be specific to RHEL.
Can you take a look:

 | egrep -v '\(REQUEST_SLOW\)' | egrep -v '\(TOO_FEW_PGS\)' | egrep -v 'slow request' | head -n 1
2021-04-01T18:05:48.226 INFO:teuthology.orchestra.run.smithi120.stdout:2021-04-01T17:37:06.614214+0000 mds.a (mds.0) 1 : cluster [WRN] evicting unresponsive client smithi167: (104132), after waiting 49.6717 seconds during MDS startup

  description: rados/upgrade/mimic-x-singleton/{0-cluster/{openstack start} 1-install/mimic
    2-partial-upgrade/firsthalf 3-thrash/default 4-workload/{rbd-cls rbd-import-export
    readwrite snaps-few-objects} 5-workload/{radosbench rbd_api} 6-finish-upgrade 7-octopus
    8-workload/{rbd-python snaps-many-objects} bluestore-bitmap thrashosds-health ubuntu_latest}

/ceph/teuthology-archive/yuriw-2021-04-01_15:23:17-rados-wip-yuri-testing-2021-03-31-1516-octopus-distro-basic-smithi/6014972/teuthology.log
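The visible tail of the pipeline above drops known-benign warnings and keeps the first remaining line. A minimal Python sketch of that filtering step follows; it mirrors only the three exclusions shown, because the start of the pipeline is cut off in the comment. The function name and the two filler log lines are hypothetical.

```python
import re

# The three patterns excluded by the visible `egrep -v` stages.
IGNORE = [r"\(REQUEST_SLOW\)", r"\(TOO_FEW_PGS\)", r"slow request"]


def first_unfiltered(lines, ignore=IGNORE):
    """Return the first line matching none of the ignore patterns (like `head -n 1`)."""
    patterns = [re.compile(p) for p in ignore]
    for line in lines:
        if not any(p.search(line) for p in patterns):
            return line
    return None


# The first two lines are made-up filler; the third is the warning quoted above.
sample = [
    "cluster [WRN] Health check failed: 1 slow ops (REQUEST_SLOW)",
    "cluster [WRN] Health check failed: too few PGs per OSD (TOO_FEW_PGS)",
    "2021-04-01T17:37:06.614214+0000 mds.a (mds.0) 1 : cluster [WRN] evicting "
    "unresponsive client smithi167: (104132), after waiting 49.6717 seconds "
    "during MDS startup",
]
print(first_unfiltered(sample))
```

Because only the first surviving line is reported, a run that fails this check shows whichever unexpected warning was logged earliest, which may not be the only one.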

Updated by Patrick Donnelly about 3 years ago

Deepika Upadhyay wrote:

@Patrick I am seeing this issue on Ubuntu 18.04; it doesn't seem to be related to the PR under test.
I'm not sure it's the same issue, since you pointed out it might be specific to RHEL.
Can you take a look:
[...]
[...]
/ceph/teuthology-archive/yuriw-2021-04-01_15:23:17-rados-wip-yuri-testing-2021-03-31-1516-octopus-distro-basic-smithi/6014972/teuthology.log

Probably just the ceph-mgr getting blocklisted. I think it's a different issue.

Updated by Jeff Layton over 2 years ago

Status changed from New to Resolved

I think this was fixed upstream with commit 62575e270f661aba64778cbc5f354511cf9abb21, and that got backported to RHEL8.4 kernels.
