Bug #44383

open

qa: MDS_CLIENT_LATE_RELEASE during MDS thrashing

Added by Patrick Donnelly about 4 years ago. Updated almost 2 years ago.

Status: New
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: pacific,octopus,nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): kceph, qa-suite
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2020-02-29T08:18:13.932 INFO:tasks.mds_thrash.fs.[cephfs]:mds.a has gained rank=0, replacing gid=4541
2020-02-29T08:18:13.932 INFO:tasks.mds_thrash.fs.[cephfs]:waiting for 9 secs before reviving mds.b
...
2020-02-29T08:22:55.701 INFO:teuthology.orchestra.run.smithi197.stdout:2020-02-29T08:19:18.807316+0000 mon.b (mon.0) 202 : cluster [WRN] Health check failed: 1 clients failing to respond to capability release (MDS_CLIENT_LATE_RELEASE)

From: /ceph/teuthology-archive/pdonnell-2020-02-29_02:56:38-kcephfs-wip-pdonnell-testing-20200229.001503-distro-basic-smithi/4810918/1$

Failure: "2020-02-29T08:19:18.807316+0000 mon.b (mon.0) 202 : cluster [WRN] Health check failed: 1 clients failing to respond to capability release (MDS_CLIENT_LATE_RELEASE)" in cluster log
2 jobs: ['4810918', '4811103']
suites intersection: ['clusters/1-mds-1-client.yaml', 'conf/{client.yaml', 'k-testing.yaml}', 'kcephfs/thrash/{begin.yaml', 'kclient/{mount.yaml', 'log-config.yaml', 'mds.yaml', 'mon.yaml', 'ms-die-on-skipped.yaml}}', 'objectstore-ec/filestore-xfs.yaml', 'osd-asserts.yaml', 'osd.yaml}', 'overrides/{frag_enable.yaml', 'thrash-health-whitelist.yaml', 'thrashers/mds.yaml', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}', 'workloads/kclient_workunit_suites_ffsb.yaml}']
suites union: ['clusters/1-mds-1-client.yaml', 'conf/{client.yaml', 'k-testing.yaml}', 'kcephfs/thrash/{begin.yaml', 'kclient/{mount.yaml', 'log-config.yaml', 'mds.yaml', 'mon.yaml', 'ms-die-on-skipped.yaml}}', 'objectstore-ec/filestore-xfs.yaml', 'osd-asserts.yaml', 'osd.yaml}', 'overrides/{distro/testing/{flavor/centos_latest.yaml', 'overrides/{distro/testing/{flavor/ubuntu_latest.yaml', 'overrides/{frag_enable.yaml', 'thrash-health-whitelist.yaml', 'thrashers/mds.yaml', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}', 'workloads/kclient_workunit_suites_ffsb.yaml}']

This looks like a regression in the testing kernel.

Actions #2

Updated by Jeff Layton about 4 years ago

I saw this in my test run too at http://pulpito.ceph.com/jlayton-2020-03-06_16:21:14-kcephfs-master-distro-basic-smithi/4831121/

Trawling through the MDS logs, I see this:

smithi045/log/ceph-mds.a.log.gz:2020-03-06T17:33:38.600+0000 7f35674fb700  0 log_channel(cluster) log [WRN] : client.4639 isn't responding to mclientcaps(revoke), ino 0x10000000367 pending pAsLsXsFscb issued pAsLsXsFscb, sent 64.587981 seconds ago
smithi045/log/ceph-mds.a.log.gz:2020-03-06T17:33:38.600+0000 7f35674fb700 20 mds.0.locker caps_tick client.4639 isn't responding to mclientcaps(revoke), ino 0x10000000367 pending pAsLsXsFscb issued pAsLsXsFscb, sent 64.587981 seconds ago
smithi045/log/ceph-mds.a.log.gz:2020-03-06T17:33:38.600+0000 7f35674fb700  0 log_channel(cluster) log [WRN] : client.4639 isn't responding to mclientcaps(revoke), ino 0x100000003c0 pending pAsLsXsFsc issued pAsLsXsFsc, sent 64.578919 seconds ago
smithi045/log/ceph-mds.a.log.gz:2020-03-06T17:33:38.600+0000 7f35674fb700 20 mds.0.locker caps_tick client.4639 isn't responding to mclientcaps(revoke), ino 0x100000003c0 pending pAsLsXsFsc issued pAsLsXsFsc, sent 64.578919 seconds ago
2020-03-06T17:35:25.499+0000 7f058760d700  0 log_channel(cluster) log [WRN] : client.4639 isn't responding to mclientcaps(revoke), ino 0x100000003af pending pAsxLsXsxFsxcrwb issued pAsxLsXsxFsxcrwb, sent 61.508612 seconds ago
2020-03-06T17:35:25.499+0000 7f058760d700 20 mds.0.locker caps_tick client.4639 isn't responding to mclientcaps(revoke), ino 0x100000003af pending pAsxLsXsxFsxcrwb issued pAsxLsXsxFsxcrwb, sent 61.508612 seconds ago

...the pending and issued masks are identical in all 3 cases. Shouldn't they be different if there really is a cap revoke?
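
To make the comparison above easier to repeat on other runs, here is a minimal sketch that pulls the client, inode, and the pending and issued masks out of a "isn't responding to mclientcaps(revoke)" line and flags whether the two masks match. The regular expression is inferred only from the log lines quoted in this comment, and the helper name `parse_revoke_warning` is hypothetical; it is not part of Ceph or teuthology.

```python
import re

# Pattern inferred from the MDS log lines quoted above (an assumption,
# not an official log format specification).
REVOKE_RE = re.compile(
    r"client\.(?P<client>\d+) isn't responding to mclientcaps\(revoke\), "
    r"ino (?P<ino>0x[0-9a-f]+) "
    r"pending (?P<pending>[A-Za-z-]+) "
    r"issued (?P<issued>[A-Za-z-]+), "
    r"sent (?P<age>[\d.]+) seconds ago"
)


def parse_revoke_warning(line):
    """Return a dict describing a late-revoke warning, or None if no match."""
    match = REVOKE_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["age"] = float(info["age"])
    # A real revoke would normally be expected to leave these two different.
    info["masks_identical"] = info["pending"] == info["issued"]
    return info


# One of the lines quoted above, copied verbatim.
sample = (
    "smithi045/log/ceph-mds.a.log.gz:2020-03-06T17:33:38.600+0000 7f35674fb700  0 "
    "log_channel(cluster) log [WRN] : client.4639 isn't responding to "
    "mclientcaps(revoke), ino 0x10000000367 pending pAsLsXsFscb issued "
    "pAsLsXsFscb, sent 64.587981 seconds ago"
)

if __name__ == "__main__":
    print(parse_revoke_warning(sample))
```

Running this over every matching line in an MDS log (for example via `zcat ceph-mds.a.log.gz` piped into a small loop) would show whether any warning in the run has differing pending and issued masks.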

Actions #3

Updated by Patrick Donnelly about 4 years ago

  • Target version changed from v15.0.0 to v16.0.0

Actions #4

Updated by Patrick Donnelly over 3 years ago

  • Target version changed from v16.0.0 to v17.0.0
  • Backport set to pacific,octopus,nautilus

Actions #5

Updated by Patrick Donnelly almost 2 years ago

  • Target version deleted (v17.0.0)