Bug #13334

delayed revoke warning in test_client_recovery test

Added by Greg Farnum almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: Testing
Target version: -
Start date: 10/02/2015
Due date:
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:

Description

http://pulpito.ceph.com/teuthology-2015-09-29_23:04:01-fs-infernalis---basic-multi/1077881/

2015-10-01T03:13:41.515 INFO:teuthology.run:Summary data:
{description: 'fs/recovery/{clusters/2-remote-clients.yaml debug/mds_client.yaml dirfrag/frag_enable.yaml
    mounts/ceph-fuse.yaml tasks/client-recovery.yaml}', duration: 1106.058995962143,
  failure_reason: '"2015-10-01 03:06:53.013570 mds.0 10.214.134.104:6806/20139 5 :
    cluster [WRN] client.4537 isn''t responding to mclientcaps(revoke), ino 10000000000
    pending pAsxLsXsxFcb issued pAsxLsXsxFsxcrwb, sent 60.308408 seconds ago" in cluster
    log', flavor: basic, owner: scheduled_teuthology@teuthology, success: false}

I think maybe we just need to whitelist this warning, since we do a lot of skewing around. But perhaps something has gone horribly wrong.

Associated revisions

Revision e2e1bd9c (diff)
Added by John Spray almost 4 years ago

mds: avoid emitting cap warnings before evicting session

In the case where a client dies, and another client immediately
tries to access a file locked by the dead client, we would
previously sometimes emit a "client.xyz isn't responding to
mclientcaps" warning to the cluster log, right before
evicting the stale session. This was because the timeout
for the session eviction and the timeout for the
warning message are both 60s.

Fix this by checking the stale sessions before doing the
warning message check in Locker. If a session is going
to get evicted in this tick, it will already be gone
by the time Locker thinks about emitting the warning
message.

Fixes: #13334
Signed-off-by: John Spray <>

History

#1 Updated by John Spray almost 4 years ago

Yeah, this is racy because mds_session_timeout is 60s, and so is the threshold for emitting that warning.

Actually, we should probably change the timeouts in Ceph. With the two so close together, when a client dies and another client then tries to access a file held by the dead client, users will see the same nondeterministic behaviour: sometimes they get the "isn't responding to" message before the client is evicted, and sometimes they don't.

But yeah, the test should whitelist this message anyway.
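
For illustration, whitelisting amounts to something like the following. This is a toy model in Python, not teuthology's actual log-scanning code: the function name `unexpected_warnings` and the pattern in `WHITELIST` are hypothetical, and the sample line is the warning quoted in the description.

```python
import re

# Hypothetical whitelist entry matching the warning from this report.
WHITELIST = [r"isn't responding to mclientcaps\(revoke\)"]

# The cluster log line quoted in the failure_reason above.
SAMPLE = (
    "2015-10-01 03:06:53.013570 mds.0 10.214.134.104:6806/20139 5 : "
    "cluster [WRN] client.4537 isn't responding to mclientcaps(revoke), "
    "ino 10000000000 pending pAsxLsXsxFcb issued pAsxLsXsxFsxcrwb, "
    "sent 60.308408 seconds ago"
)


def unexpected_warnings(log_lines, whitelist):
    """Return warning/error lines that match no whitelisted pattern."""
    patterns = [re.compile(p) for p in whitelist]
    flagged = []
    for line in log_lines:
        # Only lines tagged as warnings or errors can fail the run.
        if not re.search(r"\[(WRN|ERR|SEC)\]", line):
            continue
        # A whitelisted line is tolerated and does not fail the run.
        if any(p.search(line) for p in patterns):
            continue
        flagged.append(line)
    return flagged
```

With an empty whitelist the sample line is flagged and the run would fail; with `WHITELIST` it is ignored, while any other warning is still reported.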

#2 Updated by John Spray almost 4 years ago

Oh, it's even simpler. Can just switch the order of locker->tick and server->find_idle_sessions to get rid of this behaviour when the timeouts are the same.
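
A rough simulation of why the order matters, assuming both thresholds are 60s as stated above. This is a toy model in Python, not the MDS code: `ToyMDS`, `Session`, `locker_tick` and `run` are hypothetical stand-ins, and only the idea of running the idle-session check before the Locker warning check comes from this ticket.

```python
from dataclasses import dataclass, field
from typing import Optional

TIMEOUT = 60.0  # assumed: session timeout and warning threshold are both 60s


@dataclass
class Session:
    client: str
    last_seen: float                     # last time the client renewed its session
    revoke_sent: Optional[float] = None  # when a cap revoke was sent, if any


@dataclass
class ToyMDS:
    sessions: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def find_idle_sessions(self, now):
        # Evict every session that has been silent for the full timeout.
        stale = [n for n, s in self.sessions.items() if now - s.last_seen >= TIMEOUT]
        for name in stale:
            del self.sessions[name]
            self.log.append(f"evict {name}")

    def locker_tick(self, now):
        # Warn about sessions that have sat on a revoke for the full threshold.
        for s in self.sessions.values():
            if s.revoke_sent is not None and now - s.revoke_sent >= TIMEOUT:
                self.log.append(f"warn {s.client}")

    def tick(self, now, evict_first):
        steps = [self.find_idle_sessions, self.locker_tick]
        if not evict_first:
            steps.reverse()
        for step in steps:
            step(now)


def run(evict_first):
    mds = ToyMDS()
    # The client dies at t=0 and a revoke is sent to it at about the same time.
    mds.sessions["client.4537"] = Session("client.4537", last_seen=0.0, revoke_sent=0.0)
    mds.tick(now=60.5, evict_first=evict_first)
    return mds.log
```

In this model `run(evict_first=False)` logs the warning and then the eviction in the same tick, while `run(evict_first=True)` logs only the eviction, because the session is already gone when the warning check runs.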

#3 Updated by Greg Farnum almost 4 years ago

  • Status changed from New to Resolved
