Bug #13334: delayed revoke warning in test_client_recovery test - CephFS - Ceph

closed

Added by Greg Farnum over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: High
Category: Testing
Source: Q/A
Severity: 3 - minor

Description

http://pulpito.ceph.com/teuthology-2015-09-29_23:04:01-fs-infernalis---basic-multi/1077881/

2015-10-01T03:13:41.515 INFO:teuthology.run:Summary data:
{description: 'fs/recovery/{clusters/2-remote-clients.yaml debug/mds_client.yaml dirfrag/frag_enable.yaml
    mounts/ceph-fuse.yaml tasks/client-recovery.yaml}', duration: 1106.058995962143,
  failure_reason: '"2015-10-01 03:06:53.013570 mds.0 10.214.134.104:6806/20139 5 :
    cluster [WRN] client.4537 isn''t responding to mclientcaps(revoke), ino 10000000000
    pending pAsxLsXsxFcb issued pAsxLsXsxFsxcrwb, sent 60.308408 seconds ago" in cluster
    log', flavor: basic, owner: scheduled_teuthology@teuthology, success: false}

I think maybe we just need to whitelist this warning, since we do a lot of skewing around. But perhaps something has gone horribly wrong.

Updated by John Spray over 8 years ago

Yeah, this is racy because mds_session_timeout is 60s, and so is the threshold for emitting that warning.

Actually, we should probably change the timeouts in Ceph. Having those two so close together means that when a client dies and another client then tries to access a file the dead client held, users will see the same nondeterministic behaviour: sometimes they get the "isn't responding to" message before the client is evicted, and sometimes they don't.

But yeah, the test should whitelist this message anyway.
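
For illustration, here is a rough sketch of what whitelisting amounts to: scan the cluster log for warning-level lines and ignore any that match an allowed pattern. The helper name `first_unexpected`, the severity regex, and the whitelist pattern are assumptions made for this example, not the actual teuthology code or the pattern that was eventually committed.

```python
import re

# Severity markers treated as failures in this sketch (an assumption).
SEVERITY = re.compile(r"\[(ERR|WRN|SEC)\]")


def first_unexpected(log_lines, whitelist):
    """Return the first ERR/WRN/SEC line not matched by any whitelist regex.

    Returns None when every flagged line is whitelisted.
    """
    allowed = [re.compile(pattern) for pattern in whitelist]
    for line in log_lines:
        if SEVERITY.search(line) and not any(p.search(line) for p in allowed):
            return line
    return None


# The warning from the failed run, joined onto one line.
cluster_log = [
    "2015-10-01 03:06:53.013570 mds.0 10.214.134.104:6806/20139 5 : "
    "cluster [WRN] client.4537 isn't responding to mclientcaps(revoke), "
    "ino 10000000000 pending pAsxLsXsxFcb issued pAsxLsXsxFsxcrwb, "
    "sent 60.308408 seconds ago",
]

# Hypothetical whitelist entry; parentheses are escaped because it is a regex.
revoke_pattern = r"responding to mclientcaps\(revoke\)"

without_whitelist = first_unexpected(cluster_log, [])
with_whitelist = first_unexpected(cluster_log, [revoke_pattern])
```

Without the whitelist entry the warning is reported as the failure reason; with it, the scan comes back clean.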

Updated by John Spray over 8 years ago

Oh, it's even simpler. Can just switch the order of locker->tick and server->find_idle_sessions to get rid of this behaviour when the timeouts are the same.
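
To make the ordering argument concrete, here is a toy model of a single MDS tick. It is a sketch only: `ToyMDS`, `run_tick`, and the log strings are hypothetical and do not mirror the real `Locker` or `Server` classes. It also assumes the worst case, where the revoke was sent at the same moment the session was last renewed, so both 60s timers expire on the same tick.

```python
SESSION_TIMEOUT = 60.0    # mds_session_timeout, per the thread
REVOKE_WARN_AFTER = 60.0  # revoke-warning threshold, per the thread


class ToyMDS:
    """Hypothetical stand-in for the MDS tick loop; not Ceph code."""

    def __init__(self, client, now=0.0):
        self.last_seen = {client: now}    # session -> last renewal time
        self.revoke_sent = {client: now}  # session -> time the revoke was sent
        self.log = []

    def locker_tick(self, now):
        # Warn about every outstanding revoke that is older than the threshold.
        for client, sent in self.revoke_sent.items():
            if now - sent >= REVOKE_WARN_AFTER:
                self.log.append(f"warn {client}")

    def find_idle_sessions(self, now):
        # Evict stale sessions and drop their outstanding revokes.
        for client, seen in list(self.last_seen.items()):
            if now - seen >= SESSION_TIMEOUT:
                del self.last_seen[client]
                self.revoke_sent.pop(client, None)
                self.log.append(f"evict {client}")


def run_tick(locker_first, now=60.0):
    """Run one tick at time `now` with the two steps in the chosen order."""
    mds = ToyMDS("client.4537")
    steps = [mds.locker_tick, mds.find_idle_sessions]
    if not locker_first:
        steps.reverse()
    for step in steps:
        step(now)
    return mds.log


locker_first_log = run_tick(locker_first=True)
idle_first_log = run_tick(locker_first=False)
```

In this model, running the locker step first logs the warning and then the eviction, while running the idle-session check first evicts the client before the warning can fire.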
