Bug #42348

TestClientRecovery.test_dont_mark_unresponsive_client_stale failure

Added by Venky Shankar 4 months ago. Updated 4 months ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature:

Description

Saw this in mimic, seems to exist in master too: http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-16_13:28:41-fs-wip-yuri-testing-2019-10-15-1629-mimic-testing-basic-smithi/4416565/teuthology.log

2019-10-16T17:13:50.319 INFO:tasks.cephfs_test_runner:======================================================================
2019-10-16T17:13:50.319 INFO:tasks.cephfs_test_runner:FAIL: test_dont_mark_unresponsive_client_stale (tasks.cephfs.test_client_recovery.TestClientRecovery)
2019-10-16T17:13:50.319 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2019-10-16T17:13:50.319 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2019-10-16T17:13:50.319 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-testing-2019-10-15-1629-mimic/qa/tasks/cephfs/test_client_recovery.py", line 558, in test_dont_mark_unresponsive_client_stale
2019-10-16T17:13:50.320 INFO:tasks.cephfs_test_runner:    self.assert_session_count(1, self.fs.mds_asok(['session', 'ls']))
2019-10-16T17:13:50.320 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri-testing-2019-10-15-1629-mimic/qa/tasks/cephfs/cephfs_test_case.py", line 213, in assert_session_count
2019-10-16T17:13:50.320 INFO:tasks.cephfs_test_runner:    expected, alive_count
2019-10-16T17:13:50.320 INFO:tasks.cephfs_test_runner:AssertionError: Expected 1 sessions, found 2

This comes from this part of the test:

        # test that other clients have to wait to get the caps from
        # unresponsive client until session_autoclose.
        self.mount_b.run_shell(['stat', 'dir'])
        self.assert_session_count(1, self.fs.mds_asok(['session', 'ls']))
        self.assertLess(time.time(), time_at_beg + SESSION_AUTOCLOSE)

I think we can just delay fetching the session list until after the session has been auto-closed, since `evict_client()` is invoked as:

    if (g_conf->mds_session_blacklist_on_timeout) {
      std::stringstream ss;
      mds->evict_client(session->get_client().v, false, true,
                        ss, nullptr);
    } else {
      kill_session(session, NULL);
    }
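
A minimal sketch of the "delay fetching the session list" idea, in Python to match the test code. It polls until the session count reaches the expected value instead of asserting on the first fetch. `wait_for_session_count` and its parameters are hypothetical and not part of the qa suite. It also counts every entry returned by the fetch callable, whereas `assert_session_count` counts alive sessions.

```python
import time


def wait_for_session_count(fetch_sessions, expected, timeout=30.0, interval=1.0):
    """Poll fetch_sessions() until it returns `expected` entries.

    fetch_sessions is any zero-argument callable returning a list of
    sessions (hypothetical stand-in for the 'session ls' asok call).
    Raises AssertionError if the count does not match before `timeout`
    seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        count = len(fetch_sessions())
        if count == expected:
            return count
        if time.monotonic() >= deadline:
            raise AssertionError(
                "Expected {0} sessions, found {1}".format(expected, count))
        # Give the MDS time to finish evicting the stale session.
        time.sleep(interval)
```

In the test this might be called as `wait_for_session_count(lambda: self.fs.mds_asok(['session', 'ls']), 1)`. The timeout would have to stay small enough that the following `assertLess` check against `SESSION_AUTOCLOSE` still holds.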

@Patrick, @Rishabh

History

#1 Updated by Venky Shankar 4 months ago

This was not as straightforward as I suggested. However, it's due to PR https://github.com/ceph/ceph/pull/28585 not being included in the list of (mimic) backports to test but the test case for this PR was included. The test case was a separate PR in master: https://github.com/ceph/ceph/pull/22645.

The interesting part is that the test passed once out of 20-odd runs locally with vstart_runner.

Would marking the tracker tickets as dependent lessen the chances of running into this problem in the future?

#2 Updated by Patrick Donnelly 4 months ago

Venky Shankar wrote:

This was not as straightforward as I suggested. However, it's due to PR https://github.com/ceph/ceph/pull/28585 not being included in the list of (mimic) backports to test but the test case for this PR was included. The test case was a separate PR in master: https://github.com/ceph/ceph/pull/22645.

The interesting part is that the test passed once out of 20-odd runs locally with vstart_runner.

Would marking the tracker tickets as dependent lessen the chances of running into this problem in the future?

I think normally we'd just merge the backports into one PR but I forgot to write a note about that. Sorry!

#3 Updated by Patrick Donnelly 4 months ago

  • Status changed from New to Rejected

Just needs another backport.
