Bug #42213

test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active"

Added by Venky Shankar 6 months ago. Updated 6 months ago.

Status: Pending Backport
Priority: High
Assignee:
Category: -
Target version:
% Done: 0%
Source: Community (dev)
Tags:
Backport: nautilus,mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature:

Description

Seen here: http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-02_14:24:11-kcephfs-wip-yuri6-testing-2019-10-01-1605-nautilus-testing-basic-smithi/4351999/teuthology.log

The MDS reached the `reject` state ("up:active") instead of the expected "up:reconnect" state:

2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:ERROR: test_reconnect_eviction (tasks.cephfs.test_client_recovery.TestClientRecovery)
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri6-testing-2019-10-01-1605-nautilus/qa/tasks/cephfs/test_client_recovery.py", line 193, in test_reconnect_eviction
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:    self.fs.wait_for_state('up:reconnect', reject='up:active', timeout=MDS_RESTART_GRACE)
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri6-testing-2019-10-01-1605-nautilus/qa/tasks/cephfs/filesystem.py", line 1016, in wait_for_state
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner:    raise RuntimeError("MDS in reject state {0}".format(current_state))
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner:RuntimeError: MDS in reject state up:active
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner:
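
For readers unfamiliar with the helper in the traceback, the following is a minimal sketch of how a polling wait with a `reject` state might behave. It is not the code in qa/tasks/cephfs/filesystem.py: the `get_state`, `interval`, `sleep` and `clock` parameters and the timeout message are assumptions made for illustration. Only the "MDS in reject state" message is taken from the log above.

```python
import time


def wait_for_state(get_state, goal, reject=None, timeout=30.0, interval=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns `goal`.

    Hypothetical stand-in for the QA helper: `get_state` is any callable
    returning the current MDS state string (an assumption for this sketch).
    """
    deadline = clock() + timeout
    while True:
        current_state = get_state()
        if current_state == goal:
            return current_state
        # Fail fast if the MDS is seen in a state the caller rules out.
        if reject is not None and current_state == reject:
            raise RuntimeError("MDS in reject state {0}".format(current_state))
        if clock() >= deadline:
            raise RuntimeError(
                "timed out waiting for {0}, last state {1}".format(goal, current_state))
        sleep(interval)


# Usage: the goal state is observed before the reject state.
states = iter(["up:replay", "up:reconnect"])
print(wait_for_state(lambda: next(states), "up:reconnect",
                     reject="up:active", sleep=lambda s: None))
```

A poller of this kind only sees the state at each sample, so if the MDS is already in "up:active" when it is sampled, the call raises even if "up:reconnect" was passed through briefly between samples.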

Related issues

Related to fs - Bug #40999: qa: AssertionError: u'open' != 'stale' (Resolved)
Copied to fs - Backport #42421: mimic: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" (New)
Copied to fs - Backport #42422: nautilus: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" (Resolved)

History

#1 Updated by Patrick Donnelly 6 months ago

  • Related to Bug #40999: qa: AssertionError: u'open' != 'stale' added

#2 Updated by Patrick Donnelly 6 months ago

  • Subject changed from nautilus: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" to test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active"
  • Assignee set to Venky Shankar
  • Priority changed from Normal to High

This looks like the same problem as #40999. Can't verify because there are no MDS logs. The issue is that the hard reset of the kernel client machine is not immediate. The MDS comes back fast enough (<5 seconds) that the kernel client still has time to reconnect before its node is powered off.

Venky, can you quickly check for any other places in this test file where this race occurs and correct them? The race will also exist on master.
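
The race described in this comment can be pictured with a toy timing model. This is a deliberate simplification, not Ceph code: the function name, parameters and example timings are hypothetical, and it assumes a single client whose successful reconnect lets the MDS leave "up:reconnect" and reach "up:active".

```python
def observed_state(mds_restart_s, client_poweroff_s):
    """Return the MDS state the test would eventually observe.

    mds_restart_s:     seconds until the restarted MDS accepts reconnects
    client_poweroff_s: seconds until the client node is actually powered off
    Both values are illustrative inputs, not measurements from the failed run.
    """
    # The client can only reconnect if it is still alive when the MDS is back.
    client_still_alive = client_poweroff_s > mds_restart_s
    if client_still_alive:
        # Client reconnects; the MDS has no one left to wait for.
        return "up:active"
    # Client is gone; the MDS keeps waiting for it in the reconnect state.
    return "up:reconnect"


# MDS back in 4 s, power-off takes 6 s: the client wins the race.
print(observed_state(4.0, 6.0))
# MDS back in 4 s, power-off completes after 1 s: the test sees what it expects.
print(observed_state(4.0, 1.0))
```

Under this model the test passes only when the power-off completes before the MDS returns, which is why a delayed hard reset makes the outcome timing dependent.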

#3 Updated by Patrick Donnelly 6 months ago

  • Target version changed from v14.2.5 to v15.0.0
  • Backport set to nautilus,mimic

#4 Updated by Venky Shankar 6 months ago

Patrick Donnelly wrote:

This looks like the same problem as #40999. Can't verify because there are no MDS logs. The issue is that the hard reset of the kernel client machine is not immediate. The MDS comes back fast enough (<5 seconds) that the kernel client still has time to reconnect before its node is powered off.

Venky, can you quickly check for any other places in this test file where this race occurs and correct them? The race will also exist on master.

ACK -- I'll take a look.

#5 Updated by Venky Shankar 6 months ago

There's one more instance of this in test_reconnect_eviction() -- that needs fixing too. I'll push a PR.

#6 Updated by Venky Shankar 6 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 30986

#7 Updated by Patrick Donnelly 6 months ago

  • Status changed from Fix Under Review to Pending Backport

#8 Updated by Nathan Cutler 6 months ago

  • Copied to Backport #42421: mimic: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" added

#9 Updated by Nathan Cutler 6 months ago

  • Copied to Backport #42422: nautilus: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" added
