Bug #42213

test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active"

Added by Venky Shankar about 1 year ago. Updated 7 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
% Done: 0%
Source: Community (dev)
Tags:
Backport: nautilus,mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature:

Description

Seen here: http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-02_14:24:11-kcephfs-wip-yuri6-testing-2019-10-01-1605-nautilus-testing-basic-smithi/4351999/teuthology.log

The MDS reached the `reject` state ("up:active") instead of the expected "up:reconnect" state.

2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:ERROR: test_reconnect_eviction (tasks.cephfs.test_client_recovery.TestClientRecovery)
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri6-testing-2019-10-01-1605-nautilus/qa/tasks/cephfs/test_client_recovery.py", line 193, in test_reconnect_eviction
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:    self.fs.wait_for_state('up:reconnect', reject='up:active', timeout=MDS_RESTART_GRACE)
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri6-testing-2019-10-01-1605-nautilus/qa/tasks/cephfs/filesystem.py", line 1016, in wait_for_state
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner:    raise RuntimeError("MDS in reject state {0}".format(current_state))
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner:RuntimeError: MDS in reject state up:active
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner:
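To make the failure mode concrete, here is a minimal, self-contained sketch of a "wait for goal state, fail fast on reject state" poll loop. It is not the actual `wait_for_state` from `qa/tasks/cephfs/filesystem.py`; only the call shape (`goal`, `reject`, `timeout`) and the error message are taken from the traceback above. The `get_state` callable, the `interval` and `sleep` parameters, and the timeout message are assumptions made for illustration.

```python
import time


def wait_for_state(get_state, goal, reject=None, timeout=30.0,
                   interval=1.0, sleep=time.sleep):
    """Poll get_state() until it returns `goal`.

    Raises RuntimeError as soon as the `reject` state is observed, or
    once `timeout` seconds of polling have elapsed. `get_state`,
    `interval` and `sleep` are hypothetical parameters for this sketch.
    Returns the number of seconds spent sleeping.
    """
    elapsed = 0.0
    while True:
        current_state = get_state()
        if current_state == goal:
            return elapsed
        if reject is not None and current_state == reject:
            # Same message as in the traceback above.
            raise RuntimeError(
                "MDS in reject state {0}".format(current_state))
        if elapsed >= timeout:
            raise RuntimeError(
                "Timed out waiting for {0}, last state was {1}".format(
                    goal, current_state))
        sleep(interval)
        elapsed += interval
```

With such a loop, an MDS that passes through "up:reconnect" between two polls, or skips waiting in it because the client reconnected promptly, is first observed in "up:active" and the test fails with the error shown in the log.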

Related issues

Related to CephFS - Bug #40999: qa: AssertionError: u'open' != 'stale' (Resolved)
Copied to CephFS - Backport #42421: mimic: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" (Rejected)
Copied to CephFS - Backport #42422: nautilus: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" (Resolved)

History

#1 Updated by Patrick Donnelly about 1 year ago

  • Related to Bug #40999: qa: AssertionError: u'open' != 'stale' added

#2 Updated by Patrick Donnelly about 1 year ago

  • Subject changed from nautilus: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" to test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active"
  • Assignee set to Venky Shankar
  • Priority changed from Normal to High

This looks like the same problem as #40999, but I can't verify that because there are no MDS logs. The issue is that the hard reset of the kernel client machine is not immediate: the MDS comes back fast enough (<5 seconds) that the kernel client still has time to reconnect before its node is powered off.

Venky, can you quickly check whether this race occurs anywhere else in this test file and fix it? The race will also exist on master.
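One general way to close a race of this kind is to confirm that the client is really gone before bringing the MDS back. The sketch below illustrates that ordering only; it is not the change made in the eventual pull request, and `kill_client`, `is_alive`, and `restart_mds` are hypothetical callables standing in for whatever the test harness provides.

```python
import time


def wait_until_down(is_alive, timeout=60.0, interval=1.0, sleep=time.sleep):
    """Poll the hypothetical is_alive() probe until it returns False.

    Returns the seconds spent sleeping; raises RuntimeError if the
    client is still alive after `timeout` seconds.
    """
    elapsed = 0.0
    while is_alive():
        if elapsed >= timeout:
            raise RuntimeError(
                "client still alive after {0} seconds".format(timeout))
        sleep(interval)
        elapsed += interval
    return elapsed


def restart_mds_after_client_death(kill_client, is_alive, restart_mds,
                                   **wait_kwargs):
    """Request the hard reset, wait for it to take effect, then restart.

    Restarting the MDS only after the client is confirmed down means
    the client cannot reconnect during the reconnect window.
    """
    kill_client()                                   # reset is asynchronous
    waited = wait_until_down(is_alive, **wait_kwargs)
    restart_mds()                                   # safe: client is gone
    return waited
```

A fixed sleep after the kill is a simpler alternative, but it only narrows the window; polling for the client's death removes the dependence on how long the power-off takes.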

#3 Updated by Patrick Donnelly about 1 year ago

  • Target version changed from v14.2.5 to v15.0.0
  • Backport set to nautilus,mimic

#4 Updated by Venky Shankar about 1 year ago

Patrick Donnelly wrote:

This looks like the same problem as #40999, but I can't verify that because there are no MDS logs. The issue is that the hard reset of the kernel client machine is not immediate: the MDS comes back fast enough (<5 seconds) that the kernel client still has time to reconnect before its node is powered off.

Venky, can you quickly check whether this race occurs anywhere else in this test file and fix it? The race will also exist on master.

ACK -- I'll take a look.

#5 Updated by Venky Shankar about 1 year ago

There's one more instance of this race in test_reconnect_eviction(), and it needs to be fixed too. I'll push a PR.

#6 Updated by Venky Shankar about 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 30986

#7 Updated by Patrick Donnelly about 1 year ago

  • Status changed from Fix Under Review to Pending Backport

#8 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #42421: mimic: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" added

#9 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #42422: nautilus: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" added

#10 Updated by Nathan Cutler 7 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
