Bug #42213
closed
test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active"
Added by Venky Shankar over 4 years ago.
Updated almost 4 years ago.
Description
seen here: http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-02_14:24:11-kcephfs-wip-yuri6-testing-2019-10-01-1605-nautilus-testing-basic-smithi/4351999/teuthology.log
The MDS reached the `reject` state ("up:active") rather than the expected "up:reconnect":
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:ERROR: test_reconnect_eviction (tasks.cephfs.test_client_recovery.TestClientRecovery)
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri6-testing-2019-10-01-1605-nautilus/qa/tasks/cephfs/test_client_recovery.py", line 193, in test_reconnect_eviction
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner: self.fs.wait_for_state('up:reconnect', reject='up:active', timeout=MDS_RESTART_GRACE)
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri6-testing-2019-10-01-1605-nautilus/qa/tasks/cephfs/filesystem.py", line 1016, in wait_for_state
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner: raise RuntimeError("MDS in reject state {0}".format(current_state))
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner:RuntimeError: MDS in reject state up:active
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner:
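For context, the `wait_for_state` helper in `qa/tasks/cephfs/filesystem.py` polls the MDS state and fails fast if the `reject` state is observed before the goal state. A minimal sketch of that polling logic (simplified and not the actual helper signature, which takes more parameters):

```python
import time


def wait_for_state(get_state, goal, reject=None, timeout=30):
    """Poll get_state() until it returns `goal`.

    Raise RuntimeError if the `reject` state is seen first, or if
    `timeout` seconds elapse without reaching `goal`.
    """
    elapsed = 0
    while True:
        current_state = get_state()
        if current_state == goal:
            return elapsed
        if reject is not None and current_state == reject:
            # This is the branch that produced the failure above: the MDS
            # was already back in up:active before up:reconnect was seen.
            raise RuntimeError("MDS in reject state {0}".format(current_state))
        if elapsed >= timeout:
            raise RuntimeError("Timed out waiting for state {0}".format(goal))
        time.sleep(1)
        elapsed += 1
```

In the failing run, the client reconnected so quickly that the polled state was already `up:active`, taking the reject branch instead of ever observing `up:reconnect`.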
- Related to Bug #40999: qa: AssertionError: u'open' != 'stale' added
- Subject changed from nautilus: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" to test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active"
- Assignee set to Venky Shankar
- Priority changed from Normal to High
This looks like the same problem as #40999. Can't verify because there are no mds logs. The issue is that the hard reset of the kernel client machine is not immediate. The MDS comes back fast enough (<5 seconds) that the kernel client still has time to reconnect before its node is powered off.
Venky, can you quickly check for any other places this race occurs in this test file and correct them? The race will also exist on master.
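The fix direction implied here can be sketched as: after issuing the hard reset, block until the client node is confirmed unreachable before allowing the MDS to come back. The helper below is a hypothetical illustration, not the actual teuthology API; `ping_once` is an assumed probe and the host name is a placeholder:

```python
import subprocess
import time


def ping_once(host):
    """Return True if `host` answers a single ping (1s wait)."""
    return subprocess.call(
        ['ping', '-c', '1', '-W', '1', host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0


def wait_until_unreachable(host, probe=ping_once, timeout=60, interval=1):
    """Block until `probe(host)` reports the host down, or raise on timeout.

    Called after a hard reset so the MDS restart is only attempted once
    the client node is confirmed powered off, closing the race where the
    client reconnects before its node actually goes down.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not probe(host):
            return
        time.sleep(interval)
    raise RuntimeError("host {0} still reachable after reset".format(host))
```

A `probe` parameter is taken so the wait logic can be exercised without real network access; in the test harness the default ping probe (or an equivalent console-level check) would be used.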
- Target version changed from v14.2.5 to v15.0.0
- Backport set to nautilus,mimic
Patrick Donnelly wrote:
This looks like the same problem as #40999. Can't verify because there are no mds logs. The issue is that the hard reset of the kernel client machine is not immediate. The MDS comes back fast enough (<5 seconds) that the kernel client still has time to reconnect before its node is powered off.
Venky, can you quickly check for any other places this race occurs in this test file and correct them? The race will also exist on master.
ACK -- I'll take a look.
There's one more instance of this in test_reconnect_eviction() -- need to fix that too. I'll push a PR.
- Status changed from New to Fix Under Review
- Pull request ID set to 30986
- Status changed from Fix Under Review to Pending Backport
- Copied to Backport #42421: mimic: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" added
- Copied to Backport #42422: nautilus: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" added
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".