Bug #59626
pacific: FSMissing: File system xxxx does not exist in the map
Status: Closed
Description
- TestMirroring.test_cephfs_mirror_cancel_sync: https://pulpito.ceph.com/yuriw-2023-04-27_14:53:50-fs-pacific-release-distro-default-smithi/7256090/ (and other pacific runs).
- TestMDSMetrics.test_delayed_metrics: https://pulpito.ceph.com/yuriw-2023-04-27_14:53:50-fs-pacific-release-distro-default-smithi/7256111
The traceback shows the test setup trying to initialize a client mount; however, the file system is not in the fsmap.
2023-04-27T17:30:30.030 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2023-04-27T17:30:30.030 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_rishabh-d-dave_ceph_acbc33208f1a8ff19138ed7b704f4ebf04cece8c/qa/tasks/cephfs/test_mirroring.py", line 25, in setUp
2023-04-27T17:30:30.030 INFO:tasks.cephfs_test_runner:    super(TestMirroring, self).setUp()
2023-04-27T17:30:30.030 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_rishabh-d-dave_ceph_acbc33208f1a8ff19138ed7b704f4ebf04cece8c/qa/tasks/cephfs/cephfs_test_case.py", line 181, in setUp
2023-04-27T17:30:30.030 INFO:tasks.cephfs_test_runner:    self.mounts[i].mount_wait()
2023-04-27T17:30:30.031 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_rishabh-d-dave_ceph_acbc33208f1a8ff19138ed7b704f4ebf04cece8c/qa/tasks/cephfs/mount.py", line 409, in mount_wait
2023-04-27T17:30:30.031 INFO:tasks.cephfs_test_runner:    self.mount(**kwargs)
2023-04-27T17:30:30.031 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_rishabh-d-dave_ceph_acbc33208f1a8ff19138ed7b704f4ebf04cece8c/qa/tasks/cephfs/fuse_mount.py", line 44, in mount
2023-04-27T17:30:30.031 INFO:tasks.cephfs_test_runner:    self.setupfs(name=self.cephfs_name)
2023-04-27T17:30:30.031 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_rishabh-d-dave_ceph_acbc33208f1a8ff19138ed7b704f4ebf04cece8c/qa/tasks/cephfs/mount.py", line 161, in setupfs
2023-04-27T17:30:30.031 INFO:tasks.cephfs_test_runner:    self.fs.wait_for_daemons()
2023-04-27T17:30:30.032 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_rishabh-d-dave_ceph_acbc33208f1a8ff19138ed7b704f4ebf04cece8c/qa/tasks/cephfs/filesystem.py", line 1140, in wait_for_daemons
2023-04-27T17:30:30.032 INFO:tasks.cephfs_test_runner:    status = self.getinfo(refresh=True)
2023-04-27T17:30:30.032 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_rishabh-d-dave_ceph_acbc33208f1a8ff19138ed7b704f4ebf04cece8c/qa/tasks/cephfs/filesystem.py", line 549, in getinfo
2023-04-27T17:30:30.032 INFO:tasks.cephfs_test_runner:    fsmap = status.get_fsmap_byname(self.name)
2023-04-27T17:30:30.032 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_rishabh-d-dave_ceph_acbc33208f1a8ff19138ed7b704f4ebf04cece8c/qa/tasks/cephfs/filesystem.py", line 131, in get_fsmap_byname
2023-04-27T17:30:30.033 INFO:tasks.cephfs_test_runner:    raise FSMissing(name)
2023-04-27T17:30:30.033 INFO:tasks.cephfs_test_runner:tasks.cephfs.filesystem.FSMissing: File system backup_fs does not exist in the map
Updated by Venky Shankar 12 months ago
The pacific code is a bit different than other branches where the interesting bit is:
pacific:
def mount(self, mntopts=[], createfs=True, check_status=True, **kwargs):
    self.update_attrs(**kwargs)
    self.assert_and_log_minimum_mount_details()
    self.setup_netns()
    if createfs:
        # TODO: don't call setupfs() from within mount(), since it's
        # absurd. The proper order should be: create FS first and then
        # call mount().
        self.setupfs(name=self.cephfs_name)
    try:
        return self._mount(mntopts, check_status)
    except RuntimeError:
        # Catch exceptions by the mount() logic (i.e. not remote command
        # failures) and ensure the mount is not left half-up.
        # Otherwise we might leave a zombie mount point that causes
        # anyone traversing cephtest/ to get hung up on.
        log.warning("Trying to clean up after failed mount")
        self.umount_wait(force=True)
        raise
quincy:
def mount(self, mntopts=[], check_status=True, **kwargs):
    self.update_attrs(**kwargs)
    self.assert_and_log_minimum_mount_details()
    self.setup_netns()
    try:
        return self._mount(mntopts, check_status)
    except RuntimeError:
        # Catch exceptions by the mount() logic (i.e. not remote command
        # failures) and ensure the mount is not left half-up.
        # Otherwise we might leave a zombie mount point that causes
        # anyone traversing cephtest/ to get hung up on.
        log.warning("Trying to clean up after failed mount")
        self.umount_wait(force=True)
        raise
The `if createfs:` branch trickles into a check where the fs name is looked up in the fsmap.
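The fsmap lookup at the end of that chain can be illustrated with a minimal, self-contained sketch (the FSStatus class and fsmap shape here are simplified toys, not the real qa/tasks/cephfs/filesystem.py code): with createfs true, mount() ultimately reaches get_fsmap_byname(), which raises FSMissing for any name absent from the map.

```python
class FSMissing(Exception):
    def __init__(self, name):
        super().__init__(f"File system {name} does not exist in the map")


class FSStatus:
    """Toy stand-in for the teuthology FSStatus wrapper."""

    def __init__(self, fsmap):
        # fsmap: {fs_name: fs_info} -- simplified from the real mon fsmap
        self.fsmap = fsmap

    def get_fsmap_byname(self, name):
        # Raises FSMissing when the named file system is not in the map,
        # which is exactly the failure mode in the traceback above.
        if name not in self.fsmap:
            raise FSMissing(name)
        return self.fsmap[name]


status = FSStatus({"cephfs": {"id": 1}})
status.get_fsmap_byname("cephfs")         # present: fine
try:
    status.get_fsmap_byname("backup_fs")  # deleted fs: raises
except FSMissing as e:
    print(e)
```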
Updated by Venky Shankar 12 months ago
This commit removed the createfs boolean:
commit db0b85b44076ef952ca71033f3b5b819b8d768af
Author: Rishabh Dave <ridave@redhat.com>
Date:   Fri Feb 19 21:33:45 2021 +0530

    qa/cephfs: remove parameter createfs from mount classes

    It's better to get rid of this paramter since it doesn't actually
    create any new Ceph file system.

    Signed-off-by: Rishabh Dave <ridave@redhat.com>
This change was made long ago but the tests started failing only now, so it's not the underlying reason for the failure.
Updated by Rishabh Dave 12 months ago
Apparently, this exception is expected because backup_fs is deleted a little before the traceback is printed and has not been re-created since.
2023-04-27T17:30:07.930 INFO:teuthology.orchestra.run.smithi086.stderr:backup_fs marked not joinable; MDS cannot join the cluster. All MDS ranks marked failed.
2023-04-27T17:30:07.942 DEBUG:teuthology.orchestra.run.smithi086:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs rm backup_fs --yes-i-really-mean-it
2023-04-27T17:30:07.951 INFO:tasks.ceph.mds.a.smithi086.stdout:starting mds.a at
2023-04-27T17:30:08.944 DEBUG:teuthology.orchestra.run.smithi086:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd pool rm backup_fs_metadata backup_fs_metadata --yes-i-really-really-mean-it
2023-04-27T17:30:09.946 INFO:teuthology.orchestra.run.smithi086.stderr:pool 'backup_fs_metadata' removed
2023-04-27T17:30:09.964 DEBUG:teuthology.orchestra.run.smithi086:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd pool rm backup_fs_data backup_fs_data --yes-i-really-really-mean-it
2023-04-27T17:30:10.945 INFO:teuthology.orchestra.run.smithi086.stderr:pool 'backup_fs_data' removed
Here's the cephfs_test_case.py code where mount_wait() from the traceback fails:
# Mount the requested number of clients
for i in range(0, self.CLIENTS_REQUIRED):
    self.mounts[i].mount_wait()
Just below this is the block that creates backup_fs:
if self.REQUIRE_BACKUP_FILESYSTEM:
    if not self.REQUIRE_FILESYSTEM:
        self.skipTest("backup filesystem requires a primary filesystem as well")
    self.fs.mon_manager.raw_cluster_cmd('fs', 'flag', 'set',
        'enable_multiple', 'true', '--yes-i-really-mean-it')
    self.backup_fs = self.mds_cluster.newfs(name="backup_fs")
    self.backup_fs.wait_for_daemons()
So if backup_fs hasn't been created yet, this exception is natural. The question that arises is: why is backup_fs expected when it hasn't been created yet? It is expected because the mount object's cephfs_name attribute has been set to it. I know this because this cephfs_name is passed to Filesystem.__init__() in CephFSMount.setupfs().
def setupfs(self, name=None):
    if name is None and self.fs is not None:
        # Previous mount existed, reuse the old name
        name = self.fs.name
    self.fs = Filesystem(self.ctx, name=name)
    log.info('Wait for MDS to reach steady state...')
    self.fs.wait_for_daemons()
    log.info('Ready to start {}...'.format(type(self).__name__))
And then this Filesystem instance is used to call wait_for_daemons(), which in turn calls getinfo(), which in turn calls get_fsmap_byname(). get_fsmap_byname() is where the exception occurs. So the bug lies in the mount object's attribute cephfs_name being incorrectly set to backup_fs.
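A hedged toy sketch of that stale-attribute behaviour (ToyMount is a hypothetical stand-in, not the real CephFSMount): a cephfs_name set by one test via mount kwargs survives on the mount object into the next test's mount call, where it would then be fed to setupfs().

```python
class ToyMount:
    """Hypothetical minimal mount object mimicking the kwargs handling."""

    def __init__(self):
        self.cephfs_name = None  # None means "use the default file system"

    def update_attrs(self, cephfs_name=None, **kwargs):
        # Like the real update_attrs(): only overwrite when a value is given,
        # so an earlier setting silently persists across calls.
        if cephfs_name is not None:
            self.cephfs_name = cephfs_name

    def mount(self, **kwargs):
        self.update_attrs(**kwargs)
        # pacific would now do: self.setupfs(name=self.cephfs_name)
        return self.cephfs_name


m = ToyMount()
m.mount(cephfs_name="backup_fs")  # a TestMirroring test picks the secondary fs
# ... backup_fs is deleted in teardown, but the attribute survives ...
print(m.mount())                  # the next setUp() still sees "backup_fs"
```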
Perhaps this is a case of unclean teardown. Just after backup_fs is deleted from the cluster, the teuthology.log contains the entry for the completion of test test_cephfs_mirror_cancel_mirroring_and_readd.
2023-04-27T17:30:11.959 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_cancel_mirroring_and_readd (tasks.cephfs.test_mirroring.TestMirroring) ... ok
And this test indeed sets the mount object's file system to backup_fs:
From TestMirroring.test_cephfs_mirror_cancel_mirroring_and_readd:
self.mount_b.mount_wait(cephfs_name=self.secondary_fs_name)
From TestMirroring.setUp():
self.secondary_fs_name = self.backup_fs.name
This still doesn't explain why this failure hasn't been seen yet on quincy, reef and main.
============================
Mount details are saved and restored at the beginning and end (respectively) of every test case. So teardown occurs, but it is not clean. The unclean part is in setupfs(), specifically the following part of it:
# Previous mount existed, reuse the old name
name = self.fs.name
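That reuse branch can be sketched with toy classes (hypothetical names, under the assumption described above): once a previous Filesystem object is attached to the mount, calling setupfs() with no name silently resurrects the stale name even though the file system it refers to has been deleted.

```python
class ToyFilesystem:
    """Hypothetical stand-in for qa/tasks/cephfs Filesystem."""

    def __init__(self, name):
        self.name = name


class ToyMount:
    """Hypothetical stand-in for CephFSMount, reduced to the reuse branch."""

    def __init__(self):
        self.fs = None

    def setupfs(self, name=None):
        if name is None and self.fs is not None:
            # Previous mount existed, reuse the old name -- even if that
            # file system has since been removed from the cluster.
            name = self.fs.name
        self.fs = ToyFilesystem(name)
        return self.fs.name


m = ToyMount()
m.setupfs(name="backup_fs")  # previous test mounted backup_fs
# backup_fs is then deleted, yet the next setupfs() reuses its name:
print(m.setupfs())
```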
setupfs() is not called from mount() from quincy onwards, which is why we don't see this bug after pacific. Backporting commit db0b85b44076ef952ca71033f3b5b819b8d768af should fix this issue. For this failure, this is the root cause. Why do pacific tests fail now and not before? It's due to this PR: https://github.com/ceph/ceph/pull/50923/files.
Updated by Venky Shankar 12 months ago
... and also doesn't explain why we are seeing this failure now. The last pacific run https://tracker.ceph.com/projects/cephfs/wiki/Pacific#2023-Apr-13 didn't have this failure.
Updated by Venky Shankar 12 months ago
Rishabh Dave wrote:
setupfs() is not called from mount() from quincy onwards, which is why we don't see this bug after pacific. Backporting commit db0b85b44076ef952ca71033f3b5b819b8d768af should fix this issue. For this failure, this is the root cause. Why do pacific tests fail now and not before? It's due to this PR: https://github.com/ceph/ceph/pull/50923/files.
Right. I am testing by reverting this PR, but running into teuthology infra issues :/
Updated by Rishabh Dave 12 months ago
- Subject changed from pacific: Test failure: test_cephfs_mirror_cancel_sync (tasks.cephfs.test_mirroring.TestMirroring) to pacific: FSMissing: File system fs1 does not exist in the map
- Description updated (diff)
Updated by Rishabh Dave 12 months ago
- Status changed from New to Fix Under Review
- Priority changed from Normal to High
- Pull request ID set to 51344
Updated by Rishabh Dave 12 months ago
- Subject changed from pacific: FSMissing: File system fs1 does not exist in the map to pacific: FSMissing: File system xxxx does not exist in the map
Updated by Yuri Weinstein 12 months ago
Updated by Venky Shankar 12 months ago
- Status changed from Fix Under Review to Resolved