Bug #59626 (closed)

pacific: FSMissing: File system xxxx does not exist in the map

Added by Venky Shankar 12 months ago. Updated 12 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Testing
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Two separate jobs fail due to this issue.

The traceback shows the test setup trying to initialize a client mount; however, the file system is not in the FSMap.

2023-04-27T17:30:30.030 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2023-04-27T17:30:30.030 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_rishabh-d-dave_ceph_acbc33208f1a8ff19138ed7b704f4ebf04cece8c/qa/tasks/cephfs/test_mirroring.py", line 25, in setUp
2023-04-27T17:30:30.030 INFO:tasks.cephfs_test_runner:    super(TestMirroring, self).setUp()
2023-04-27T17:30:30.030 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_rishabh-d-dave_ceph_acbc33208f1a8ff19138ed7b704f4ebf04cece8c/qa/tasks/cephfs/cephfs_test_case.py", line 181, in setUp
2023-04-27T17:30:30.030 INFO:tasks.cephfs_test_runner:    self.mounts[i].mount_wait()
2023-04-27T17:30:30.031 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_rishabh-d-dave_ceph_acbc33208f1a8ff19138ed7b704f4ebf04cece8c/qa/tasks/cephfs/mount.py", line 409, in mount_wait
2023-04-27T17:30:30.031 INFO:tasks.cephfs_test_runner:    self.mount(**kwargs)
2023-04-27T17:30:30.031 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_rishabh-d-dave_ceph_acbc33208f1a8ff19138ed7b704f4ebf04cece8c/qa/tasks/cephfs/fuse_mount.py", line 44, in mount
2023-04-27T17:30:30.031 INFO:tasks.cephfs_test_runner:    self.setupfs(name=self.cephfs_name)
2023-04-27T17:30:30.031 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_rishabh-d-dave_ceph_acbc33208f1a8ff19138ed7b704f4ebf04cece8c/qa/tasks/cephfs/mount.py", line 161, in setupfs
2023-04-27T17:30:30.031 INFO:tasks.cephfs_test_runner:    self.fs.wait_for_daemons()
2023-04-27T17:30:30.032 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_rishabh-d-dave_ceph_acbc33208f1a8ff19138ed7b704f4ebf04cece8c/qa/tasks/cephfs/filesystem.py", line 1140, in wait_for_daemons
2023-04-27T17:30:30.032 INFO:tasks.cephfs_test_runner:    status = self.getinfo(refresh=True)
2023-04-27T17:30:30.032 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_rishabh-d-dave_ceph_acbc33208f1a8ff19138ed7b704f4ebf04cece8c/qa/tasks/cephfs/filesystem.py", line 549, in getinfo
2023-04-27T17:30:30.032 INFO:tasks.cephfs_test_runner:    fsmap = status.get_fsmap_byname(self.name)
2023-04-27T17:30:30.032 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_rishabh-d-dave_ceph_acbc33208f1a8ff19138ed7b704f4ebf04cece8c/qa/tasks/cephfs/filesystem.py", line 131, in get_fsmap_byname
2023-04-27T17:30:30.033 INFO:tasks.cephfs_test_runner:    raise FSMissing(name)
2023-04-27T17:30:30.033 INFO:tasks.cephfs_test_runner:tasks.cephfs.filesystem.FSMissing: File system backup_fs does not exist in the map
Actions #1

Updated by Venky Shankar 12 months ago

The pacific code is a bit different from the other branches; the interesting bit is:

pacific:

    def mount(self, mntopts=[], createfs=True, check_status=True, **kwargs):
        self.update_attrs(**kwargs)
        self.assert_and_log_minimum_mount_details()

        self.setup_netns()

        if createfs:
            # TODO: don't call setupfs() from within mount(), since it's
            # absurd. The proper order should be: create FS first and then
            # call mount().
            self.setupfs(name=self.cephfs_name)

        try:
            return self._mount(mntopts, check_status)
        except RuntimeError:
            # Catch exceptions by the mount() logic (i.e. not remote command
            # failures) and ensure the mount is not left half-up.
            # Otherwise we might leave a zombie mount point that causes
            # anyone traversing cephtest/ to get hung up on.
            log.warning("Trying to clean up after failed mount")
            self.umount_wait(force=True)
            raise

quincy:

    def mount(self, mntopts=[], check_status=True, **kwargs):
        self.update_attrs(**kwargs)
        self.assert_and_log_minimum_mount_details()

        self.setup_netns()

        try:
            return self._mount(mntopts, check_status)
        except RuntimeError:
            # Catch exceptions by the mount() logic (i.e. not remote command
            # failures) and ensure the mount is not left half-up.
            # Otherwise we might leave a zombie mount point that causes
            # anyone traversing cephtest/ to get hung up on.
            log.warning("Trying to clean up after failed mount")
            self.umount_wait(force=True)
            raise

The `if createfs` branch trickles down into a check that the fs name exists in the FSMap.
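
For illustration, here is a minimal, self-contained sketch of the check that the createfs branch ultimately runs. It is not the actual qa/tasks code and the FSMap entry layout is approximated; on pacific the path is mount() -> setupfs() -> Filesystem.wait_for_daemons() -> getinfo() -> get_fsmap_byname(), and get_fsmap_byname() is where FSMissing is raised, while quincy onwards never reaches it at mount time:

    # Simplified stand-in for the relevant bits of qa/tasks/cephfs/filesystem.py;
    # the FSMap entry layout is an approximation for demonstration only.
    class FSMissing(Exception):
        def __init__(self, ident):
            super().__init__(f"File system {ident} does not exist in the map")

    def get_fsmap_byname(filesystems, name):
        # Raise when no file system in the map carries the requested name.
        for fs in filesystems:
            if fs["mdsmap"]["fs_name"] == name:
                return fs
        raise FSMissing(name)

    fsmap = [{"mdsmap": {"fs_name": "cephfs"}}]   # backup_fs already removed
    try:
        get_fsmap_byname(fsmap, "backup_fs")
    except FSMissing as e:
        print(e)   # File system backup_fs does not exist in the map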

Actions #2

Updated by Venky Shankar 12 months ago

This commit removed the createfs boolean:

commit db0b85b44076ef952ca71033f3b5b819b8d768af
Author: Rishabh Dave <ridave@redhat.com>
Date:   Fri Feb 19 21:33:45 2021 +0530

    qa/cephfs: remove parameter createfs from mount classes

    It's better to get rid of this paramter since it doesn't actually create
    any new Ceph file system.

    Signed-off-by: Rishabh Dave <ridave@redhat.com>

This change was made a while back, but the tests only started failing now, so it's not the underlying reason for the failure.

Actions #3

Updated by Rishabh Dave 12 months ago

Apparently, this exception is expected because backup_fs is deleted a little before the traceback is printed and has not been re-created since.

2023-04-27T17:30:07.930 INFO:teuthology.orchestra.run.smithi086.stderr:backup_fs marked not joinable; MDS cannot join the cluster. All MDS ranks marked failed.
2023-04-27T17:30:07.942 DEBUG:teuthology.orchestra.run.smithi086:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs rm backup_fs --yes-i-really-mean-it
2023-04-27T17:30:07.951 INFO:tasks.ceph.mds.a.smithi086.stdout:starting mds.a at
2023-04-27T17:30:08.944 DEBUG:teuthology.orchestra.run.smithi086:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd pool rm backup_fs_metadata backup_fs_metadata --yes-i-really-really-mean-it
2023-04-27T17:30:09.946 INFO:teuthology.orchestra.run.smithi086.stderr:pool 'backup_fs_metadata' removed
2023-04-27T17:30:09.964 DEBUG:teuthology.orchestra.run.smithi086:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd pool rm backup_fs_data backup_fs_data --yes-i-really-really-mean-it
2023-04-27T17:30:10.945 INFO:teuthology.orchestra.run.smithi086.stderr:pool 'backup_fs_data' removed

Here's the cephfs_test_case.py code where the mount_wait() from the traceback fails -

            # Mount the requested number of clients
            for i in range(0, self.CLIENTS_REQUIRED):
                self.mounts[i].mount_wait()

Just below this is the block that creates backup_fs -

        if self.REQUIRE_BACKUP_FILESYSTEM:
            if not self.REQUIRE_FILESYSTEM:
                self.skipTest("backup filesystem requires a primary filesystem as well")
            self.fs.mon_manager.raw_cluster_cmd('fs', 'flag', 'set',
                                                'enable_multiple', 'true',
                                                '--yes-i-really-mean-it')
            self.backup_fs = self.mds_cluster.newfs(name="backup_fs")
            self.backup_fs.wait_for_daemons()

So if backup_fs hasn't been created yet, this exception is natural. The question that arises is: why is backup_fs expected when it hasn't been created yet? It is expected because the mount object's cephfs_name attribute has been set to it. I know this because this cephfs_name is passed to Filesystem.__init__() in CephFSMount.setupfs().

    def setupfs(self, name=None):
        if name is None and self.fs is not None:
            # Previous mount existed, reuse the old name
            name = self.fs.name
        self.fs = Filesystem(self.ctx, name=name)
        log.info('Wait for MDS to reach steady state...')
        self.fs.wait_for_daemons()
        log.info('Ready to start {}...'.format(type(self).__name__))

And then this Filesystem instance is used to call wait_for_daemons(), which in turn calls getinfo(), which in turn calls get_fsmap_byname(). get_fsmap_byname() is where the exception occurs. So the bug lies with the mount object's cephfs_name attribute being incorrectly set to backup_fs.

Perhaps this is a case of unclean teardown. Just after backup_fs is deleted from the cluster, the teuthology.log contains the entry for the completion of the test test_cephfs_mirror_cancel_mirroring_and_readd.

2023-04-27T17:30:11.959 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_cancel_mirroring_and_readd (tasks.cephfs.test_mirroring.TestMirroring) ... ok

And this test indeed sets the mount object's file system to backup_fs -
From TestMirroring.test_cephfs_mirror_cancel_mirroring_and_readd

        self.mount_b.mount_wait(cephfs_name=self.secondary_fs_name)

From TestMirroring.setUp() -

        self.secondary_fs_name = self.backup_fs.name

This still doesn't explain why this failure hasn't been seen yet on quincy, reef and main.
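
Putting those pieces together, the following is an illustrative model only -- FakeMount and FSMAP are made-up stand-ins, not the real teuthology classes, though cephfs_name, update_attrs() and mount_wait() mirror the attribute and method names quoted above -- of how a cephfs_name set by one test on pacific leaks into the next test's setUp and blows up once backup_fs is gone:

    class FSMissing(Exception):
        pass

    FSMAP = {"cephfs", "backup_fs"}         # file systems currently in the map

    class FakeMount:
        def __init__(self):
            self.cephfs_name = None         # None means "the default file system"

        def update_attrs(self, cephfs_name=None, **kwargs):
            if cephfs_name is not None:
                self.cephfs_name = cephfs_name   # sticks around after the test

        def mount_wait(self, **kwargs):
            self.update_attrs(**kwargs)
            # pacific only: mount() calls setupfs(name=self.cephfs_name),
            # which ends in the FSMap lookup modelled here.
            name = self.cephfs_name or "cephfs"
            if name not in FSMAP:
                raise FSMissing(f"File system {name} does not exist in the map")

    m = FakeMount()
    m.mount_wait(cephfs_name="backup_fs")   # previous mirroring test: fine
    FSMAP.discard("backup_fs")              # backup_fs torn down afterwards
    try:
        m.mount_wait()                      # next setUp reuses the stale name
    except FSMissing as e:
        print(e)                            # File system backup_fs does not exist in the map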

============================

Mount details are saved and restored at the beginning and end (respectively) of every test case. So teardown occurs, but it is not clean. The unclean part is in setupfs(), specifically the following part of it -

            # Previous mount existed, reuse the old name
            name = self.fs.name

setupfs() is not being called from quincy onwards, which is why we don't see this bug after pacific. Backporting commit db0b85b44076ef952ca71033f3b5b819b8d768af should fix this issue; for this failure, that is the root cause. Why do pacific tests fail now when they did not before? It's due to this PR - https://github.com/ceph/ceph/pull/50923/files.
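
For reference, a backported pacific mount() would roughly mirror the quincy version quoted earlier in this ticket -- the sketch below shows the direction only, not the actual backport:

    def mount(self, mntopts=[], check_status=True, **kwargs):
        # Sketch: same shape as the quincy code above. The createfs parameter
        # and the setupfs() call are gone, so a stale cephfs_name / self.fs can
        # no longer trigger an FSMap lookup (and hence FSMissing) at mount time.
        self.update_attrs(**kwargs)
        self.assert_and_log_minimum_mount_details()

        self.setup_netns()

        try:
            return self._mount(mntopts, check_status)
        except RuntimeError:
            # Clean up so we don't leave a half-up mount behind.
            log.warning("Trying to clean up after failed mount")
            self.umount_wait(force=True)
            raise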

Actions #4

Updated by Venky Shankar 12 months ago

... and it also doesn't explain why we are seeing this failure now. The last pacific run https://tracker.ceph.com/projects/cephfs/wiki/Pacific#2023-Apr-13 didn't have this failure.

Actions #5

Updated by Venky Shankar 12 months ago

Rishabh Dave wrote:

setupfs() is not being called from quincy onwards, which is why we don't see this bug after pacific. Backporting commit db0b85b44076ef952ca71033f3b5b819b8d768af should fix this issue; for this failure, that is the root cause. Why do pacific tests fail now when they did not before? It's due to this PR - https://github.com/ceph/ceph/pull/50923/files.

Right. I am testing by reverting this PR, but running into teuthology infra issues :/

Actions #6

Updated by Rishabh Dave 12 months ago

  • Subject changed from pacific: Test failure: test_cephfs_mirror_cancel_sync (tasks.cephfs.test_mirroring.TestMirroring) to pacific: FSMissing: File system fs1 does not exist in the map
  • Description updated (diff)
Actions #7

Updated by Rishabh Dave 12 months ago

  • Status changed from New to Fix Under Review
  • Priority changed from Normal to High
  • Pull request ID set to 51344
Actions #8

Updated by Rishabh Dave 12 months ago

  • Subject changed from pacific: FSMissing: File system fs1 does not exist in the map to pacific: FSMissing: File system xxxx does not exist in the map
Actions #10

Updated by Venky Shankar 12 months ago

  • Status changed from Fix Under Review to Resolved