Bug #39651

open

qa: test_kill_mdstable fails unexpectedly

Added by Rishabh Dave almost 5 years ago. Updated almost 2 years ago.

Status: In Progress
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: pacific,octopus,nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): qa-suite
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I get the following traceback while running test_kill_mdstable (https://github.com/ceph/ceph/blob/master/qa/tasks/cephfs/test_snapshots.py#L41):

File "/home/rishabh/repos/ceph/pr-27718/qa/tasks/cephfs/test_snapshots.py", line 76, in test_kill_mdstable
    self.delete_mds_coredump(rank0['name']);
  File "/home/rishabh/repos/ceph/pr-27718/qa/tasks/cephfs/cephfs_test_case.py", line 268, in delete_mds_coredump
    ], stdout=StringIO())
  File "../qa/tasks/vstart_runner.py", line 346, in run
    proc.wait()
  File "../qa/tasks/vstart_runner.py", line 179, in wait
    raise CommandFailedError(self.args, self.exitstatus)
CommandFailedError: Command failed with status 1: ['cd', '|/usr/lib/systemd', Raw('&&'), 'ls', Raw('|'), 'xargs', 'file']

core_dir (https://github.com/ceph/ceph/blob/master/qa/tasks/cephfs/cephfs_test_case.py#L257) does not contain a usable directory path. Its value is "|/usr/lib/systemd", which is wrong for two reasons: the string, which is supposed to be a path, starts with a vertical bar, and, more importantly, "/usr/lib/systemd" is not a directory that holds core files. The lines of code that follow nevertheless use it as the target directory for the "cd" command, which is why the command in the traceback above fails with status 1.

Actions #1

Updated by Patrick Donnelly almost 5 years ago

  • Assignee set to Rishabh Dave
  • Target version set to v15.0.0
  • Start date deleted (05/09/2019)
  • Component(FS) qa-suite added
Actions #2

Updated by Patrick Donnelly almost 5 years ago

  • Subject changed from test_kill_mdstable fails unexpectedly to qa: test_kill_mdstable fails unexpectedly
  • Description updated (diff)
  • Source set to Q/A
Actions #3

Updated by Rishabh Dave over 4 years ago

Part of the problem is that the leading pipe character wasn't trimmed from the sysctl output when extracting the path:

$ sysctl -n kernel.core_pattern
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e

This is easy to fix; I've raised a PR for it: https://github.com/ceph/ceph/pull/31619.
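To illustrate why trimming the pipe alone doesn't yield a core directory, here is a minimal sketch. It is not what the PR does. The helper name core_dir_from_pattern and the "/var/tmp/cores" example path are hypothetical. It relies on the documented kernel behaviour that a core_pattern starting with "|" names a program that the core is piped to, so there is no directory of core files to clean in that case.

```python
import os


def core_dir_from_pattern(core_pattern):
    """Return the directory core files are written to, or None.

    Hypothetical helper, not part of the Ceph QA code.
    """
    pattern = core_pattern.strip()
    if pattern.startswith('|'):
        # The kernel pipes the core to a handler program, so the
        # pattern does not name a directory containing core files.
        return None
    # os.path.dirname() returns '' for a bare file name such as "core".
    return os.path.dirname(pattern) or None


# The value reported by sysctl in this comment.
pattern = "|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e"

# What the test code computed before the pipe was trimmed.
naive = os.path.dirname(pattern)

# What it computes once only the pipe is trimmed.
trimmed = os.path.dirname(pattern[1:])

print(repr(naive))
print(repr(trimmed))
print(core_dir_from_pattern(pattern))
print(core_dir_from_pattern("/var/tmp/cores/core.%p"))
```

With the pipe trimmed, the computed directory is the one containing the systemd-coredump binary, not a directory of core files, which matches the second half of the problem described next.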

The next part of the issue is that the assert in the following code (from qa/tasks/cephfs/cephfs_test_case.py) fails:

    def delete_mds_coredump(self, daemon_id):
        # delete coredump file, otherwise teuthology.internal.coredump will
        # catch it later and treat it as a failure.
        path = self.mds_cluster.mds_daemons[daemon_id].remote.run(args=[
            "sudo", "sysctl", "-n", "kernel.core_pattern"],
            stdout=StringIO()).stdout.getvalue().strip()
        if path[0] == '|':
            path = path[1:]
        core_dir = os.path.dirname(path)

        if core_dir:  # Non-default core_pattern with a directory in it
            # We have seen a core_pattern that looks like it's from teuthology's coredump
            # task, so proceed to clear out the core file
            log.info("Clearing core from directory: {0}".format(core_dir))

            # Verify that we see the expected single coredump
            ls_proc = self.mds_cluster.mds_daemons[daemon_id].remote.run(args=[
                "cd", core_dir, run.Raw('&&'),
                "sudo", "ls", run.Raw('|'), "sudo", "xargs", "file" 
            ], stdout=StringIO())
            cores = [l.partition(":")[0]
                     for l in ls_proc.stdout.getvalue().strip().split("\n")
                     if re.match(r'.*ceph-mds.* -i +{0}'.format(daemon_id), l)]

            log.info("Enumerated cores: {0}".format(cores))
            self.assertEqual(len(cores), 1)

There's no "ceph-mds" entry in the ls_proc output, so cores is empty and the assert fails. I don't know the significance of the core file here. @Patrick @Zheng, do you have any suggestions or hints?
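The filtering step can be reproduced in isolation with the sketch below. The function name enumerate_cores and both sample outputs are hypothetical; they only approximate what "file" prints and were not captured from a real run. The regular expression and the list comprehension are copied from the code above.

```python
import re


def enumerate_cores(file_output, daemon_id):
    """Pick the core files of one MDS out of `ls | xargs file` output."""
    return [l.partition(":")[0]
            for l in file_output.strip().split("\n")
            if re.match(r'.*ceph-mds.* -i +{0}'.format(daemon_id), l)]


# Hypothetical output for a directory holding one core from "ceph-mds -i a".
sample_core_dir = (
    "core.1234: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), "
    "SVR4-style, from 'ceph-mds -i a'\n"
    "notes.txt: ASCII text\n"
)

# Hypothetical output for /usr/lib/systemd, which holds binaries, not cores.
sample_systemd_dir = (
    "systemd: ELF 64-bit LSB pie executable, x86-64\n"
    "systemd-coredump: ELF 64-bit LSB pie executable, x86-64\n"
)

print(enumerate_cores(sample_core_dir, "a"))
print(enumerate_cores(sample_systemd_dir, "a"))
```

Under these assumptions the first listing yields exactly one core, which satisfies assertEqual(len(cores), 1), while the second yields an empty list, which is the failure seen here.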

Actions #4

Updated by Rishabh Dave over 4 years ago

  • Status changed from New to In Progress
Actions #5

Updated by Rishabh Dave over 4 years ago

I talked with Zheng. He told me that many tests cannot be run successfully on a vstart cluster, and this is one of them.

Actions #6

Updated by Patrick Donnelly about 4 years ago

  • Target version changed from v15.0.0 to v16.0.0
Actions #7

Updated by Patrick Donnelly over 3 years ago

  • Target version changed from v16.0.0 to v17.0.0
  • Backport set to pacific,octopus,nautilus
Actions #8

Updated by Patrick Donnelly almost 2 years ago

  • Target version deleted (v17.0.0)