Bug #10702

ceph-qa-suite: hung client-recovery task in nightlies

Added by Greg Farnum about 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Urgent
Source:
Q/A
Regression:
No
Severity:
3 - minor

Description

http://pulpito.ceph.com/teuthology-2015-01-28_23:04:02-fs-master-testing-basic-multi/729695/

I gathered the logs from both machines as well. I haven't looked in great detail, but it looks like the client got stuck trying to flush things to the MDS (which at first glance it should have been able to do) and then the test didn't proceed. These tests normally pass, though, so something is different...

Associated revisions

Revision 03b0e106 (diff)
Added by John Spray about 9 years ago

tasks/cephfs: fix fuse force unmount

This was broken in the case of multiple
mounts, and in the case of stuck mounts.

Fixes: #10702
Signed-off-by: John Spray <>

History

#1 Updated by Zheng Yan about 9 years ago

It hung at:

    def test_reconnect_timeout(self):
        # Reconnect timeout
        # =================
        # Check that if I stop an MDS and a client goes away, the MDS waits
        # for the reconnect period
        self.fs.mds_stop()
        self.fs.mds_fail()

        mount_a_client_id = self.mount_a.get_global_id()
        self.mount_a.umount_wait(force=True)

umount_wait() hung at _mountpoint_exists()

    def umount(self):
        try:
            log.info('Running fusermount -u on {name}...'.format(name=self.client_remote.name))
            self.client_remote.run(
                args=[
                    'sudo',
                    'fusermount',
                    '-u',
                    self.mountpoint,
                ],
            )
        except run.CommandFailedError:
            log.info('Failed to unmount ceph-fuse on {name}, aborting...'.format(name=self.client_remote.name))
            # abort the fuse mount, killing all hung processes
            self.client_remote.run(
                args=[
                    'if', 'test', '-e', '/sys/fs/fuse/connections/*/abort',
                    run.Raw(';'), 'then',
                    'echo',
                    '1',
                    run.Raw('>'),
                    run.Raw('/sys/fs/fuse/connections/*/abort'),
                    run.Raw(';'), 'fi',
                ],
            )
            # make sure it's unmounted
            if self._mountpoint_exists():

_mountpoint_exists() executes 'ls -d /home/ubuntu/cephtest/mnt.1', but the MDS has already stopped, so the ls blocks inside the dead FUSE mount.

The hang happens only when the umount fails:

failed to unmount /home/ubuntu/cephtest/mnt.1: Device or resource busy

#2 Updated by Greg Farnum about 9 years ago

Isn't the first umount supposed to fail, and then the "except run.CommandFailedError:" block forcibly cleans it up?

Although I think we can probably switch up mountpoint_exists to be based off of "df" or "mount" parsing instead of an "ls", which will help avoid the hangs.
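
For illustration, a minimal sketch of that kind of check, reading /proc/mounts (the kernel's own mount table, equivalent to what "mount" prints) instead of running an "ls". The function name is hypothetical, and in teuthology this would run on the client via client_remote.run rather than locally:

    def mountpoint_is_mounted(mountpoint):
        # Read the kernel mount table instead of stat'ing the mountpoint.
        # /proc/mounts is served by the kernel and never dereferences the
        # FUSE mountpoint itself, so this cannot block on a dead MDS the
        # way 'ls -d <mountpoint>' can.
        with open('/proc/mounts') as mounts:
            for line in mounts:
                # /proc/mounts fields: device mountpoint fstype options ...
                fields = line.split()
                if len(fields) >= 2 and fields[1] == mountpoint:
                    return True
        return False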

#3 Updated by Zheng Yan about 9 years ago

No, the first umount is supposed to succeed, but it fails with the error "Device or resource busy".

I think the simplest fix is to remove the self._mountpoint_exists() check.

#4 Updated by John Spray about 9 years ago

It's weird that the fusermount -u is failing - usually if you have an offline MDS and an idle fuse mount and you fusermount -u it, then the fusermount call will succeed, but the ceph-fuse process won't terminate until it can talk to the MDS.

That's the flow that this code usually sees, anyway. I don't know why that isn't happening here :-/

I agree with Zheng that the _mountpoint_exists check needs to not be there, because when fusermount fails we'll block like this. But we may also need to add an exception handler around the umount call, so that we don't bail out if the mount point doesn't exist (see the sketch below).
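
Roughly the shape being described - a hypothetical sketch, not the actual fix from revision 03b0e106. umount_wait and _abort_fuse_connections are illustrative names; run and log are the same names used in the quoted task code:

    def umount_wait(self, force=False):
        # Hypothetical flow: on a forced unmount, abort the FUSE
        # connections first (see the loop sketched in note #5), then
        # tolerate the mount point already being gone rather than
        # probing it with an 'ls' that can hang.
        if force:
            self._abort_fuse_connections()
        try:
            self.umount()
        except run.CommandFailedError:
            # The mount may already have been torn down by the abort;
            # treat that as success instead of bailing out.
            log.info('umount failed, assuming mount is already gone')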

#5 Updated by John Spray about 9 years ago

Ah, having seen this fail in an interactive run, I now also realise that the write to /sys/fs/fuse/connections/*/abort is bogus in the case of multiple FUSE mounts; it needs to be a loop over any connections that exist.
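
A minimal sketch of such a loop, assuming it runs as root on the client; the function name is hypothetical (the real fix landed in revision 03b0e106):

    import glob

    def abort_all_fuse_connections():
        # Abort every FUSE connection, not just whatever a single shell
        # glob happens to match. Writing '1' to a connection's abort file
        # forcibly terminates it, unblocking any processes hung on it.
        for abort_path in glob.glob('/sys/fs/fuse/connections/*/abort'):
            with open(abort_path, 'w') as f:
                f.write('1')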

#6 Updated by Greg Farnum about 9 years ago

  • Status changed from New to Pending Backport
  • Assignee set to John Spray

Merged to master in commit:0dce67d6bb83495b9d7f0c6cdfd9cd4bf193c749. We might also need to "backport" it to hammer, but I'm checking on whether we're merging those branches together first.

#7 Updated by Greg Farnum about 9 years ago

  • Status changed from Pending Backport to Resolved

Sage will merge the branches as long as he's doing so for Ceph.
