Bug #10702
ceph-qa-suite: hung client-recovery task in nightlies
Description
http://pulpito.ceph.com/teuthology-2015-01-28_23:04:02-fs-master-testing-basic-multi/729695/
I gathered the logs from both machines as well. I haven't looked in great detail, but it looks like the client got stuck trying to flush things to the MDS (which at first glance it should have) and then the test didn't proceed. These tests normally pass though, so something is different...
Associated revisions
tasks/cephfs: fix fuse force unmount
This was broken in the case of multiple
mounts, and in the case of stuck mounts.
Fixes: #10702
Signed-off-by: John Spray <john.spray@redhat.com>
History
#1 Updated by Zheng Yan about 9 years ago
It hung at:

    def test_reconnect_timeout(self):
        # Reconnect timeout
        # =================
        # Check that if I stop an MDS and a client goes away, the MDS waits
        # for the reconnect period
        self.fs.mds_stop()
        self.fs.mds_fail()
        mount_a_client_id = self.mount_a.get_global_id()
        self.mount_a.umount_wait(force=True)
umount_wait() hung at _mountpoint_exists()
    def umount(self):
        try:
            log.info('Running fusermount -u on {name}...'.format(name=self.client_remote.name))
            self.client_remote.run(
                args=[
                    'sudo',
                    'fusermount', '-u', self.mountpoint,
                ],
            )
        except run.CommandFailedError:
            log.info('Failed to unmount ceph-fuse on {name}, aborting...'.format(name=self.client_remote.name))
            # abort the fuse mount, killing all hung processes
            self.client_remote.run(
                args=[
                    'if', 'test', '-e', '/sys/fs/fuse/connections/*/abort',
                    run.Raw(';'), 'then',
                    'echo', '1', run.Raw('>'), run.Raw('/sys/fs/fuse/connections/*/abort'),
                    run.Raw(';'), 'fi',
                ],
            )
        # make sure its unmounted
        if self._mountpoint_exists():
_mountpoint_exists() executes 'ls -d /home/ubuntu/cephtest/mnt.1', but the MDS has already stopped, so the ls blocks.
The hang happens only when the umount fails:
failed to unmount /home/ubuntu/cephtest/mnt.1: Device or resource busy
#2 Updated by Greg Farnum about 9 years ago
Isn't the first umount supposed to fail, and then the "except run.CommandFailedError:" block forcibly cleans it up?
Although I think we can probably switch _mountpoint_exists over to parsing "df" or "mount" output instead of running an "ls", which will help avoid the hangs.
#3 Updated by Zheng Yan about 9 years ago
No, the first umount is supposed to succeed, but it fails with the error "Device or resource busy".
I think the simplest fix is remove the self._mountpoint_exists() check.
#4 Updated by John Spray about 9 years ago
It's weird that the fusermount -u is failing - usually if you have an offline MDS and an idle fuse mount and you fusermount -u it, the fusermount call will succeed, but the ceph-fuse process won't terminate until it can talk to the MDS.
That's the flow that this code usually sees, anyway. I don't know why that isn't happening here :-/
I agree with Zheng that the _mountpoint_exists thing needs to not be there, because when fusermount fails we'll block like this, but we may need to also add an exception handler around the umount call so that if the mount point doesn't exist we don't bail out.
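That extra exception handler might be sketched like this (names are illustrative; the real task code runs commands on the remote via self.client_remote.run rather than calling a local function):

```python
import errno

def cleanup_mountpoint(remove_fn, mountpoint):
    """Remove the mount directory, tolerating it already being gone.

    If fusermount failed and we had to abort the FUSE connection, the
    mountpoint directory may have vanished; since that is the end state
    we want anyway, ENOENT is swallowed instead of failing the teardown.
    """
    try:
        remove_fn(mountpoint)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise
```

Any other error (e.g. EBUSY) still propagates, so real failures are not masked.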
#5 Updated by John Spray about 9 years ago
Ah, having seen this fail in an interactive run, I now also realise that the '/sys/fs/fuse/connections/*/abort' redirection is bogus in the case of multiple FUSE mounts; it needs to be a loop over every connection that exists.
#6 Updated by Greg Farnum about 9 years ago
- Status changed from New to Pending Backport
- Assignee set to John Spray
Merged to master in commit:0dce67d6bb83495b9d7f0c6cdfd9cd4bf193c749. We might also need to "backport" it to hammer, but I'm checking on if we're merging those branches together first.
#7 Updated by Greg Farnum about 9 years ago
- Status changed from Pending Backport to Resolved
Sage will merge the branches as long as he's doing so for Ceph.