Bug #18663

teuthology teardown hangs if kclient umount fails

Added by John Spray about 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://qa-proxy.ceph.com/teuthology/jspray-2017-01-25_02:52:36-multimds-wip-jcsp-testing-20170124-testing-basic-smithi/744603/teuthology.log

In this instance, we see a kernel umount fail:

2017-01-25T06:40:59.930 INFO:tasks.workunit:Stopping ['kernel_untar_build.sh'] on client.0...
2017-01-25T06:40:59.930 INFO:teuthology.orchestra.run.smithi093:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'
2017-01-25T06:41:00.294 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 83, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-jcsp-testing-20170124/qa/tasks/workunit.py", line 415, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 192, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 403, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 166, in wait
    label=self.label)
CommandFailedError: Command failed (workunit test kernel_untar_build.sh) on smithi093 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=84a02ccb5eb10a869ac608bf95973eefcf2f45bd TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/kernel_untar_build.sh'
2017-01-25T06:41:00.295 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-jcsp-testing-20170124/qa/tasks/workunit.py", line 125, in task
    config.get('subdir'), timeout=timeout)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-jcsp-testing-20170124/qa/tasks/workunit.py", line 273, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 83, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-jcsp-testing-20170124/qa/tasks/workunit.py", line 415, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 192, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 403, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 166, in wait
    label=self.label)
CommandFailedError: Command failed (workunit test kernel_untar_build.sh) on smithi093 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=84a02ccb5eb10a869ac608bf95973eefcf2f45bd TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/kernel_untar_build.sh'
2017-01-25T06:41:00.317 ERROR:teuthology.run_tasks: Sentry event: http://sentry.ceph.com/sepia/teuthology/?q=26ab6624dc1849f5b8a9619b435a6927
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-jcsp-testing-20170124/qa/tasks/workunit.py", line 125, in task
    config.get('subdir'), timeout=timeout)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-jcsp-testing-20170124/qa/tasks/workunit.py", line 273, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 83, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-jcsp-testing-20170124/qa/tasks/workunit.py", line 415, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 192, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 403, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 166, in wait
    label=self.label)
CommandFailedError: Command failed (workunit test kernel_untar_build.sh) on smithi093 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=84a02ccb5eb10a869ac608bf95973eefcf2f45bd TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/kernel_untar_build.sh'
2017-01-25T06:41:00.335 DEBUG:teuthology.run_tasks:Unwinding manager kclient
2017-01-25T06:41:00.757 INFO:tasks.kclient:Unmounting kernel clients...
2017-01-25T06:41:00.757 DEBUG:tasks.cephfs.kernel_mount:Unmounting client client.0...
2017-01-25T06:41:00.757 INFO:teuthology.orchestra.run.smithi093:Running: 'sudo umount /home/ubuntu/cephtest/mnt.0'
2017-01-25T06:41:00.868 INFO:teuthology.orchestra.run.smithi093.stderr:umount: /home/ubuntu/cephtest/mnt.0: target is busy
2017-01-25T06:41:00.868 INFO:teuthology.orchestra.run.smithi093.stderr:        (In some cases useful info about processes that
2017-01-25T06:41:00.868 INFO:teuthology.orchestra.run.smithi093.stderr:         use the device is found by lsof(8) or fuser(1).)
2017-01-25T06:41:00.870 ERROR:teuthology.run_tasks:Manager failed: kclient
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 159, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-jcsp-testing-20170124/qa/tasks/kclient.py", line 114, in task
    mount.umount()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-jcsp-testing-20170124/qa/tasks/cephfs/kernel_mount.py", line 97, in umount
    self.client_remote.run(args=cmd)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 192, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 403, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 166, in wait
    label=self.label)
CommandFailedError: Command failed on smithi093 with status 32: 'sudo umount /home/ubuntu/cephtest/mnt.0'
2017-01-25T06:41:00.871 DEBUG:teuthology.run_tasks:Unwinding manager ceph

Much later, when teuthology tries to tear down its cephtest directories, the find on smithi093 does not return until 11:59, more than three hours after the other nodes:

2017-01-25T08:44:02.962 INFO:teuthology.orchestra.run.smithi161.stdout:3146168    4 drwxr-xr-x   2 ubuntu   ubuntu       4096 Jan 25 08:44 /home/ubuntu/cephtest
2017-01-25T08:44:03.074 INFO:teuthology.orchestra.run.smithi027.stdout:20447670    4 drwxr-xr-x   2 ubuntu   ubuntu       4096 Jan 25 08:44 /home/ubuntu/cephtest
2017-01-25T11:59:17.986 INFO:teuthology.orchestra.run.smithi093.stdout: 52953407      4 drwxr-xr-x   3 ubuntu   ubuntu       4096 Jan 25 08:44 /home/ubuntu/cephtest
2017-01-25T11:59:17.987 INFO:teuthology.orchestra.run.smithi093.stderr:find: '/home/ubuntu/cephtest/mnt.0': Input/output error
2017-01-25T11:59:17.989 INFO:teuthology.orchestra.run.smithi093.stderr:rmdir: failed to remove '/home/ubuntu/cephtest': Directory not empty

Note that the teardown did not complete on its own: the long wait followed by EIO is the find proceeding only after I manually logged into the node and ran "umount -f" on the mount path. The umount -f did not itself succeed (it also got "target is busy"), but it was sufficient to dislodge the stuck find process.
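
For illustration only, here is a minimal sketch of a teardown step that mirrors the manual workaround above: try a plain umount, and if that fails, follow up with "umount -f" rather than leaving the mount in place. This is not the fix that was merged for this ticket (the ticket does not describe it). The function name umount_with_fallback, the injectable run parameter, and the 60-second timeout are all assumptions made for the example.

```python
import subprocess


def umount_with_fallback(mountpoint, run=subprocess.run, timeout=60):
    """Hypothetical helper: plain umount first, then 'umount -f' if it fails.

    Returns a list of (command, exit_status) tuples, one per attempt.
    A status of None means the command itself exceeded `timeout`.
    Non-zero exits are recorded, not raised, so a caller's teardown
    can continue and decide what to do with the result.
    """
    attempts = []
    commands = (
        ["sudo", "umount", mountpoint],
        ["sudo", "umount", "-f", mountpoint],  # forced fallback
    )
    for cmd in commands:
        try:
            status = run(cmd, timeout=timeout).returncode
        except subprocess.TimeoutExpired:
            status = None  # the umount command hung and was killed
        attempts.append((cmd, status))
        if status == 0:
            break  # unmounted cleanly, no need for the forced attempt
    return attempts
```

In the run reported here, both attempts would have returned a non-zero status ("target is busy"), so a caller could not treat the result as a successful unmount. The point of the forced attempt is only that, as observed above, it was enough to unblock the process stuck on the mount.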

History

#1 Updated by John Spray about 7 years ago

  • Status changed from New to Fix Under Review

#2 Updated by Zheng Yan about 7 years ago

  • Status changed from Fix Under Review to Resolved
