Project

General

Profile

Bug #16886

multimds: kclient hang (?) in tests

Added by Patrick Donnelly about 3 years ago. Updated 6 months ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
08/01/2016
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
multimds
Pull request ID:

Description

There are strange pauses which are showing up in several tests for the kclient:

2016-07-23T20:36:24.517 INFO:tasks.workunit:Running workunit fs/misc/dirfrag.sh...
2016-07-23T20:36:24.517 INFO:teuthology.orchestra.run.mira066:Running (workunit test fs/misc/dirfrag.sh): 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=a0294e64507a7916fdd9707ae22ba40b0d7b65d1 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 5h /home/ubuntu/cephtest/workunit.client.0/fs/misc/dirfrag.sh'
2016-07-23T20:36:24.643 INFO:tasks.workunit.client.0.mira066.stdout:creating folder hierarchy
2016-07-23T20:40:39.985 INFO:tasks.workunit.client.0.mira066.stdout:created hierarchy, now cleaning up
2016-07-24T01:36:24.635 INFO:tasks.workunit:Stopping ['fs/misc'] on client.0...
2016-07-24T01:36:24.635 INFO:teuthology.orchestra.run.mira066:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/workunit.client.0 /home/ubuntu/cephtest/clone'
2016-07-24T01:36:24.913 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 83, in __exit__
    for result in self:
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/workunit.py", line 404, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/remote.py", line 196, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 394, in run
    r.wait()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 121, in wait
    label=self.label)
CommandFailedError: Command failed (workunit test fs/misc/dirfrag.sh) on mira066 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=a0294e64507a7916fdd9707ae22ba40b0d7b65d1 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 5h /home/ubuntu/cephtest/workunit.client.0/fs/misc/dirfrag.sh'
2016-07-29T21:42:37.528 INFO:tasks.workunit:Running workunits matching suites/blogbench.sh on client.0...
2016-07-29T21:42:37.528 INFO:tasks.workunit:Running workunit suites/blogbench.sh...
2016-07-29T21:42:37.528 INFO:teuthology.orchestra.run.mira061:Running (workunit test suites/blogbench.sh): 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=e7cfdc842e731ba964c72a8b3fd07459a1d27c39 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/workunit.client.0/suites/blogbench.sh'
2016-07-30T00:42:37.635 INFO:tasks.workunit.client.0.mira061.stdout:getting blogbench
2016-07-30T00:42:37.673 INFO:tasks.workunit.client.0.mira061.stdout:blogbench-1.0/
2016-07-30T00:42:37.673 INFO:tasks.workunit.client.0.mira061.stdout:blogbench-1.0/m4/
...
2016-07-30T00:42:37.730 INFO:tasks.workunit.client.0.mira061.stderr:Connecting to download.ceph.com (download.ceph.com)|173.236.253.173|:80... connected.
2016-07-30T00:42:37.730 INFO:tasks.workunit.client.0.mira061.stderr:HTTP request sent, awaiting response... 200 OK
2016-07-30T00:42:37.730 INFO:tasks.workunit.client.0.mira061.stderr:Length: 103355 (101K) [application/octet-stream]
2016-07-30T00:42:37.730 INFO:tasks.workunit.client.0.mira061.stderr:Saving to: ‘blogbench-1.0.tar.bz2’
2016-07-30T00:42:37.731 INFO:tasks.workunit.client.0.mira061.stderr:
2016-07-30T00:42:37.731 INFO:tasks.workunit.client.0.mira061.stderr:     0K .......... .......... .......... .......... .......... 49% 4.43M 0s
2016-07-30T00:42:37.731 INFO:tasks.workunit.client.0.mira061.stderr:    50K .......... .......... .......... .......... .......... 99% 5.73M 0s
2016-07-30T00:42:37.731 INFO:tasks.workunit.client.0.mira061.stderr:   100K                                                       100% 1779G=0.02s
2016-07-30T00:42:37.732 INFO:tasks.workunit.client.0.mira061.stderr:
2016-07-30T00:42:37.732 INFO:tasks.workunit.client.0.mira061.stderr:2016-07-30 04:42:37 (5.04 MB/s) - ‘blogbench-1.0.tar.bz2’ saved [103355/103355]
2016-07-30T00:42:37.732 INFO:tasks.workunit.client.0.mira061.stderr:
2016-07-30T00:42:37.732 INFO:tasks.workunit:Stopping ['suites/blogbench.sh'] on client.0...

Also:

http://qa-proxy.ceph.com/teuthology/pdonnell-2016-07-29_12:44:23-multimds-master---basic-mira/340842/
Timeout 3h running suites/blogbench.sh
1 jobs: ['340842']
suites: ['clusters/3-mds.yaml', 'debug/mds_client.yaml', 'fs/btrfs.yaml', 'inline/no.yaml', 'mount/kclient.yaml', 'multimds/basic/{ceph/base.yaml', 'overrides/whitelist_wrongly_marked_down.yaml', 'tasks/suites_blogbench.yaml}']

2016-07-18T22:36:20.009 INFO:tasks.workunit.client.0.mira054.stdout:5/995: creat d5/db/dc/d9f/db1/f134 x:0 0 0
2016-07-18T22:36:20.017 INFO:tasks.workunit.client.0.mira054.stdout:5/996: link d5/db/dc/d18/f41 d5/db/dc/d16/d68/dbb/dc3/dc6/f135 0
2016-07-18T22:36:20.017 INFO:tasks.workunit.client.0.mira054.stdout:5/997: write d5/db/dc/d16/d68/dbb/f92 [4623915,122047] 0
2016-07-18T22:36:20.106 INFO:tasks.workunit.client.0.mira054.stdout:5/998: dwrite d5/db/dc/d16/d89/d97/f122 [0,4194304] 0
2016-07-18T22:36:20.116 INFO:tasks.workunit.client.0.mira054.stdout:5/999: symlink d5/db/dc/d4e/d62/l136 0
2016-07-19T01:34:46.305 INFO:tasks.workunit.client.0.mira054.stderr:/home/ubuntu/cephtest/workunit.client.0/suites/fsstress.sh: line 1: 17105 Terminated              $command
2016-07-19T01:34:46.306 INFO:tasks.workunit:Stopping ['suites/fsstress.sh'] on client.0...
2016-07-19T15:05:13.028 INFO:tasks.workunit.client.0.mira112.stdout:removed ‘linux-4.0.5/include/linux/cpu_pm.h’
2016-07-19T15:05:13.028 INFO:tasks.workunit.client.0.mira112.stdout:removed ‘linux-4.0.5/include/linux/bcd.h’
2016-07-19T15:05:13.030 INFO:tasks.workunit.client.0.mira112.stdout:removed ‘linux-4.0.5/include/linux/shmem_fs.h’
2016-07-19T15:05:13.030 INFO:tasks.workunit.client.0.mira112.stdout:removed ‘linux-4.0.5/include/linux/pm_clock.h’
2016-07-19T15:05:13.031 INFO:tasks.workunit.client.0.mira112.stdout:removed ‘linux-4.0.5/include/linux/leds_pwm.h’
2016-07-19T15:05:13.031 INFO:tasks.workunit.client.0.mira112.stdout:removed ‘linux-4.0.5/include/linux/timerfd.h’
2016-07-19T15:05:13.031 INFO:tasks.workunit.client.0.mira112.stdout:removed ‘linux-4.0.5/include/linux/jz4740-adc.h’
2016-07-19T15:05:13.031 INFO:tasks.workunit.client.0.mira112.stdout:removed ‘linux-4.0.5/include/linux/omapfb.h’
2016-07-19T15:05:13.031 INFO:tasks.workunit.client.0.mira112.stdout:removed ‘linux-4.0.5/include/linux/marvell_phy.h’
2016-07-19T15:05:13.032 INFO:tasks.workunit.client.0.mira112.stdout:removed ‘linux-4.0.5/include/linux/hid-sensor-hub.h’
2016-07-19T15:05:13.032 INFO:tasks.workunit.client.0.mira112.stdout:removed ‘linux-4.0.5/include/linux/frontswap.h’
2016-07-19T15:05:13.032 INFO:tasks.workunit.client.0.mira112.stdout:removed ‘linux-4.0.5/include/linux/cgroup.h’
2016-07-19T17:37:29.506 INFO:tasks.workunit.client.0.mira112.stdout:r
2016-07-19T17:37:29.506 INFO:tasks.workunit:Stopping ['kernel_untar_build.sh'] on client.0...
2016-07-19T17:37:29.507 INFO:teuthology.orchestra.run.mira112:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/workunit.client.0 /home/ubuntu/cephtest/clone'

Also:

http://pulpito.ceph.com/pdonnell-2016-07-21_13:20:27-multimds-master---basic-mira/327456/
1 jobs: ['327456']
suites: ['clusters/3-mds.yaml', 'debug/mds_client.yaml', 'fs/btrfs.yaml', 'inline/no.yaml', 'mount/kclient.yaml', 'multimds/basic/{ceph/base.yaml', 'overrides/whitelist_wrongly_marked_down.yaml', 'tasks/kernel_untar_build.yaml}']

http://pulpito.ceph.com/pdonnell-2016-07-29_08:28:00-multimds-master---basic-mira/339899/
1 jobs: ['339899']
suites: ['clusters/9-mds.yaml', 'debug/mds_client.yaml', 'fs/btrfs.yaml', 'inline/no.yaml', 'mount/kclient.yaml', 'multimds/basic/{ceph/base.yaml', 'overrides/whitelist_wrongly_marked_down.yaml', 'tasks/kernel_untar_build.yaml}']

http://pulpito.ceph.com/pdonnell-2016-07-29_12:44:23-multimds-master---basic-mira/340753/
1 jobs: ['340753']
suites: ['clusters/3-mds.yaml', 'debug/mds_client.yaml', 'fs/btrfs.yaml', 'inline/no.yaml', 'mount/kclient.yaml', 'multimds/basic/{ceph/base.yaml', 'overrides/whitelist_wrongly_marked_down.yaml', 'tasks/kernel_untar_build.yaml}']

Note: not every one of these kernel_untar_build failures has a pause at the end of the test log. For example:

http://qa-proxy.ceph.com/teuthology/pdonnell-2016-07-29_08:28:00-multimds-master---basic-mira/339899/teuthology.log

has the pause prior to unpacking the tarball:

2016-07-29T16:46:25.488 INFO:tasks.workunit:Running workunit kernel_untar_build.sh...
2016-07-29T16:46:25.488 INFO:teuthology.orchestra.run.mira037:Running (workunit test kernel_untar_build.sh): 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=e7cfdc842e731ba964c72a8b3fd07459a1d27c39 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/workunit.client.0/kernel_untar_build.sh'
2016-07-29T19:46:25.582 INFO:tasks.workunit.client.0.mira037.stdout:linux-4.0.5/
2016-07-29T19:46:25.582 INFO:tasks.workunit.client.0.mira037.stdout:linux-4.0.5/.gitignore
2016-07-29T19:46:25.583 INFO:tasks.workunit.client.0.mira037.stdout:linux-4.0.5/.mailmap
2016-07-29T19:46:25.583 INFO:tasks.workunit.client.0.mira037.stdout:linux-4.0.5/COPYING
...
2016-07-29T19:52:45.834 INFO:tasks.workunit.client.0.mira037.stdout:removed ‘linux-4.0.5/include/linux/mfd/pcf50633/core.h’
2016-07-29T19:52:45.834 INFO:tasks.workunit.client.0.mira037.stdout:removed ‘linux-4.0.5/include/linux/mfd/pcf50633/backlight.h’
2016-07-29T19:52:45.835 INFO:tasks.workunit.client.0.mira037.stdout:removed ‘linux-4.0.5/include/linux/mfd/pcf50633/mbc.h’
2016-07-29T19:52:45.836 INFO:tasks.workunit.client.0.mira037.stdout:removed directory: ‘linux-4.0.5/include/linux/mfd/pcf50633’
2016-07-29T19:52:45.837 INFO:tasks.workunit.client.0.mira037.stdout:removed ‘linux-4.0.5/include/linux/mfd/htc-pasic3.h’
2016-07-29T19:52:45.838 INFO:tasks.workunit.client.0.mira037.stdout:removed ‘linux-4.0.5/include/linux/mfd/si476x-platform.h’
2016-07-29T19:52:45.860 INFO:tasks.workunit.client.0.mira037.stdout:removed ‘linux-4.0.5/include/linux/mfd/tps65912.h’
2016-07-29T19:52:45.860 INFO:tasks.workunit.client.0.mira037.stdout:removed ‘linux-4.0.5/include/linux/mfd/ezx-pcap.h’
2016-07-29T19:52:45.881 INFO:tasks.workunit.client.0.mira037.stdout:removed ‘linux-4.0.5/include/linux/mfd/wm8400-private.h’
2016-07-29T19:52:45.881 INFO:tasks.workunit.client.0.mira037.stdout:removed ‘linux-
2016-07-29T19:52:45.948 INFO:tasks.workunit:Stopping ['kernel_untar_build.sh'] on client.0...

History

#1 Updated by Patrick Donnelly about 3 years ago

Well I feel silly. This is actually more general but wasn't obvious by how I had organized the failures. I'm going to amend the description/title...

#2 Updated by Patrick Donnelly about 3 years ago

  • Subject changed from multimds: kclient hang (?) in tests during recursive unlink of kernel build to multimds: kclient hang (?) in tests
  • Description updated (diff)

Updated title/description.

#3 Updated by Patrick Donnelly about 3 years ago

Another blogbench:

http://pulpito.ceph.com/pdonnell-2016-07-29_08:28:00-multimds-master---basic-mira/339886/

Dead: 2016-07-29T19:06:17.967 INFO:teuthology.orchestra.run.mira034:Running: 'sudo umount /home/ubuntu/cephtest/mnt.0'
1 jobs: ['339886']
suites: ['clusters/3-mds.yaml', 'debug/mds_client.yaml', 'fs/btrfs.yaml', 'inline/yes.yaml', 'mount/kclient.yaml', 'multimds/basic/{ceph/base.yaml', 'overrides/whitelist_wrongly_marked_down.yaml', 'tasks/suites_blogbench.yaml}']

#4 Updated by John Spray about 3 years ago

  • Assignee set to Zheng Yan

#5 Updated by Zheng Yan almost 3 years ago

  • Status changed from New to Need More Info

#6 Updated by John Spray almost 3 years ago

  • Priority changed from Normal to High
  • Target version set to v12.0.0

#7 Updated by Zheng Yan almost 3 years ago

fix one bug https://github.com/ceph/ceph-client/commit/2a3d8aad521306c6537c67c518ea7c4023c74f12

If you see "failing to respond to capability release" health warning when requests hang. It's likely fixed by this patch

#8 Updated by Zheng Yan over 2 years ago

  • Status changed from Need More Info to Can't reproduce

#9 Updated by Patrick Donnelly 6 months ago

  • Category deleted (90)
  • Labels (FS) multimds added

Also available in: Atom PDF