Bug #44381
Status: Closed
kclient: crash/hang during qa/workunits/fs/snaps/snaptest-capwb.sh
Description
2020-02-29T09:35:22.472 INFO:tasks.workunit:Running workunit fs/snaps/snaptest-capwb.sh...
2020-02-29T09:35:22.473 INFO:teuthology.orchestra.run.smithi105:workunit test fs/snaps/snaptest-capwb.sh> mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=1b30588872aa57834eb528ae5a31abd968ddcfed TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/fs/snaps/snaptest-capwb.sh
2020-02-29T09:35:22.495 INFO:tasks.workunit.client.0.smithi105.stderr:+ set -e
2020-02-29T09:35:22.496 INFO:tasks.workunit.client.0.smithi105.stderr:+ mkdir foo
2020-02-29T09:35:22.501 INFO:tasks.workunit.client.0.smithi105.stderr:+ ceph fs set cephfs allow_new_snaps true --yes-i-really-mean-it
...
2020-02-29T09:35:24.393 INFO:tasks.workunit.client.0.smithi105.stderr:enabled new snapshots
2020-02-29T09:35:52.133 INFO:teuthology.orchestra.run.smithi012:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:35:52.136 INFO:teuthology.orchestra.run.smithi105:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:35:52.140 INFO:teuthology.orchestra.run.smithi167:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:36:22.299 INFO:teuthology.orchestra.run.smithi012:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:36:22.302 INFO:teuthology.orchestra.run.smithi105:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:36:22.307 INFO:teuthology.orchestra.run.smithi167:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:36:52.347 INFO:teuthology.orchestra.run.smithi012:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:36:52.349 INFO:teuthology.orchestra.run.smithi105:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:36:52.353 INFO:teuthology.orchestra.run.smithi167:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:37:22.482 INFO:teuthology.orchestra.run.smithi012:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:37:22.485 INFO:teuthology.orchestra.run.smithi105:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:52:31.974 ERROR:paramiko.transport:Socket exception: No route to host (113)
2020-02-29T09:52:32.002 DEBUG:teuthology.orchestra.run:got remote process result: None
2020-02-29T09:52:32.002 INFO:tasks.workunit:Stopping ['fs/snaps'] on client.0...
2020-02-29T09:52:32.002 INFO:teuthology.orchestra.remote:Trying to reconnect to host
2020-02-29T09:52:32.003 DEBUG:teuthology.orchestra.connection:{'username': 'ubuntu', 'hostname': 'smithi105.front.sepia.ceph.com', 'timeout': 60}
2020-02-29T09:52:32.004 DEBUG:tasks.ceph:Missed logrotate, host unreachable
2020-02-29T09:52:35.078 DEBUG:teuthology.orchestra.remote:[Errno None] Unable to connect to port 22 on 172.21.15.105
2020-02-29T09:52:35.078 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_batrick_ceph_wip-pdonnell-testing-20200229.001503/qa/tasks/workunit.py", line 140, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/github.com_batrick_ceph_wip-pdonnell-testing-20200229.001503/qa/tasks/workunit.py", line 290, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 87, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 101, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 37, in resurrect_traceback
    reraise(*exc_info)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 24, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_batrick_ceph_wip-pdonnell-testing-20200229.001503/qa/tasks/workunit.py", line 426, in _run_tests
    args=args,
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 198, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 416, in run
    raise ConnectionLostError(command=quote(args), node=name)
ConnectionLostError: SSH connection to smithi105 was lost: 'sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'
From: /ceph/teuthology-archive/pdonnell-2020-02-29_02:56:38-kcephfs-wip-pdonnell-testing-20200229.001503-distro-basic-smithi/4811017/teuthology.log
See also:
Failure: SSH connection to smithi105 was lost: 'sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'
5 jobs: ['4811017', '4810943', '4810906', '4811165', '4811128']
suites intersection: ['clusters/1-mds-1-client.yaml', 'conf/{client.yaml', 'k-testing.yaml}', 'kcephfs/cephfs/{begin.yaml', 'kclient/{mount.yaml', 'log-config.yaml', 'mds.yaml', 'mon.yaml', 'ms-die-on-skipped.yaml}}', 'osd-asserts.yaml', 'osd.yaml}', 'overrides/{frag_enable.yaml', 'tasks/kclient_workunit_snaps.yaml}', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}']
suites union: ['clusters/1-mds-1-client.yaml', 'conf/{client.yaml', 'k-testing.yaml}', 'kcephfs/cephfs/{begin.yaml', 'kclient/{mount.yaml', 'log-config.yaml', 'mds.yaml', 'mon.yaml', 'ms-die-on-skipped.yaml}}', 'objectstore-ec/bluestore-bitmap.yaml', 'objectstore-ec/bluestore-comp.yaml', 'objectstore-ec/bluestore-ec-root.yaml', 'objectstore-ec/filestore-xfs.yaml', 'osd-asserts.yaml', 'osd.yaml}', 'overrides/{distro/testing/{flavor/centos_latest.yaml', 'overrides/{distro/testing/{flavor/ubuntu_latest.yaml', 'overrides/{frag_enable.yaml', 'tasks/kclient_workunit_snaps.yaml}', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}']
I think the final error message is misleading. We didn't yet get to the point of cleaning up the workunit directory.
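To illustrate why the reported command is misleading: the SSH transport died while the workunit itself was running (the host went unreachable), but the error only surfaces when teuthology issues its next command, the cleanup `rm -rf`, so that command ends up named in the failure. The following is a hypothetical sketch of that failure mode, not teuthology's actual code; class and method names here are made up for illustration.

```python
# Toy model (NOT teuthology source) of how a dead SSH connection blames
# the wrong command: the connection drops during one command, but the
# error is only raised when the *next* command is attempted.

class ConnectionLostError(Exception):
    def __init__(self, command, node):
        super().__init__(f"SSH connection to {node} was lost: {command!r}")
        self.command = command
        self.node = node

class FakeRemote:
    """Stand-in for a remote host; `connected` models the SSH transport."""
    def __init__(self, node):
        self.node = node
        self.connected = True

    def run(self, command):
        # Any command issued after the transport dropped fails here,
        # naming *this* command rather than the one that actually hung.
        if not self.connected:
            raise ConnectionLostError(command=command, node=self.node)
        return 0

remote = FakeRemote("smithi105")
remote.run("bash snaptest-capwb.sh")  # workunit starts fine
remote.connected = False              # host hangs/crashes mid-test
try:
    remote.run("sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0")
except ConnectionLostError as e:
    print(e)  # blames the cleanup command, not the workunit
```

So the `rm -rf` in the failure summary is just the first command attempted after the host became unreachable, consistent with the cleanup never having started.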
Updated by Patrick Donnelly about 4 years ago
Note: this appears to only happen with the testing kernel. Must be a regression!
Updated by Patrick Donnelly about 4 years ago
Another workunit failed same way: /ceph/teuthology-archive/pdonnell-2020-02-29_02:56:38-kcephfs-wip-pdonnell-testing-20200229.001503-distro-basic-smithi/4811054/teuthology.log
2020-02-29T10:03:24.280 INFO:tasks.workunit.client.0.smithi205.stderr:enabled new snapshots
2020-02-29T10:03:24.288 INFO:tasks.workunit.client.0.smithi205.stderr:+ echo x
2020-02-29T10:03:30.431 INFO:teuthology.orchestra.run.smithi159:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T10:03:30.435 INFO:teuthology.orchestra.run.smithi200:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T10:03:30.441 INFO:teuthology.orchestra.run.smithi205:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T21:45:57.858 DEBUG:teuthology.exit:Got signal 15; running 2 handlers...
2020-02-29T21:45:57.877 DEBUG:teuthology.task.console_log:Killing console logger for smithi159
2020-02-29T21:45:57.878 DEBUG:teuthology.task.console_log:Killing console logger for smithi205
2020-02-29T21:45:57.878 DEBUG:teuthology.task.console_log:Killing console logger for smithi200
2020-02-29T21:45:57.878 DEBUG:teuthology.task.console_log:Killing console logger for smithi159
2020-02-29T21:45:57.879 DEBUG:teuthology.task.console_log:Killing console logger for smithi205
2020-02-29T21:45:57.879 DEBUG:teuthology.task.console_log:Killing console logger for smithi200
2020-02-29T21:45:57.879 DEBUG:teuthology.exit:Finished running handlers
Updated by Jeff Layton about 4 years ago
I suspect this is related to the merging of:
[PATCH v3 0/6] ceph: don't request caps for idle open files
I've backed that series out of the testing branch for now, so we can see whether this problem goes away.
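For reference, one way to back a multi-patch series out of a branch without rewriting its history is a ranged `git revert`. The commit subjects and repo below are placeholders in a throwaway toy repo, not the real ceph-client testing branch:

```shell
# Toy demonstration (placeholder commits, not ceph-client history) of
# backing a two-commit series out of a branch with a ranged revert.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email test@example.com
git config user.name test

echo base > caps.c
git add caps.c
git commit -qm "base"
echo patch1 >> caps.c
git commit -qam "ceph: idle-caps series 1/2 (placeholder)"
echo patch2 >> caps.c
git commit -qam "ceph: idle-caps series 2/2 (placeholder)"

# Revert the series newest-first in one command; history is preserved,
# so the series can be re-applied later (e.g. once v5 lands).
git revert --no-edit HEAD~2..HEAD

cat caps.c   # back to just "base"
```

Reverting (rather than resetting) keeps the branch fast-forwardable for anyone already tracking it, which matters for a shared testing branch.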
Updated by Zheng Yan about 4 years ago
- Status changed from New to Closed
It's a bug in the v3 patches. The patches in the testing branch are v5, which should have fixed the bug.