Bug #44381

kclient: crash/hang during qa/workunits/fs/snaps/snaptest-capwb.sh

Added by Patrick Donnelly 4 months ago. Updated 4 months ago.

Status:
Closed
Priority:
Urgent
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature:

Description

2020-02-29T09:35:22.472 INFO:tasks.workunit:Running workunit fs/snaps/snaptest-capwb.sh...
2020-02-29T09:35:22.473 INFO:teuthology.orchestra.run.smithi105:workunit test fs/snaps/snaptest-capwb.sh> mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=1b30588872aa57834eb528ae5a31abd968ddcfed TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/fs/snaps/snaptest-capwb.sh
2020-02-29T09:35:22.495 INFO:tasks.workunit.client.0.smithi105.stderr:+ set -e
2020-02-29T09:35:22.496 INFO:tasks.workunit.client.0.smithi105.stderr:+ mkdir foo
2020-02-29T09:35:22.501 INFO:tasks.workunit.client.0.smithi105.stderr:+ ceph fs set cephfs allow_new_snaps true --yes-i-really-mean-it
...
2020-02-29T09:35:24.393 INFO:tasks.workunit.client.0.smithi105.stderr:enabled new snapshots
2020-02-29T09:35:52.133 INFO:teuthology.orchestra.run.smithi012:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:35:52.136 INFO:teuthology.orchestra.run.smithi105:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:35:52.140 INFO:teuthology.orchestra.run.smithi167:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:36:22.299 INFO:teuthology.orchestra.run.smithi012:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:36:22.302 INFO:teuthology.orchestra.run.smithi105:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:36:22.307 INFO:teuthology.orchestra.run.smithi167:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:36:52.347 INFO:teuthology.orchestra.run.smithi012:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:36:52.349 INFO:teuthology.orchestra.run.smithi105:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:36:52.353 INFO:teuthology.orchestra.run.smithi167:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:37:22.482 INFO:teuthology.orchestra.run.smithi012:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:37:22.485 INFO:teuthology.orchestra.run.smithi105:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T09:52:31.974 ERROR:paramiko.transport:Socket exception: No route to host (113)
2020-02-29T09:52:32.002 DEBUG:teuthology.orchestra.run:got remote process result: None
2020-02-29T09:52:32.002 INFO:tasks.workunit:Stopping ['fs/snaps'] on client.0...
2020-02-29T09:52:32.002 INFO:teuthology.orchestra.remote:Trying to reconnect to host
2020-02-29T09:52:32.003 DEBUG:teuthology.orchestra.connection:{'username': 'ubuntu', 'hostname': 'smithi105.front.sepia.ceph.com', 'timeout': 60}
2020-02-29T09:52:32.004 DEBUG:tasks.ceph:Missed logrotate, host unreachable
2020-02-29T09:52:35.078 DEBUG:teuthology.orchestra.remote:[Errno None] Unable to connect to port 22 on 172.21.15.105
2020-02-29T09:52:35.078 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_batrick_ceph_wip-pdonnell-testing-20200229.001503/qa/tasks/workunit.py", line 140, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/github.com_batrick_ceph_wip-pdonnell-testing-20200229.001503/qa/tasks/workunit.py", line 290, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 87, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 101, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 37, in resurrect_traceback
    reraise(*exc_info)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 24, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_batrick_ceph_wip-pdonnell-testing-20200229.001503/qa/tasks/workunit.py", line 426, in _run_tests
    args=args,
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 198, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 416, in run
    raise ConnectionLostError(command=quote(args), node=name)
ConnectionLostError: SSH connection to smithi105 was lost: 'sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'

From: /ceph/teuthology-archive/pdonnell-2020-02-29_02:56:38-kcephfs-wip-pdonnell-testing-20200229.001503-distro-basic-smithi/4811017/teuthology.log

See also:

Failure: SSH connection to smithi105 was lost: 'sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'
5 jobs: ['4811017', '4810943', '4810906', '4811165', '4811128']
suites intersection: ['clusters/1-mds-1-client.yaml', 'conf/{client.yaml', 'k-testing.yaml}', 'kcephfs/cephfs/{begin.yaml', 'kclient/{mount.yaml', 'log-config.yaml', 'mds.yaml', 'mon.yaml', 'ms-die-on-skipped.yaml}}', 'osd-asserts.yaml', 'osd.yaml}', 'overrides/{frag_enable.yaml', 'tasks/kclient_workunit_snaps.yaml}', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}']
suites union: ['clusters/1-mds-1-client.yaml', 'conf/{client.yaml', 'k-testing.yaml}', 'kcephfs/cephfs/{begin.yaml', 'kclient/{mount.yaml', 'log-config.yaml', 'mds.yaml', 'mon.yaml', 'ms-die-on-skipped.yaml}}', 'objectstore-ec/bluestore-bitmap.yaml', 'objectstore-ec/bluestore-comp.yaml', 'objectstore-ec/bluestore-ec-root.yaml', 'objectstore-ec/filestore-xfs.yaml', 'osd-asserts.yaml', 'osd.yaml}', 'overrides/{distro/testing/{flavor/centos_latest.yaml', 'overrides/{distro/testing/{flavor/ubuntu_latest.yaml', 'overrides/{frag_enable.yaml', 'tasks/kclient_workunit_snaps.yaml}', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}']

I think the final error message is misleading. We had not yet reached the point of cleaning up the workunit directory.
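The related failures listed above share the same ConnectionLostError signature, so they can be collected by scanning archived teuthology.log files for that marker. A minimal sketch of such a scan, assuming local read access to the archive; the `scan_for_ssh_loss` helper name is hypothetical, and the marker string is taken from the log excerpt above:

```shell
# Hypothetical triage helper: list teuthology.log files that contain
# the SSH-loss signature seen in this failure. grep -l prints only the
# names of files with at least one match, one per line.
scan_for_ssh_loss() {
    grep -l "ConnectionLostError: SSH connection" "$@" 2>/dev/null
}
```

For example, `scan_for_ssh_loss /path/to/archive/*/teuthology.log` would print the per-job logs that hit this failure mode.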

History

#1 Updated by Patrick Donnelly 4 months ago

Note: this appears to only happen with the testing kernel. Must be a regression!

#2 Updated by Patrick Donnelly 4 months ago

Another workunit failed the same way: /ceph/teuthology-archive/pdonnell-2020-02-29_02:56:38-kcephfs-wip-pdonnell-testing-20200229.001503-distro-basic-smithi/4811054/teuthology.log

2020-02-29T10:03:24.280 INFO:tasks.workunit.client.0.smithi205.stderr:enabled new snapshots
2020-02-29T10:03:24.288 INFO:tasks.workunit.client.0.smithi205.stderr:+ echo x
2020-02-29T10:03:30.431 INFO:teuthology.orchestra.run.smithi159:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T10:03:30.435 INFO:teuthology.orchestra.run.smithi200:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T10:03:30.441 INFO:teuthology.orchestra.run.smithi205:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-02-29T21:45:57.858 DEBUG:teuthology.exit:Got signal 15; running 2 handlers...
2020-02-29T21:45:57.877 DEBUG:teuthology.task.console_log:Killing console logger for smithi159
2020-02-29T21:45:57.878 DEBUG:teuthology.task.console_log:Killing console logger for smithi205
2020-02-29T21:45:57.878 DEBUG:teuthology.task.console_log:Killing console logger for smithi200
2020-02-29T21:45:57.878 DEBUG:teuthology.task.console_log:Killing console logger for smithi159
2020-02-29T21:45:57.879 DEBUG:teuthology.task.console_log:Killing console logger for smithi205
2020-02-29T21:45:57.879 DEBUG:teuthology.task.console_log:Killing console logger for smithi200
2020-02-29T21:45:57.879 DEBUG:teuthology.exit:Finished running handlers

#3 Updated by Jeff Layton 4 months ago

I suspect this is related to the merging of:

[PATCH v3 0/6] ceph: don't request caps for idle open files

I've backed that series out of the testing branch for now, so we can see whether this problem goes away.

#4 Updated by Zheng Yan 4 months ago

  • Status changed from New to Closed

It's a bug in the v3 patches. The patches in the testing branch are v5, which should have fixed the bug.
