
Bug #48773

qa: scrub does not complete

Added by Patrick Donnelly 9 months ago. Updated 6 months ago.

Status:
In Progress
Priority:
Normal
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
pacific,octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
qa-failure, task(medium)
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-01-06T01:58:53.086 INFO:tasks.fwd_scrub.fs.[cephfs]:scrub status for tag:4a5ba7a2-3f38-424b-aa70-9c2bb711d766 - {'path': '/', 'tag': '4a5ba7a2-3f38-424b-aa70-9c2bb711d766', 'options': 'recursive,force'}
2021-01-06T01:58:53.087 ERROR:tasks.fwd_scrub.fs.[cephfs]:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20210105.221014/qa/tasks/fwd_scrub.py", line 40, in _run
    self.do_scrub()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20210105.221014/qa/tasks/fwd_scrub.py", line 57, in do_scrub
    self._scrub()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20210105.221014/qa/tasks/fwd_scrub.py", line 76, in _scrub
    return self._wait_until_scrub_complete(tag)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20210105.221014/qa/tasks/fwd_scrub.py", line 81, in _wait_until_scrub_complete
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (30) after waiting for 900 seconds
2021-01-06T01:58:57.267 INFO:tasks.daemonwatchdog.daemon_watchdog:thrasher.fs.[cephfs] failed
2021-01-06T01:58:57.268 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons

From: /ceph/teuthology-archive/pdonnell-2021-01-06_00:07:44-fs:workload-wip-pdonnell-testing-20210105.221014-distro-basic-smithi/5758061/teuthology.log
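The traceback originates in `_wait_until_scrub_complete` in `qa/tasks/fwd_scrub.py`, which polls scrub status inside teuthology's `contextutil` bounded-retry loop; `MaxWhileTries` fires once the retry budget is exhausted (30 tries over 900 seconds, i.e. roughly a 30-second sleep per attempt). A minimal sketch of that bounded-polling pattern — names other than `MaxWhileTries` are hypothetical, not the actual teuthology API:

```python
import time


class MaxWhileTries(Exception):
    """Raised when the polling loop exhausts its retries (as in the log above)."""


def wait_until(check, tries=30, sleep=30, _sleep=time.sleep):
    """Poll check() up to `tries` times, sleeping `sleep` seconds between
    attempts. With the defaults this gives up after 30 tries / 900 seconds,
    matching the numbers in the MaxWhileTries message in the traceback.
    Returns the attempt number on success.
    """
    for attempt in range(1, tries + 1):
        if check():
            return attempt
        _sleep(sleep)
    raise MaxWhileTries(
        f"reached maximum tries ({tries}) "
        f"after waiting for {tries * sleep} seconds")
```

If the MDS never reports the scrub with the expected tag as complete — which is what this bug describes — `check()` never returns True and the task dies with `MaxWhileTries`, triggering the daemon watchdog "BARK!" teardown seen in the log.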

The same underlying failure produces all of the following job failures:

Failure: Command failed on smithi071 with status 1: 'sudo rm -rf -- /home/ubuntu/cephtest/mnt.0/client.0/tmp'
1 jobs: ['5758061']
suites: ['clusters/1a5s-mds-1c-client-3node', 'conf/{client', 'distro/{centos_8}', 'fs:workload/{begin', 'mds', 'mon', 'mount/kclient/{mount', 'ms-die-on-skipped}}', 'objectstore-ec/bluestore-bitmap', 'omap_limit/10000', 'osd-asserts', 'osd}', 'overrides/{distro/stock/{k-stock', 'overrides/{frag_enable', 'ranks/5', 'rhel_8}', 'scrub/yes', 'session_timeout', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'workunit/suites/blogbench}}']

Crash: Command failed (workunit test fs/misc/multiple_rsync.sh) on smithi192 with status 23: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=d0ed162b51928c50f20cee111f8292828eda755e TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/fs/misc/multiple_rsync.sh'
ceph version 16.0.0-8719-gd0ed162b (d0ed162b51928c50f20cee111f8292828eda755e) pacific (dev)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7f6c914c5980]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19c) [0x7f6c926463ce]
 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f6c92646558]
 6: ceph-mon(+0x7a1d02) [0x558dad62fd02]
 7: (Monitor::~Monitor()+0x9) [0x558dad62fd49]
 8: main()
 9: __libc_start_main()
 10: _start()
2 jobs: ['5758073', '5758087']
suites intersection: ['clusters/1a5s-mds-1c-client-3node', 'conf/{client', 'fs:workload/{begin', 'mds', 'mon', 'omap_limit/10000', 'osd-asserts', 'osd}', 'overrides/{frag_enable', 'scrub/yes', 'session_timeout', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'workunit/fs/misc}}']
suites union: ['clusters/1a5s-mds-1c-client-3node', 'conf/{client', 'distro/{rhel_8}', 'distro/{ubuntu_latest}', 'fs:workload/{begin', 'k-testing}', 'mds', 'mon', 'mount/fuse', 'mount/kclient/{mount', 'ms-die-on-skipped}}', 'objectstore-ec/bluestore-comp-ec-root', 'objectstore-ec/bluestore-ec-root', 'omap_limit/10000', 'osd-asserts', 'osd}', 'overrides/{distro/testing/{flavor/centos_latest', 'overrides/{frag_enable', 'ranks/3', 'ranks/5', 'scrub/yes', 'session_timeout', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'workunit/fs/misc}}']

Timeout 3h running clone.client.0/qa/workunits/fs/misc/multiple_rsync.sh
1 jobs: ['5758031']
suites: ['clusters/1a5s-mds-1c-client-3node', 'conf/{client', 'distro/{rhel_8}', 'fs:workload/{begin', 'k-testing}', 'mds', 'mon', 'mount/kclient/{mount', 'ms-die-on-skipped}}', 'objectstore-ec/bluestore-ec-root', 'omap_limit/10000', 'osd-asserts', 'osd}', 'overrides/{distro/testing/{flavor/ubuntu_latest', 'overrides/{frag_enable', 'ranks/5', 'scrub/yes', 'session_timeout', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'workunit/fs/misc}}']

Failure: Command failed (workunit test fs/misc/multiple_rsync.sh) on smithi204 with status 23: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=d0ed162b51928c50f20cee111f8292828eda755e TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/fs/misc/multiple_rsync.sh'
1 jobs: ['5758003']
suites: ['clusters/1a5s-mds-1c-client-3node', 'conf/{client', 'distro/{rhel_8}', 'fs:workload/{begin', 'mds', 'mon', 'mount/fuse', 'objectstore-ec/bluestore-comp', 'omap_limit/10000', 'osd-asserts', 'osd}', 'overrides/{frag_enable', 'ranks/3', 'scrub/yes', 'session_timeout', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'workunit/fs/misc}}']

This seems to happen only with scrubs involving multiple active MDS daemons.


Related issues

Related to CephFS - Bug #48680: mds: scrubbing stuck "scrub active (0 inodes in the stack)" New

History

#1 Updated by Patrick Donnelly 9 months ago

  • Status changed from New to Triaged
  • Assignee set to Kotresh Hiremath Ravishankar

#2 Updated by Patrick Donnelly 9 months ago

  • Target version changed from v16.0.0 to v17.0.0
  • Backport set to pacific,octopus,nautilus

#3 Updated by Kotresh Hiremath Ravishankar 9 months ago

  • Status changed from Triaged to In Progress

#4 Updated by Patrick Donnelly 6 months ago

  • Related to Bug #48680: mds: scrubbing stuck "scrub active (0 inodes in the stack)" added

#5 Updated by Patrick Donnelly 6 months ago

Another: /ceph/teuthology-archive/pdonnell-2021-05-01_09:07:09-fs-wip-pdonnell-testing-20210501.040415-distro-basic-smithi/6087780/teuthology.log
