
Bug #48773

qa: scrub does not complete

Added by Patrick Donnelly 9 months ago. Updated 6 months ago.

Status:
In Progress
Priority:
Normal
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
pacific,octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
qa-failure, task(medium)
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-01-06T01:58:53.086 INFO:tasks.fwd_scrub.fs.[cephfs]:scrub status for tag:4a5ba7a2-3f38-424b-aa70-9c2bb711d766 - {'path': '/', 'tag': '4a5ba7a2-3f38-424b-aa70-9c2bb711d766', 'options': 'recursive,force'}
2021-01-06T01:58:53.087 ERROR:tasks.fwd_scrub.fs.[cephfs]:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20210105.221014/qa/tasks/fwd_scrub.py", line 40, in _run
    self.do_scrub()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20210105.221014/qa/tasks/fwd_scrub.py", line 57, in do_scrub
    self._scrub()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20210105.221014/qa/tasks/fwd_scrub.py", line 76, in _scrub
    return self._wait_until_scrub_complete(tag)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20210105.221014/qa/tasks/fwd_scrub.py", line 81, in _wait_until_scrub_complete
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (30) after waiting for 900 seconds
2021-01-06T01:58:57.267 INFO:tasks.daemonwatchdog.daemon_watchdog:thrasher.fs.[cephfs] failed
2021-01-06T01:58:57.268 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons

From: /ceph/teuthology-archive/pdonnell-2021-01-06_00:07:44-fs:workload-wip-pdonnell-testing-20210105.221014-distro-basic-smithi/5758061/teuthology.log
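The traceback originates in `_wait_until_scrub_complete` in `qa/tasks/fwd_scrub.py`, which polls scrub status inside teuthology's `contextutil` bounded-retry loop; `MaxWhileTries` fires once the retry budget is exhausted (30 tries over 900 seconds, i.e. roughly a 30-second sleep per attempt). A minimal sketch of that bounded-polling pattern — names other than `MaxWhileTries` are hypothetical, not the actual teuthology API:

```python
import time


class MaxWhileTries(Exception):
    """Raised when the polling loop exhausts its retries (as in the log above)."""


def wait_until(check, tries=30, sleep=30, _sleep=time.sleep):
    """Poll check() up to `tries` times, sleeping `sleep` seconds between
    attempts. With the defaults this gives up after 30 tries / 900 seconds,
    matching the numbers in the MaxWhileTries message in the traceback.
    Returns the attempt number on success.
    """
    for attempt in range(1, tries + 1):
        if check():
            return attempt
        _sleep(sleep)
    raise MaxWhileTries(
        f"reached maximum tries ({tries}) "
        f"after waiting for {tries * sleep} seconds")
```

If the MDS never reports the scrub with the expected tag as complete — which is what this bug describes — `check()` never returns True and the task dies with `MaxWhileTries`, triggering the daemon watchdog "BARK!" teardown seen in the log.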

The same underlying failure produces all of the following job failures:

Failure: Command failed on smithi071 with status 1: 'sudo rm -rf -- /home/ubuntu/cephtest/mnt.0/client.0/tmp'
1 jobs: ['5758061']
suites: ['clusters/1a5s-mds-1c-client-3node', 'conf/{client', 'distro/{centos_8}', 'fs:workload/{begin', 'mds', 'mon', 'mount/kclient/{mount', 'ms-die-on-skipped}}', 'objectstore-ec/bluestore-bitmap', 'omap_limit/10000', 'osd-asserts', 'osd}', 'overrides/{distro/stock/{k-stock', 'overrides/{frag_enable', 'ranks/5', 'rhel_8}', 'scrub/yes', 'session_timeout', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'workunit/suites/blogbench}}']

Crash: Command failed (workunit test fs/misc/multiple_rsync.sh) on smithi192 with status 23: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=d0ed162b51928c50f20cee111f8292828eda755e TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/fs/misc/multiple_rsync.sh'
ceph version 16.0.0-8719-gd0ed162b (d0ed162b51928c50f20cee111f8292828eda755e) pacific (dev)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7f6c914c5980]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19c) [0x7f6c926463ce]
 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f6c92646558]
 6: ceph-mon(+0x7a1d02) [0x558dad62fd02]
 7: (Monitor::~Monitor()+0x9) [0x558dad62fd49]
 8: main()
 9: __libc_start_main()
 10: _start()
2 jobs: ['5758073', '5758087']
suites intersection: ['clusters/1a5s-mds-1c-client-3node', 'conf/{client', 'fs:workload/{begin', 'mds', 'mon', 'omap_limit/10000', 'osd-asserts', 'osd}', 'overrides/{frag_enable', 'scrub/yes', 'session_timeout', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'workunit/fs/misc}}']
suites union: ['clusters/1a5s-mds-1c-client-3node', 'conf/{client', 'distro/{rhel_8}', 'distro/{ubuntu_latest}', 'fs:workload/{begin', 'k-testing}', 'mds', 'mon', 'mount/fuse', 'mount/kclient/{mount', 'ms-die-on-skipped}}', 'objectstore-ec/bluestore-comp-ec-root', 'objectstore-ec/bluestore-ec-root', 'omap_limit/10000', 'osd-asserts', 'osd}', 'overrides/{distro/testing/{flavor/centos_latest', 'overrides/{frag_enable', 'ranks/3', 'ranks/5', 'scrub/yes', 'session_timeout', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'workunit/fs/misc}}']

Timeout 3h running clone.client.0/qa/workunits/fs/misc/multiple_rsync.sh
1 jobs: ['5758031']
suites: ['clusters/1a5s-mds-1c-client-3node', 'conf/{client', 'distro/{rhel_8}', 'fs:workload/{begin', 'k-testing}', 'mds', 'mon', 'mount/kclient/{mount', 'ms-die-on-skipped}}', 'objectstore-ec/bluestore-ec-root', 'omap_limit/10000', 'osd-asserts', 'osd}', 'overrides/{distro/testing/{flavor/ubuntu_latest', 'overrides/{frag_enable', 'ranks/5', 'scrub/yes', 'session_timeout', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'workunit/fs/misc}}']

Failure: Command failed (workunit test fs/misc/multiple_rsync.sh) on smithi204 with status 23: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=d0ed162b51928c50f20cee111f8292828eda755e TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/fs/misc/multiple_rsync.sh'
1 jobs: ['5758003']
suites: ['clusters/1a5s-mds-1c-client-3node', 'conf/{client', 'distro/{rhel_8}', 'fs:workload/{begin', 'mds', 'mon', 'mount/fuse', 'objectstore-ec/bluestore-comp', 'omap_limit/10000', 'osd-asserts', 'osd}', 'overrides/{frag_enable', 'ranks/3', 'scrub/yes', 'session_timeout', 'tasks/{0-check-counter', 'whitelist_health', 'whitelist_wrongly_marked_down}', 'workunit/fs/misc}}']

This seems to happen only with scrubs involving multiple active MDS daemons.


Related issues

Related to CephFS - Bug #48680: mds: scrubbing stuck "scrub active (0 inodes in the stack)" New

History

#1 Updated by Patrick Donnelly 9 months ago

  • Status changed from New to Triaged
  • Assignee set to Kotresh Hiremath Ravishankar

#2 Updated by Patrick Donnelly 9 months ago

  • Target version changed from v16.0.0 to v17.0.0
  • Backport set to pacific,octopus,nautilus

#3 Updated by Kotresh Hiremath Ravishankar 9 months ago

  • Status changed from Triaged to In Progress

#4 Updated by Patrick Donnelly 6 months ago

  • Related to Bug #48680: mds: scrubbing stuck "scrub active (0 inodes in the stack)" added

#5 Updated by Patrick Donnelly 6 months ago

Another: /ceph/teuthology-archive/pdonnell-2021-05-01_09:07:09-fs-wip-pdonnell-testing-20210501.040415-distro-basic-smithi/6087780/teuthology.log
