Bug #49868

open

RuntimeError: Exiting scrub checking -- not all pgs scrubbed

Added by Neha Ojha about 3 years ago. Updated about 3 years ago.

Status: New
Priority: Normal
Assignee: David Zafman
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-03-17T00:29:07.666 INFO:tasks.ceph:pgid 2.7f last_scrub_stamp 2021-03-16T23:59:46.873365+0000 time.struct_time(tm_year=2021, tm_mon=3, tm_mday=16, tm_hour=23, tm_min=59, tm_sec=46, tm_wday=1, tm_yday=75, tm_isdst=-1) <= time.struct_time(tm_year=2021, tm_mon=3, tm_mday=17, tm_hour=0, tm_min=8, tm_sec=27, tm_wday=2, tm_yday=76, tm_isdst=0)
2021-03-17T00:29:07.667 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b96569170f15eae4604f361990ea65737b28dff1/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/git.ceph.com_ceph-c_28d383a7778b37b4ae7e48d1f4a054c4b4139cb1/qa/tasks/ceph.py", line 1893, in task
    osd_scrub_pgs(ctx, config)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_28d383a7778b37b4ae7e48d1f4a054c4b4139cb1/qa/tasks/ceph.py", line 1281, in osd_scrub_pgs
    raise RuntimeError('Exiting scrub checking -- not all pgs scrubbed.')
RuntimeError: Exiting scrub checking -- not all pgs scrubbed.

/a/sage-2021-03-16_20:18:15-rados-wip-sage2-testing-2021-03-16-0838-pacific-distro-basic-smithi/5971648
More in /a/sage-2021-03-16_20:18:15-rados-wip-sage2-testing-2021-03-16-0838-pacific-distro-basic-smithi
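
The log line above compares a PG's last_scrub_stamp with a check time, both shown as time.struct_time values. The following is a minimal sketch of that kind of comparison, not the actual code in qa/tasks/ceph.py. The helper names parse_stamp and unscrubbed_pgs and the second PG entry in the usage example are hypothetical.

```python
import time

STAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_stamp(stamp: str) -> time.struct_time:
    """Parse a stamp such as '2021-03-16T23:59:46.873365+0000'.

    Only the first 19 characters (date and whole seconds) are used;
    fractional seconds and the UTC offset are ignored in this sketch.
    """
    return time.strptime(stamp[:19], STAMP_FORMAT)


def unscrubbed_pgs(pg_stamps: dict, check_time: time.struct_time) -> list:
    """Return the PG ids whose last scrub is not newer than check_time.

    time.struct_time behaves like a tuple, so '<=' compares year, month,
    day, hour, minute and second in that order.
    """
    return sorted(
        pgid
        for pgid, stamp in pg_stamps.items()
        if parse_stamp(stamp) <= check_time
    )


if __name__ == "__main__":
    check_time = time.strptime("2021-03-17T00:08:27", STAMP_FORMAT)
    stamps = {
        "2.7f": "2021-03-16T23:59:46.873365+0000",  # value from the log above
        "2.1": "2021-03-17T00:10:00.000000+0000",   # hypothetical PG scrubbed later
    }
    print(unscrubbed_pgs(stamps, check_time))
```

With the values from the log, pgid 2.7f is reported because its last scrub predates the check time.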


Related issues 2 (0 open, 2 closed)

Related to RADOS - Bug #48843: Get more parallel scrubs within osd_max_scrubs limits (Resolved, David Zafman)
Has duplicate RADOS - Bug #50140: test/thrash - scrub: "not all pgs scrubbed" due to short rescrubbing period (Duplicate, David Zafman)

Actions #1

Updated by Neha Ojha about 3 years ago

  • Assignee set to David Zafman

The failure must come from something merged after 37f9d0a25d06a6b8529aa350110eba930fba8c9e, since https://pulpito.ceph.com/yuriw-2021-03-15_23:42:14-rados-wip-yuriw-pacific_3.15.21-distro-basic-smithi/ had no such failures.

e0ed0122526791547a317c6ca19ed081a92dfe69 seems to be the only related change.

Actions #2

Updated by Neha Ojha about 3 years ago

I think we should revert this in pacific https://github.com/ceph/ceph/pull/40195, until we can fix the test failures.

Actions #3

Updated by Neha Ojha about 3 years ago

  • Related to Bug #48843: Get more parallel scrubs within osd_max_scrubs limits added
Actions #4

Updated by Kefu Chai about 3 years ago

/kchai-2021-03-26_05:32:58-rados-wip-kefu-testing-2021-03-26-1134-distro-basic-smithi/6001105/

Actions #5

Updated by Ronen Friedman about 3 years ago

In the log I've checked (http://pulpito.front.sepia.ceph.com/rfriedma-2021-04-01_17:51:51-rados-wip-ronenf-cscrub-class-distro-basic-smithi/6015307/), the cause is a combination of:
- the re-scrub period ("osd scrub min interval") being set to only 60s in rados/thrash*;
- a large set of PGs to scrub;
- a PG that failed to reserve replica resources.

The failure flag is only erased once the queue of PGs to scrub is empty. Under the
first two conditions, that never happens.

Possible fixes to consider:

- a simple fix: extend the tests' min-scrub-time;
- possibly better: change the handling of the "failed once in achieving replicas' resources"
flag so that it is cleared periodically.
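
The interaction described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the OSD's scrub scheduler: one scrub takes one tick, only one scrub runs at a time, and a PG whose reservation-failure flag is set is skipped until the flag is erased. The function simulate and its parameters are hypothetical.

```python
def simulate(num_pgs, min_interval, ticks, clear_every=None):
    """Return True if the flagged PG (PG 0) is ever scrubbed.

    num_pgs      -- number of PGs competing for scrubs
    min_interval -- ticks after which a scrubbed PG is due again
    ticks        -- how long to run the model
    clear_every  -- if set, also erase the failure flag every N ticks
                    (models the "periodically cleared" fix)
    """
    flagged_pg = 0
    flag_set = True            # PG 0 failed to reserve replicas earlier
    next_due = [0] * num_pgs   # tick at which each PG becomes due

    for now in range(ticks):
        if clear_every and now > 0 and now % clear_every == 0:
            flag_set = False

        due = [pg for pg in range(num_pgs) if next_due[pg] <= now]
        # Assumption: the flagged PG is not schedulable while the flag is set.
        candidates = [pg for pg in due if not (flag_set and pg == flagged_pg)]

        if not candidates:
            # Nothing else is waiting: the flag is erased.
            flag_set = False
            continue

        pg = min(candidates, key=lambda p: next_due[p])  # longest-waiting first
        if pg == flagged_pg:
            return True
        next_due[pg] = now + min_interval

    return False


if __name__ == "__main__":
    print(simulate(num_pgs=10, min_interval=5, ticks=1000))    # short interval
    print(simulate(num_pgs=10, min_interval=50, ticks=100))    # longer interval
    print(simulate(num_pgs=10, min_interval=5, ticks=1000, clear_every=20))
```

In this model, a short interval with many PGs keeps the queue permanently non-empty, so the flagged PG is never scrubbed. Either a longer interval or periodic clearing of the flag lets it through.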

Actions #6

Updated by Neha Ojha about 3 years ago

  • Has duplicate Bug #50140: test/thrash - scrub: "not all pgs scrubbed" due to short rescrubbing period added
Actions #7

Updated by Neha Ojha about 3 years ago

https://github.com/ceph/ceph/pull/40623 - being reverted in master for the time being

Actions #8

Updated by Yuri Weinstein about 3 years ago

Neha Ojha wrote:

https://github.com/ceph/ceph/pull/40623 - being reverted in master for the time being

merged

Actions #9

Updated by Neha Ojha about 3 years ago

  • Priority changed from Immediate to Normal

Since the original patches have been reverted in pacific and master, downgrading this bug.
