Bug #49983


Test Failed with: "Scrubbing terminated -- not all pgs were active and clean." error.

Added by Sridhar Seshasayee about 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:
0%

Source:
Development
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Test Run:
https://pulpito.ceph.com/nojha-2021-03-23_23:04:33-rados-wip-40323-2-distro-basic-gibba/5991116/

Failure Reason:

2021-03-24T20:56:27.553 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b3150e9e0aa7ca432e26f31d87920ebd77f3708/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b3150e9e0aa7ca432e26f31d87920ebd77f3708/teuthology/task/install/__init__.py", line 619, in task
    yield
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b3150e9e0aa7ca432e26f31d87920ebd77f3708/teuthology/run_tasks.py", line 176, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_5849de271cc68aa8a9f9c122ed82373715f40dc6/qa/tasks/ceph.py", line 1902, in task
    osd_scrub_pgs(ctx, config)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_5849de271cc68aa8a9f9c122ed82373715f40dc6/qa/tasks/ceph.py", line 1243, in osd_scrub_pgs
    raise RuntimeError("Scrubbing terminated -- not all pgs were active and clean.")
RuntimeError: Scrubbing terminated -- not all pgs were active and clean.

Analysis:
The test failed because scrubbing couldn't be performed on the OSDs: one PG (pg 2.27) was still in the active+remapped+backfilling state and did not finish backfilling in the time allotted within osd_scrub_pgs(). Two other PGs (2.31 and 2.37) completed backfilling before backfill on 2.27 was picked up.

2021-03-24T20:53:52.337 INFO:tasks.ceph:Waiting for all PGs to be active+clean and split+merged, waiting on ['2.27'] to go clean

The op queue shard on which backfilling was ongoing was managed by the mClock scheduler. With the default
profile, "high_client_ops", recoveries/backfills are given a lower bandwidth allocation, so these
operations progress more slowly than with the WPQ scheduler.
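For reference, the mClock profile is an OSD config option that can be inspected and switched at runtime via the `ceph config` CLI. A minimal sketch, assuming the standard `ceph` admin CLI is available (`osd_mclock_profile` is the real option name; the helper functions here are hypothetical):

```python
import subprocess

# Built-in mClock profiles; "high_client_ops" is the default and gives
# recovery/backfill ops the smallest bandwidth allocation.
VALID_PROFILES = ("high_client_ops", "balanced", "high_recovery_ops")

def mclock_profile_cmd(profile):
    # Build the `ceph config set` invocation that switches the mClock
    # profile for all OSDs.
    if profile not in VALID_PROFILES:
        raise ValueError(f"unknown mClock profile: {profile}")
    return ["ceph", "config", "set", "osd", "osd_mclock_profile", profile]

def set_mclock_profile(profile):
    # Run the command against a live cluster (requires an admin keyring).
    subprocess.check_call(mclock_profile_cmd(profile))
```

Switching a test cluster to "high_recovery_ops" would let backfill on a PG like 2.27 finish sooner, at the cost of client op bandwidth.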

Proposed Fix:
Since slower recoveries/backfills are expected with the mClock scheduler, the following test changes are proposed:
1. Call wait_for_clean() within qa/tasks/ceph.py in the task() function just prior to calling
osd_scrub_pgs(). This would immediately address the issue in the short term.
2. Modify the test specs to change the mClock profile to "high_recovery_ops" so that recovery ops are given a higher
bandwidth allocation. This would be the longer-term fix, alongside 1 above.
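Fix 1 amounts to polling PG state until every PG is active+clean before the final scrub pass starts. A minimal sketch of that polling loop, not the actual teuthology CephManager.wait_for_clean implementation (`get_pg_states` is a hypothetical callback returning a pgid-to-state mapping):

```python
import time

def wait_for_clean(get_pg_states, timeout=300, interval=5):
    # Poll PG states until every PG reports active+clean, or raise on
    # timeout -- mirroring the failure mode seen in osd_scrub_pgs().
    deadline = time.time() + timeout
    while True:
        states = get_pg_states()  # e.g. {'2.27': 'active+remapped+backfilling'}
        dirty = [pg for pg, state in states.items() if state != 'active+clean']
        if not dirty:
            return
        if time.time() > deadline:
            raise RuntimeError(
                f"wait_for_clean timed out, waiting on {dirty} to go clean")
        time.sleep(interval)
```

Running this before osd_scrub_pgs() gives mClock-throttled backfills their own wait budget instead of eating into the scrub loop's time allotment.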


Related issues: 1 (0 open, 1 closed)

Copied to RADOS - Backport #50018: pacific: Test Failed with: "Scrubbing terminated -- not all pgs were active and clean." error. (Resolved, Sridhar Seshasayee)
#1

Updated by Neha Ojha about 3 years ago

  • Project changed from bluestore to RADOS
#2

Updated by Sridhar Seshasayee about 3 years ago

  • Pull request ID set to 40415
#3

Updated by Sridhar Seshasayee about 3 years ago

  • Status changed from New to Fix Under Review
#4

Updated by Neha Ojha about 3 years ago

  • Backport set to pacific
#5

Updated by Neha Ojha about 3 years ago

  • Status changed from Fix Under Review to Pending Backport
#6

Updated by Backport Bot about 3 years ago

  • Copied to Backport #50018: pacific: Test Failed with: "Scrubbing terminated -- not all pgs were active and clean." error. added
#7

Updated by Loïc Dachary about 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

