Bug #49983
Status: Closed
Test Failed with: "Scrubbing terminated -- not all pgs were active and clean." error.
Description
Test Run:
https://pulpito.ceph.com/nojha-2021-03-23_23:04:33-rados-wip-40323-2-distro-basic-gibba/5991116/
Failure Reason:
2021-03-24T20:56:27.553 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b3150e9e0aa7ca432e26f31d87920ebd77f3708/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b3150e9e0aa7ca432e26f31d87920ebd77f3708/teuthology/task/install/__init__.py", line 619, in task
    yield
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b3150e9e0aa7ca432e26f31d87920ebd77f3708/teuthology/run_tasks.py", line 176, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_5849de271cc68aa8a9f9c122ed82373715f40dc6/qa/tasks/ceph.py", line 1902, in task
    osd_scrub_pgs(ctx, config)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_5849de271cc68aa8a9f9c122ed82373715f40dc6/qa/tasks/ceph.py", line 1243, in osd_scrub_pgs
    raise RuntimeError("Scrubbing terminated -- not all pgs were active and clean.")
RuntimeError: Scrubbing terminated -- not all pgs were active and clean.
Analysis:
The test failed because scrubbing could not be performed on the OSDs: one PG (pg 2.27) was still in the active+remapped+backfilling state and did not finish backfilling within the time allotted by osd_scrub_pgs(). Two other PGs (2.31 and 2.37) completed their backfills before backfilling on 2.27 was picked up.
2021-03-24T20:53:52.337 INFO:tasks.ceph:Waiting for all PGs to be active+clean and split+merged, waiting on ['2.27'] to go clean
The op queue shard on which the backfill was running was managed by the mClock scheduler. With the default
"high_client_ops" profile enabled, recoveries/backfills receive a lower bandwidth allocation, so these
operations progress more slowly than they would under the WPQ scheduler.
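To confirm this on a live cluster, one can check which mClock profile the OSDs are using and inspect the stuck PG directly. A minimal sketch, assuming a Pacific-or-later cluster with the mClock scheduler enabled:

```shell
# Show the active mClock profile for OSDs (the default is high_client_ops,
# which favors client I/O over recovery/backfill bandwidth)
ceph config get osd osd_mclock_profile

# Query the PG that was stuck backfilling in this run (2.27 from the log)
# and look at its current state
ceph pg 2.27 query | grep '"state"'
```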
Proposed Fix:
Since slower recoveries/backfills are expected with the mClock scheduler, the following changes to the tests are proposed:
1. Call wait_for_clean() within qa/tasks/ceph.py in the task() function, just before calling
osd_scrub_pgs(). This immediately addresses the issue in the short term.
2. Modify the test specs to change the mClock profile to "high_recovery_ops" so that recovery ops receive a higher
bandwidth allocation. This is the longer-term fix, in combination with 1 above.
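The idea behind fix 1 is simply to poll the PG states and block until every PG reports active+clean (or a timeout expires) before scrubbing starts. The sketch below illustrates that polling loop in a self-contained way; get_pg_states is a hypothetical stand-in for querying the cluster (e.g. via the output of a PG dump), not the actual teuthology helper:

```python
import time

def wait_for_clean(get_pg_states, timeout=120, poll_interval=1.0):
    """Block until every PG is active+clean, or raise on timeout.

    get_pg_states is a callable returning a dict of pgid -> state string,
    standing in for a real cluster query.
    """
    deadline = time.monotonic() + timeout
    while True:
        states = get_pg_states()
        dirty = [pg for pg, s in states.items() if s != "active+clean"]
        if not dirty:
            return  # all PGs clean; safe to start scrubbing
        if time.monotonic() >= deadline:
            raise RuntimeError(
                f"timed out waiting on {dirty} to go clean")
        time.sleep(poll_interval)

# Example: simulate pg 2.27 finishing its backfill on the third poll.
calls = {"n": 0}
def fake_states():
    calls["n"] += 1
    if calls["n"] < 3:
        return {"2.27": "active+remapped+backfilling",
                "2.31": "active+clean"}
    return {"2.27": "active+clean", "2.31": "active+clean"}

wait_for_clean(fake_states, timeout=5, poll_interval=0.01)
```

Placing such a wait before osd_scrub_pgs() means the slower mClock-driven backfill only delays the test rather than failing it.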
Updated by Sridhar Seshasayee about 3 years ago
- Status changed from New to Fix Under Review
Updated by Neha Ojha about 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 3 years ago
- Copied to Backport #50018: pacific: Test Failed with: "Scrubbing terminated -- not all pgs were active and clean." error. added
Updated by Loïc Dachary about 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".