Bug #49983
Status: Closed
Test Failed with: "Scrubbing terminated -- not all pgs were active and clean." error.
Description
Test Run:
https://pulpito.ceph.com/nojha-2021-03-23_23:04:33-rados-wip-40323-2-distro-basic-gibba/5991116/
Failure Reason:
2021-03-24T20:56:27.553 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b3150e9e0aa7ca432e26f31d87920ebd77f3708/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b3150e9e0aa7ca432e26f31d87920ebd77f3708/teuthology/task/install/__init__.py", line 619, in task
    yield
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b3150e9e0aa7ca432e26f31d87920ebd77f3708/teuthology/run_tasks.py", line 176, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_5849de271cc68aa8a9f9c122ed82373715f40dc6/qa/tasks/ceph.py", line 1902, in task
    osd_scrub_pgs(ctx, config)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_5849de271cc68aa8a9f9c122ed82373715f40dc6/qa/tasks/ceph.py", line 1243, in osd_scrub_pgs
    raise RuntimeError("Scrubbing terminated -- not all pgs were active and clean.")
RuntimeError: Scrubbing terminated -- not all pgs were active and clean.
Analysis:
The test failed because scrubbing could not be performed on the OSDs: one PG (pg 2.27) was still in the active+remapped+backfilling state and did not finish backfilling within the time allotted by osd_scrub_pgs(). Two other PGs (2.31 and 2.37) completed their backfills before backfilling on 2.27 was picked up.
2021-03-24T20:53:52.337 INFO:tasks.ceph:Waiting for all PGs to be active+clean and split+merged, waiting on ['2.27'] to go clean
The op queue shard on which the backfill was running was managed by the mClock scheduler. With the default
"high_client_ops" profile enabled, recoveries/backfills receive a lower bandwidth allocation, so these
operations progress more slowly than they would under the WPQ scheduler.
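To confirm this on a live cluster, one can check which mClock profile the OSDs are using and inspect the stuck PG directly. A minimal sketch, assuming a Pacific-or-later cluster with the mClock scheduler enabled:

```shell
# Show the active mClock profile for OSDs (the default is high_client_ops,
# which favors client I/O over recovery/backfill bandwidth)
ceph config get osd osd_mclock_profile

# Query the PG that was stuck backfilling in this run (2.27 from the log)
# and look at its current state
ceph pg 2.27 query | grep '"state"'
```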
Proposed Fix:
Since slower recoveries/backfills are expected with the mClock scheduler, the following changes to the tests are proposed:
1. Call wait_for_clean() within qa/tasks/ceph.py in the task() function, just before calling
osd_scrub_pgs(). This immediately addresses the issue in the short term.
2. Modify the test specs to change the mClock profile to "high_recovery_ops" so that recovery ops receive a higher
bandwidth allocation. This is the longer-term fix, in combination with 1 above.
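The idea behind fix 1 is simply to poll the PG states and block until every PG reports active+clean (or a timeout expires) before scrubbing starts. The sketch below illustrates that polling loop in a self-contained way; get_pg_states is a hypothetical stand-in for querying the cluster (e.g. via the output of a PG dump), not the actual teuthology helper:

```python
import time

def wait_for_clean(get_pg_states, timeout=120, poll_interval=1.0):
    """Block until every PG is active+clean, or raise on timeout.

    get_pg_states is a callable returning a dict of pgid -> state string,
    standing in for a real cluster query.
    """
    deadline = time.monotonic() + timeout
    while True:
        states = get_pg_states()
        dirty = [pg for pg, s in states.items() if s != "active+clean"]
        if not dirty:
            return  # all PGs clean; safe to start scrubbing
        if time.monotonic() >= deadline:
            raise RuntimeError(
                f"timed out waiting on {dirty} to go clean")
        time.sleep(poll_interval)

# Example: simulate pg 2.27 finishing its backfill on the third poll.
calls = {"n": 0}
def fake_states():
    calls["n"] += 1
    if calls["n"] < 3:
        return {"2.27": "active+remapped+backfilling",
                "2.31": "active+clean"}
    return {"2.27": "active+clean", "2.31": "active+clean"}

wait_for_clean(fake_states, timeout=5, poll_interval=0.01)
```

Placing such a wait before osd_scrub_pgs() means the slower mClock-driven backfill only delays the test rather than failing it.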
Updated by Sridhar Seshasayee about 3 years ago
- Status changed from New to Fix Under Review
Updated by Neha Ojha about 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 3 years ago
- Copied to Backport #50018: pacific: Test Failed with: "Scrubbing terminated -- not all pgs were active and clean." error. added
Updated by Loïc Dachary about 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".