Bug #54511

test_pool_min_size: AssertionError: not clean before minsize thrashing starts

Added by Kamoltat (Junior) Sirivadhna about 2 years ago. Updated 11 months ago.

Status: Resolved
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: quincy, pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2022-03-04_00:56:58-rados-wip-yuri4-testing-2022-03-03-1448-distro-default-smithi/6719015

2022-03-04T03:06:27.624 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 189, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 1412, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 896, in test_pool_min_size
    'not clean before minsize thrashing starts'
AssertionError: not clean before minsize thrashing starts

2022-03-04T03:06:27.625 ERROR:tasks.thrashosds.thrasher:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 1280, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 189, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 1412, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_c8f79f870e0d6a996c92d420e6256d312bac1c7c/qa/tasks/ceph_manager.py", line 896, in test_pool_min_size
    'not clean before minsize thrashing starts'
AssertionError: not clean before minsize thrashing starts

This error occurs early in `test_pool_min_size`, where the test waits at most 60 seconds for the PGs to become active+clean and then asserts that all PGs are active+clean.
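For readers unfamiliar with this kind of check, here is a minimal sketch of a "wait up to N seconds, then assert" pattern. It is not the code in qa/tasks/ceph_manager.py; the helper names (`wait_until`, `all_active_clean`, `check_clean_before_thrashing`) and the `get_pg_states` callable are hypothetical and only illustrate the behaviour described above.

```python
import time


def wait_until(predicate, timeout, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it is true or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while clock() < deadline:
        if predicate():
            return True
        sleep(interval)
    # One last check once the deadline has been reached.
    return predicate()


def all_active_clean(pg_states):
    """True if every PG state string contains both 'active' and 'clean'."""
    return all(
        {"active", "clean"} <= set(state.split("+")) for state in pg_states
    )


def check_clean_before_thrashing(get_pg_states, timeout=60, **kwargs):
    """Hypothetical stand-in for the early check in test_pool_min_size."""
    clean = wait_until(
        lambda: all_active_clean(get_pg_states()), timeout, **kwargs
    )
    assert clean, 'not clean before minsize thrashing starts'
```

The `sleep` and `clock` parameters exist only so the sketch can be exercised with a fake clock instead of real waiting.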


Related issues

Related to RADOS - Bug #49777: test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds Resolved
Related to RADOS - Bug #51904: test_pool_min_size:AssertionError:wait_for_clean:failed before timeout expired due to down PGs Resolved
Copied to RADOS - Backport #57019: quincy: test_pool_min_size: AssertionError: not clean before minsize thrashing starts Resolved
Copied to RADOS - Backport #57020: pacific: test_pool_min_size: AssertionError: not clean before minsize thrashing starts Resolved

History

#1 Updated by Aishwarya Mathuria almost 2 years ago

/a/yuriw-2022-03-29_21:35:32-rados-wip-yuri5-testing-2022-03-29-1152-quincy-distro-default-smithi/6767633

#2 Updated by Radoslaw Zarzynski almost 2 years ago

Need to observe more thrashers/minsize_recovery where this issue happens.

#3 Updated by Radoslaw Zarzynski almost 2 years ago

  • Related to Bug #49777: test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds added

#4 Updated by Laura Flores almost 2 years ago

  • Related to Bug #51904: test_pool_min_size:AssertionError:wait_for_clean:failed before timeout expired due to down PGs added

#5 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

/a/ksirivad-2022-07-01_21:00:49-rados:thrash-erasure-code-main-distro-default-smithi/6910103/

#6 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

  • Description updated (diff)

#7 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

  • Description updated (diff)

#8 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 47138

#9 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

I was able to reproduce the problem after modifying qa/tasks/ceph_manager.py: https://github.com/ceph/ceph/pull/46931/commits/1f6bcbb3d680d8589e498b993d2cf566480e2c3e.

Runs in which the problem was reproduced with that modification:
/a/ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921351
/a/ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921372
/a/ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921374
/a/ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921382
/a/ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921383
/a/ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921385

Problem
We didn’t leave enough of a buffer between bringing an OSD back up and checking for active+clean. The PGs passed ceph_manager.wait_for_recovery and ceph_manager.wait_for_clean because recovery hadn’t started yet, and the test then failed at ceph_manager.is_clean(). My analysis can be found here:
https://docs.google.com/document/d/1HKQc5kO-A9c7ThYTGtUlgTliYyfy__0tFXXs2KHLsZg/edit

Solution
Wait for 60 seconds before calling ceph_manager.wait_for_recovery and ceph_manager.wait_for_clean.
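The race described under Problem can be illustrated with a toy model. This is only a sketch under stated assumptions, not the change in the pull request: `FakeCluster`, `clean_after_revive`, the `grace` parameter, and the 10 s / 30 s timings are all hypothetical, and the simulated `wait_for_recovery` merely mimics "return as soon as no PG reports recovering".

```python
class FakeCluster:
    """Toy model: after an OSD is revived, recovery starts only after a delay."""

    def __init__(self, recovery_start=10, recovery_done=30):
        self.now = 0                          # simulated seconds since revive
        self.recovery_start = recovery_start  # assumed delay before recovery
        self.recovery_done = recovery_done    # assumed time recovery finishes

    def sleep(self, seconds):
        self.now += seconds

    def is_recovering(self):
        return self.recovery_start <= self.now < self.recovery_done

    def is_clean(self):
        return self.now >= self.recovery_done

    def wait_for_recovery(self):
        # Returns once no PG is recovering. That is also the case *before*
        # recovery has begun, which is the gap the report describes.
        while self.is_recovering():
            self.sleep(1)


def clean_after_revive(cluster, grace=0):
    """Sleep for `grace` seconds, wait for recovery, then report cleanliness."""
    cluster.sleep(grace)
    cluster.wait_for_recovery()
    return cluster.is_clean()
```

In this model, `clean_after_revive(FakeCluster(), grace=0)` returns `False` because the wait returns before recovery has started, while `clean_after_revive(FakeCluster(), grace=60)` returns `True`.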

#10 Updated by Neha Ojha over 1 year ago

  • Assignee set to Kamoltat (Junior) Sirivadhna

#11 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to quincy, pacific

#12 Updated by Backport Bot over 1 year ago

  • Copied to Backport #57019: quincy: test_pool_min_size: AssertionError: not clean before minsize thrashing starts added

#13 Updated by Backport Bot over 1 year ago

  • Copied to Backport #57020: pacific: test_pool_min_size: AssertionError: not clean before minsize thrashing starts added

#14 Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed

#15 Updated by Kamoltat (Junior) Sirivadhna 11 months ago

  • Status changed from Pending Backport to Resolved
