Project

General

Profile

Bug #49777

test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds

Added by Deepika Upadhyay almost 3 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:

100%

Source:
Tags:
Backport:
quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-03-11T20:42:36.926 INFO:teuthology.orchestra.run.smithi078.stderr:Error ENXIO: problem getting command descriptions from osd.2
2021-03-11T20:42:37.413 INFO:tasks.thrashosds.thrasher:not all PGs are active or peered
2021-03-11T20:42:37.415 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph_5e61b15880d42cb4248c17ea9494bc11be2b7a78/qa/tasks/ceph_manager.py", line 121, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph_5e61b15880d42cb4248c17ea9494bc11be2b7a78/qa/tasks/ceph_manager.py", line 1207, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph_5e61b15880d42cb4248c17ea9494bc11be2b7a78/qa/tasks/ceph_manager.py", line 879, in test_pool_min_size
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b7dbd64ff8500a97d0659a105ea1a1f415ec323b/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds

/ceph/teuthology-archive/yuriw-2021-03-11_19:01:40-rados-octopus-distro-basic-smithi/5956659/teuthology.log


Related issues

Related to RADOS - Bug #54511: test_pool_min_size: AssertionError: not clean before minsize thrashing starts Resolved
Related to RADOS - Bug #51904: test_pool_min_size:AssertionError:wait_for_clean:failed before timeout expired due to down PGs Resolved
Copied to RADOS - Backport #57022: pacific: test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds Resolved
Copied to RADOS - Backport #57023: octopus: test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds Rejected
Copied to RADOS - Backport #57024: quincy: test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds Resolved

History

#1 Updated by Neha Ojha almost 3 years ago

  • Subject changed from octopus: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds to 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds
  • Backport set to pacific, octopus
2021-04-05T23:44:21.623 INFO:tasks.thrashosds.thrasher:not all PGs are active or peered
2021-04-05T23:44:21.623 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph_0f1aa79a4a47c870a3ab6944e8d57f0957ff349b/qa/tasks/ceph_manager.py", line 173, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph_0f1aa79a4a47c870a3ab6944e8d57f0957ff349b/qa/tasks/ceph_manager.py", line 1389, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph_0f1aa79a4a47c870a3ab6944e8d57f0957ff349b/qa/tasks/ceph_manager.py", line 947, in test_pool_min_size
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds

2021-04-05T23:44:21.624 ERROR:tasks.thrashosds.thrasher:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph_0f1aa79a4a47c870a3ab6944e8d57f0957ff349b/qa/tasks/ceph_manager.py", line 1257, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/github.com_ceph_ceph_0f1aa79a4a47c870a3ab6944e8d57f0957ff349b/qa/tasks/ceph_manager.py", line 173, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph_0f1aa79a4a47c870a3ab6944e8d57f0957ff349b/qa/tasks/ceph_manager.py", line 1389, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph_0f1aa79a4a47c870a3ab6944e8d57f0957ff349b/qa/tasks/ceph_manager.py", line 947, in test_pool_min_size
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds

This seems to be the problematic PG.

{"pgid":"1.4","version":"91'279","reported_seq":"808","reported_epoch":"125","state":"down+remapped","last_fresh":"2021-04-05T23:44:16.305653+0000","last_change":"2021-04-05T23:43:39.185673+0000","last_active":"2021-04-05T23:43:38.179472+0000","last_peered":"2021-04-05T23:43:38.138830+0000","last_clean":"2021-04-05T23:43:34.184443+0000","last_became_active":"2021-04-05T23:43:37.519545+0000","last_became_peered":"2021-04-05T23:43:37.519545+0000","last_unstale":"2021-04-05T23:44:16.305653+0000","last_undegraded":"2021-04-05T23:44:16.305653+0000","last_fullsized":"2021-04-05T23:44:16.305653+0000","mapping_epoch":99,"log_start":"0'0","ondisk_log_start":"0'0","created":17,"last_epoch_clean":83,"parent":"0.0","parent_split_bits":0,"last_scrub":"0'0","last_scrub_stamp":"2021-04-05T23:42:12.174643+0000","last_deep_scrub":"0'0","last_deep_scrub_stamp":"2021-04-05T23:42:12.174643+0000","last_clean_scrub_stamp":"2021-04-05T23:42:12.174643+0000","log_size":279,"ondisk_log_size":279,"stats_invalid":false,"dirty_stats_invalid":false,"omap_stats_invalid":false,"hitset_stats_invalid":false,"hitset_bytes_stats_invalid":false,"pin_stats_invalid":false,"manifest_stats_invalid":false,"snaptrimq_len":0,"stat_sum":{"num_bytes":73375744,"num_objects":72,"num_object_clones":7,"num_object_copies":216,"num_objects_missing_on_primary":0,"num_objects_missing":0,"num_objects_degraded":0,"num_objects_misplaced":0,"num_objects_unfound":0,"num_objects_dirty":72,"num_whiteouts":2,"num_read":175,"num_read_kb":25975,"num_write":257,"num_write_kb":75064,"num_scrub_errors":0,"num_shallow_scrub_errors":0,"num_deep_scrub_errors":0,"num_objects_recovered":81,"num_bytes_recovered":89333760,"num_keys_recovered":0,"num_objects_omap":0,"num_objects_hit_set_archive":0,"num_bytes_hit_set_archive":0,"num_flush":0,"num_flush_kb":0,"num_evict":0,"num_evict_kb":0,"num_promote":0,"num_flush_mode_high":0,"num_flush_mode_low":0,"num_evict_mode_some":0,"num_evict_mode_full":0,"num_objects_pinned":0,"num_legacy_snapsets":0,"num_large_omap_objects":0,"num_objects_manifest":0,"num_omap_bytes":0,"num_omap_keys":0,"num_objects_repaired":0},"up":[3,4,7],"acting":[3,2147483647,2147483647],"avail_no_missing":[],"object_location_counts":[],"blocked_by":[5],"up_primary":3,"acting_primary":3,"purged_snaps":[{"start":"1c","length":"1"},{"start":"3d","length":"1"},{"start":"3f","length":"1"}]}

/a/nojha-2021-04-05_22:22:02-rados-master-distro-basic-smithi/6023105 - no logs

#2 Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-03-25_18:42:52-rados-wip-yuri7-testing-2022-03-24-1341-pacific-distro-default-smithi/676107 -- logs available

Description: rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/normal mon_election/connectivity msgr-failures/few rados recovery-overrides/{more-async-recovery} supported-random-distro$/{rhel_8} thrashers/minsize_recovery thrashosds-health workloads/ec-pool-snaps-few-objects-overwrites}

#3 Updated by Aishwarya Mathuria almost 2 years ago

/a/yuriw-2022-03-29_21:35:32-rados-wip-yuri5-testing-2022-03-29-1152-quincy-distro-default-smithi/6767532

#4 Updated by Radoslaw Zarzynski almost 2 years ago

  • Subject changed from 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds to test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds

#5 Updated by Radoslaw Zarzynski almost 2 years ago

  • Related to Bug #54511: test_pool_min_size: AssertionError: not clean before minsize thrashing starts added

#6 Updated by Radoslaw Zarzynski almost 2 years ago

  • Assignee set to Kamoltat (Junior) Sirivadhna

#7 Updated by Laura Flores over 1 year ago

/a/yuriw-2022-05-31_21:35:41-rados-wip-yuri2-testing-2022-05-31-1300-pacific-distro-default-smithi/6856269

Description: rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/normal mon_election/connectivity msgr-failures/few rados recovery-overrides/{more-async-recovery} supported-random-distro$/{ubuntu_latest} thrashers/minsize_recovery thrashosds-health workloads/ec-pool-snaps-few-objects-overwrites}

The common piece among all of these failures seems to be "thrashers/minsize_recovery".

#8 Updated by Radoslaw Zarzynski over 1 year ago

Maybe let's talk on that in one of the RADOS Team meetings.

#9 Updated by Radoslaw Zarzynski over 1 year ago

  • Backport changed from pacific, octopus to pacific,octopus,quincy

Maybe let's talk on that in one of the RADOS Team meetings.

#10 Updated by Laura Flores over 1 year ago

Running some tests to try and reproduce the issue and get a sense of how frequently it fails. This has actually been seen in both thrash-erasure-code AND thrash-erasure-code-overwrites, which share the same `minsize_recovery` thrasher:

http://pulpito.front.sepia.ceph.com/lflores-2022-06-14_21:33:15-rados:thrash-erasure-code-main-distro-default-smithi/
  • 2/100 tests failed due to the `check for active or peered` timeout
  • All other failures (except for 1 infra failure) were due to https://tracker.ceph.com/issues/51904
  • /a/lflores-2022-06-14_21:33:15-rados:thrash-erasure-code-main-distro-default-smithi/6878849
  • /a/lflores-2022-06-14_21:33:15-rados:thrash-erasure-code-main-distro-default-smithi/6878853
http://pulpito.front.sepia.ceph.com/lflores-2022-06-14_21:37:13-rados:thrash-erasure-code-overwrites-main-distro-default-smithi/

#11 Updated by Laura Flores over 1 year ago

  • Related to Bug #51904: test_pool_min_size:AssertionError:wait_for_clean:failed before timeout expired due to down PGs added

#12 Updated by Sridhar Seshasayee over 1 year ago

/a/yuriw-2022-06-15_18:29:33-rados-wip-yuri4-testing-2022-06-15-1000-pacific-distro-default-smithi/6881131

Description: rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/normal mon_election/connectivity msgr-failures/few rados recovery-overrides/{more-async-partial-recovery} supported-random-distro$/{centos_8} thrashers/minsize_recovery thrashosds-health workloads/ec-pool-snaps-few-objects-overwrites}

#13 Updated by Laura Flores over 1 year ago

Sridhar Seshasayee wrote:

/a/yuriw-2022-06-15_18:29:33-rados-wip-yuri4-testing-2022-06-15-1000-pacific-distro-default-smithi/6881131

Description: rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/normal mon_election/connectivity msgr-failures/few rados recovery-overrides/{more-async-partial-recovery} supported-random-distro$/{centos_8} thrashers/minsize_recovery thrashosds-health workloads/ec-pool-snaps-few-objects-overwrites}

In this instance, 3 pgs went down and never left that state. This is where they first went down in the ceph.log:

/a/yuriw-2022-06-15_18:29:33-rados-wip-yuri4-testing-2022-06-15-1000-pacific-distro-default-smithi/6881131/remote/smithi074/log/ceph.log.gz

2022-06-15T19:25:47.132035+0000 mgr.y (mgr.4104) 31 : cluster [DBG] pgmap v60: 16 pgs: 2 activating+undersized+degraded+remapped, 1 active+recovering+undersized+degraded+remapped, 10 active+clean, 3 down; 119 MiB data, 223 MiB used, 630 GiB / 630 GiB avail; 17/174 objects degraded (9.770%); 2.2 MiB/s, 0 objects/s recovering

This one had only one PG that was down+remapped, but again, it never left that state.
/a/lflores-2022-06-14_21:33:15-rados:thrash-erasure-code-main-distro-default-smithi/6878849

2022-06-15T01:10:24.792612+0000 mgr.x (mgr.4114) 39 : cluster [DBG] pgmap v71: 16 pgs: 1 active+recovery_wait+undersized+degraded+remapped, 1 creating+peering, 1 down+remapped, 10 active+clean, 3 active+recovering+undersized+degraded+remapped; 1.2 GiB data, 1.7 GiB used, 628 GiB / 630 GiB avail; 8.9 MiB/s rd, 2.8 MiB/s wr, 30 op/s; 267/3288 objects degraded (8.120%); 18 MiB/s, 15 objects/s recovering

This one had two PGs where one was down and another was down+remapped. The main detail that seems to matter is that at least one PG is down, and it doesn't make progress:
/a/yuriw-2022-05-31_21:35:41-rados-wip-yuri2-testing-2022-05-31-1300-pacific-distro-default-smithi/6856269

2022-05-31T22:09:47.122557+0000 mgr.x (mgr.4112) 32 : cluster [DBG] pgmap v58: 16 pgs: 1 down, 1 down+remapped, 4 active+undersized+degraded, 10 active+clean; 144 MiB data, 241 MiB used, 630 GiB / 630 GiB avail; 2.8 MiB/s rd, 271 KiB/s wr, 2 op/s; 17/198 objects degraded (8.586%)

#14 Updated by Laura Flores over 1 year ago

/a/lflores-2022-06-27_23:44:12-rados:thrash-erasure-code-wip-yuri2-testing-2022-04-26-1132-octopus-distro-default-smithi/6901540

Description: rados:thrash-erasure-code/{ceph clusters/{fixed-2 openstack} fast/fast msgr-failures/osd-delay objectstore/bluestore-bitmap rados recovery-overrides/{more-partial-recovery} supported-random-distro$/{centos_8} thrashers/minsize_recovery thrashosds-health workloads/ec-small-objects-balanced}

#15 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

  • Status changed from New to In Progress

#16 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 47138

#17 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

I was able to reproduce the problem after modifying qa/tasks/ceph_manager.py: https://github.com/ceph/ceph/pull/46931/commits/1f6bcbb3d680d8589e498b993d2cf566480e2c3e.

Here are the runs that I reproduced with the modified test PR:
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921348
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921352
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921354
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921357
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921360
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921365
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921367
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921368
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921370
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921380
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921387

The Problem
When we kill and revive 2 OSDs consecutively that are both in a PGs acting set, the PG will go temporarily into a down state, however, it is waiting to recover some of its shards. test_pool_min_size:ceph_manager.all_active_or_peered() only allow 25 seconds in total for the recovery to finish, which is not enough.

Solution
Increase the timeout from 25 seconds to 100 seconds total in test_pool_min_size:ceph_manager.all_active_or_peered().

#18 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

  • Status changed from Fix Under Review to Pending Backport

#19 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

  • Backport changed from pacific,octopus,quincy to quincy,pacific

#20 Updated by Backport Bot over 1 year ago

  • Copied to Backport #57022: pacific: test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds added

#21 Updated by Backport Bot over 1 year ago

  • Copied to Backport #57023: octopus: test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds added

#22 Updated by Backport Bot over 1 year ago

  • Copied to Backport #57024: quincy: test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds added

#23 Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed

#24 Updated by Konstantin Shalygin over 1 year ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100
  • Tags deleted (backport_processed)

Also available in: Atom PDF