Ceph &raquo; RADOS

Category:

Target version:

% Done:

100%

Source:

Tags:

Backport:

quincy,pacific

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(RADOS):

Pull request ID:

47138

Crash signature (v1):

Crash signature (v2):

Description

2021-03-11T20:42:36.926 INFO:teuthology.orchestra.run.smithi078.stderr:Error ENXIO: problem getting command descriptions from osd.2

2021-03-11T20:42:37.413 INFO:tasks.thrashosds.thrasher:not all PGs are active or peered
2021-03-11T20:42:37.415 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph_5e61b15880d42cb4248c17ea9494bc11be2b7a78/qa/tasks/ceph_manager.py", line 121, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph_5e61b15880d42cb4248c17ea9494bc11be2b7a78/qa/tasks/ceph_manager.py", line 1207, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph_5e61b15880d42cb4248c17ea9494bc11be2b7a78/qa/tasks/ceph_manager.py", line 879, in test_pool_min_size
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b7dbd64ff8500a97d0659a105ea1a1f415ec323b/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds

/ceph/teuthology-archive/yuriw-2021-03-11_19:01:40-rados-octopus-distro-basic-smithi/5956659/teuthology.log

Related issues 5 (0 open — 5 closed)

Related to RADOS - Bug #54511: test_pool_min_size: AssertionError: not clean before minsize thrashing starts

Resolved

Related to RADOS - Bug #51904: test_pool_min_size:AssertionError:wait_for_clean:failed before timeout expired due to down PGs

Resolved

Copied to RADOS - Backport #57022: pacific: test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds

Resolved

Copied to RADOS - Backport #57023: octopus: test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds

Rejected

Copied to RADOS - Backport #57024: quincy: test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds

Resolved

Updated by Neha Ojha about 3 years ago

Subject changed from octopus: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds to 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds
Backport set to pacific, octopus

2021-04-05T23:44:21.623 INFO:tasks.thrashosds.thrasher:not all PGs are active or peered
2021-04-05T23:44:21.623 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph_0f1aa79a4a47c870a3ab6944e8d57f0957ff349b/qa/tasks/ceph_manager.py", line 173, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph_0f1aa79a4a47c870a3ab6944e8d57f0957ff349b/qa/tasks/ceph_manager.py", line 1389, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph_0f1aa79a4a47c870a3ab6944e8d57f0957ff349b/qa/tasks/ceph_manager.py", line 947, in test_pool_min_size
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds

2021-04-05T23:44:21.624 ERROR:tasks.thrashosds.thrasher:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph_0f1aa79a4a47c870a3ab6944e8d57f0957ff349b/qa/tasks/ceph_manager.py", line 1257, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/github.com_ceph_ceph_0f1aa79a4a47c870a3ab6944e8d57f0957ff349b/qa/tasks/ceph_manager.py", line 173, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph_0f1aa79a4a47c870a3ab6944e8d57f0957ff349b/qa/tasks/ceph_manager.py", line 1389, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph_0f1aa79a4a47c870a3ab6944e8d57f0957ff349b/qa/tasks/ceph_manager.py", line 947, in test_pool_min_size
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds

This seems to be the problematic PG.

{"pgid":"1.4","version":"91'279","reported_seq":"808","reported_epoch":"125","state":"down+remapped","last_fresh":"2021-04-05T23:44:16.305653+0000","last_change":"2021-04-05T23:43:39.185673+0000","last_active":"2021-04-05T23:43:38.179472+0000","last_peered":"2021-04-05T23:43:38.138830+0000","last_clean":"2021-04-05T23:43:34.184443+0000","last_became_active":"2021-04-05T23:43:37.519545+0000","last_became_peered":"2021-04-05T23:43:37.519545+0000","last_unstale":"2021-04-05T23:44:16.305653+0000","last_undegraded":"2021-04-05T23:44:16.305653+0000","last_fullsized":"2021-04-05T23:44:16.305653+0000","mapping_epoch":99,"log_start":"0'0","ondisk_log_start":"0'0","created":17,"last_epoch_clean":83,"parent":"0.0","parent_split_bits":0,"last_scrub":"0'0","last_scrub_stamp":"2021-04-05T23:42:12.174643+0000","last_deep_scrub":"0'0","last_deep_scrub_stamp":"2021-04-05T23:42:12.174643+0000","last_clean_scrub_stamp":"2021-04-05T23:42:12.174643+0000","log_size":279,"ondisk_log_size":279,"stats_invalid":false,"dirty_stats_invalid":false,"omap_stats_invalid":false,"hitset_stats_invalid":false,"hitset_bytes_stats_invalid":false,"pin_stats_invalid":false,"manifest_stats_invalid":false,"snaptrimq_len":0,"stat_sum":{"num_bytes":73375744,"num_objects":72,"num_object_clones":7,"num_object_copies":216,"num_objects_missing_on_primary":0,"num_objects_missing":0,"num_objects_degraded":0,"num_objects_misplaced":0,"num_objects_unfound":0,"num_objects_dirty":72,"num_whiteouts":2,"num_read":175,"num_read_kb":25975,"num_write":257,"num_write_kb":75064,"num_scrub_errors":0,"num_shallow_scrub_errors":0,"num_deep_scrub_errors":0,"num_objects_recovered":81,"num_bytes_recovered":89333760,"num_keys_recovered":0,"num_objects_omap":0,"num_objects_hit_set_archive":0,"num_bytes_hit_set_archive":0,"num_flush":0,"num_flush_kb":0,"num_evict":0,"num_evict_kb":0,"num_promote":0,"num_flush_mode_high":0,"num_flush_mode_low":0,"num_evict_mode_some":0,"num_evict_mode_full":0,"num_objects_pinned":0,"num_legacy_snapsets":0,"num_large_omap_objects":0,"num_objects_manifest":0,"num_omap_bytes":0,"num_omap_keys":0,"num_objects_repaired":0},"up":[3,4,7],"acting":[3,2147483647,2147483647],"avail_no_missing":[],"object_location_counts":[],"blocked_by":[5],"up_primary":3,"acting_primary":3,"purged_snaps":[{"start":"1c","length":"1"},{"start":"3d","length":"1"},{"start":"3f","length":"1"}]}

/a/nojha-2021-04-05_22:22:02-rados-master-distro-basic-smithi/6023105 - no logs

Actions

Updated by Laura Flores about 2 years ago

/a/yuriw-2022-03-25_18:42:52-rados-wip-yuri7-testing-2022-03-24-1341-pacific-distro-default-smithi/676107 -- logs available

Description: rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/normal mon_election/connectivity msgr-failures/few rados recovery-overrides/{more-async-recovery} supported-random-distro$/{rhel_8} thrashers/minsize_recovery thrashosds-health workloads/ec-pool-snaps-few-objects-overwrites}

Actions

Updated by Aishwarya Mathuria about 2 years ago

/a/yuriw-2022-03-29_21:35:32-rados-wip-yuri5-testing-2022-03-29-1152-quincy-distro-default-smithi/6767532

Actions

Updated by Radoslaw Zarzynski about 2 years ago

Subject changed from 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds to test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds

Actions

Updated by Radoslaw Zarzynski about 2 years ago

Related to Bug #54511: test_pool_min_size: AssertionError: not clean before minsize thrashing starts added

Actions

Updated by Radoslaw Zarzynski about 2 years ago

Assignee set to Kamoltat (Junior) Sirivadhna

Actions

Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-05-31_21:35:41-rados-wip-yuri2-testing-2022-05-31-1300-pacific-distro-default-smithi/6856269

Description: rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/normal mon_election/connectivity msgr-failures/few rados recovery-overrides/{more-async-recovery} supported-random-distro$/{ubuntu_latest} thrashers/minsize_recovery thrashosds-health workloads/ec-pool-snaps-few-objects-overwrites}

The common piece among all of these failures seems to be "thrashers/minsize_recovery".

Actions

Updated by Radoslaw Zarzynski almost 2 years ago

Maybe let's talk on that in one of the RADOS Team meetings.

Actions

Updated by Radoslaw Zarzynski almost 2 years ago

Backport changed from pacific, octopus to pacific,octopus,quincy

Maybe let's talk on that in one of the RADOS Team meetings.

Actions

http://pulpito.front.sepia.ceph.com/lflores-2022-06-14_21:33:15-rados:thrash-erasure-code-main-distro-default-smithi/

#10

Updated by Laura Flores almost 2 years ago

Running some tests to try and reproduce the issue and get a sense of how frequently it fails. This has actually been seen in both thrash-erasure-code AND thrash-erasure-code-overwrites, which share the same `minsize_recovery` thrasher:

2/100 tests failed due to the `check for active or peered` timeout
All other failures (except for 1 infra failure) were due to https://tracker.ceph.com/issues/51904
/a/lflores-2022-06-14_21:33:15-rados:thrash-erasure-code-main-distro-default-smithi/6878849
/a/lflores-2022-06-14_21:33:15-rados:thrash-erasure-code-main-distro-default-smithi/6878853

http://pulpito.front.sepia.ceph.com/lflores-2022-06-14_21:37:13-rados:thrash-erasure-code-overwrites-main-distro-default-smithi/

0/100 tests failed due to the `check for active or peered` timeout
All failures were due to https://tracker.ceph.com/issues/51904

Actions

#11

Updated by Laura Flores almost 2 years ago

Related to Bug #51904: test_pool_min_size:AssertionError:wait_for_clean:failed before timeout expired due to down PGs added

Actions

#12

Updated by Sridhar Seshasayee almost 2 years ago

/a/yuriw-2022-06-15_18:29:33-rados-wip-yuri4-testing-2022-06-15-1000-pacific-distro-default-smithi/6881131

Description: rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/normal mon_election/connectivity msgr-failures/few rados recovery-overrides/{more-async-partial-recovery} supported-random-distro$/{centos_8} thrashers/minsize_recovery thrashosds-health workloads/ec-pool-snaps-few-objects-overwrites}

Actions

#13

Updated by Laura Flores almost 2 years ago

Sridhar Seshasayee wrote:

/a/yuriw-2022-06-15_18:29:33-rados-wip-yuri4-testing-2022-06-15-1000-pacific-distro-default-smithi/6881131

Description: rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/normal mon_election/connectivity msgr-failures/few rados recovery-overrides/{more-async-partial-recovery} supported-random-distro$/{centos_8} thrashers/minsize_recovery thrashosds-health workloads/ec-pool-snaps-few-objects-overwrites}

In this instance, 3 pgs went down and never left that state. This is where they first went down in the ceph.log:

/a/yuriw-2022-06-15_18:29:33-rados-wip-yuri4-testing-2022-06-15-1000-pacific-distro-default-smithi/6881131/remote/smithi074/log/ceph.log.gz

2022-06-15T19:25:47.132035+0000 mgr.y (mgr.4104) 31 : cluster [DBG] pgmap v60: 16 pgs: 2 activating+undersized+degraded+remapped, 1 active+recovering+undersized+degraded+remapped, 10 active+clean, 3 down; 119 MiB data, 223 MiB used, 630 GiB / 630 GiB avail; 17/174 objects degraded (9.770%); 2.2 MiB/s, 0 objects/s recovering

This one had only one PG that was down+remapped, but again, it never left that state.
/a/lflores-2022-06-14_21:33:15-rados:thrash-erasure-code-main-distro-default-smithi/6878849

2022-06-15T01:10:24.792612+0000 mgr.x (mgr.4114) 39 : cluster [DBG] pgmap v71: 16 pgs: 1 active+recovery_wait+undersized+degraded+remapped, 1 creating+peering, 1 down+remapped, 10 active+clean, 3 active+recovering+undersized+degraded+remapped; 1.2 GiB data, 1.7 GiB used, 628 GiB / 630 GiB avail; 8.9 MiB/s rd, 2.8 MiB/s wr, 30 op/s; 267/3288 objects degraded (8.120%); 18 MiB/s, 15 objects/s recovering

This one had two PGs where one was down and another was down+remapped. The main detail that seems to matter is that at least one PG is down, and it doesn't make progress:
/a/yuriw-2022-05-31_21:35:41-rados-wip-yuri2-testing-2022-05-31-1300-pacific-distro-default-smithi/6856269

2022-05-31T22:09:47.122557+0000 mgr.x (mgr.4112) 32 : cluster [DBG] pgmap v58: 16 pgs: 1 down, 1 down+remapped, 4 active+undersized+degraded, 10 active+clean; 144 MiB data, 241 MiB used, 630 GiB / 630 GiB avail; 2.8 MiB/s rd, 271 KiB/s wr, 2 op/s; 17/198 objects degraded (8.586%)

Actions

#14

Updated by Laura Flores almost 2 years ago

/a/lflores-2022-06-27_23:44:12-rados:thrash-erasure-code-wip-yuri2-testing-2022-04-26-1132-octopus-distro-default-smithi/6901540

Description: rados:thrash-erasure-code/{ceph clusters/{fixed-2 openstack} fast/fast msgr-failures/osd-delay objectstore/bluestore-bitmap rados recovery-overrides/{more-partial-recovery} supported-random-distro$/{centos_8} thrashers/minsize_recovery thrashosds-health workloads/ec-small-objects-balanced}

Actions

#15

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

Status changed from New to In Progress

Actions

#16

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

Status changed from In Progress to Fix Under Review
Pull request ID set to 47138

Actions

#17

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

I was able to reproduce the problem after modifying qa/tasks/ceph_manager.py: https://github.com/ceph/ceph/pull/46931/commits/1f6bcbb3d680d8589e498b993d2cf566480e2c3e.

Here are the runs that I reproduced with the modified test PR:
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921348
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921352
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921354
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921357
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921360
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921365
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921367
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921368
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921370
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921380
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921387

The Problem
When we kill and revive 2 OSDs consecutively that are both in a PGs acting set, the PG will go temporarily into a down state, however, it is waiting to recover some of its shards. test_pool_min_size:ceph_manager.all_active_or_peered() only allow 25 seconds in total for the recovery to finish, which is not enough.

Solution
Increase the timeout from 25 seconds to 100 seconds total in test_pool_min_size:ceph_manager.all_active_or_peered().

Actions

#18

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

Status changed from Fix Under Review to Pending Backport

Actions

#19

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

Backport changed from pacific,octopus,quincy to quincy,pacific

Actions

#20

Updated by Backport Bot over 1 year ago

Copied to Backport #57022: pacific: test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds added

Actions

#21

Updated by Backport Bot over 1 year ago

Copied to Backport #57023: octopus: test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds added

Actions

#22

Updated by Backport Bot over 1 year ago

Copied to Backport #57024: quincy: test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds added

Actions

#23

Updated by Backport Bot over 1 year ago

Tags set to backport_processed

Actions