Bug #59172


test_pool_min_size: AssertionError: wait_for_clean: failed before timeout expired due to down PGs

Added by Laura Flores about 1 year ago. Updated about 2 months ago.

Status:
Pending Backport
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
reef, quincy, pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/lflores-2023-03-27_02:17:31-rados-wip-aclamk-bs-elastic-shared-blob-save-25.03.2023-a-distro-default-smithi/7220990

2023-03-27T07:36:12.704 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '1.5', 'version': "781'981", 'reported_seq': 2565, 'reported_epoch': 985, 'state': 'active+undersized+degraded', 'last_fresh': '2023-03-27T07:36:02.849903+0000', 'last_change': '2023-03-27T07:15:53.567376+0000', 'last_active': '2023-03-27T07:36:02.849903+0000', 'last_peered': '2023-03-27T07:36:02.849903+0000', 'last_clean': '2023-03-27T07:15:50.823756+0000', 'last_became_active': '2023-03-27T07:15:53.567376+0000', 'last_became_peered': '2023-03-27T07:15:53.567376+0000', 'last_unstale': '2023-03-27T07:36:02.849903+0000', 'last_undegraded': '2023-03-27T07:15:53.555753+0000', 'last_fullsized': '2023-03-27T07:15:53.555539+0000', 'mapping_epoch': 162, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 17, 'last_epoch_clean': 143, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "105'164", 'last_scrub_stamp': '2023-03-27T07:14:56.445168+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2023-03-27T07:13:27.693605+0000', 'last_clean_scrub_stamp': '2023-03-27T07:14:56.445168+0000', 'objects_scrubbed': 15, 'log_size': 981, 'log_dups_size': 0, 'ondisk_log_size': 981, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 305, 'last_scrub_duration': 1, 'scrub_schedule': 'queued for scrub', 'scrub_duration': 0.012991803, 'objects_trimmed': 0, 'snaptrim_duration': 0.025207018, 'stat_sum': {'num_bytes': 232095744, 'num_objects': 156, 'num_object_clones': 151, 'num_object_copies': 936, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 156, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 156, 'num_whiteouts': 0, 'num_read': 383, 'num_read_kb': 890442, 'num_write': 1256, 'num_write_kb': 223520, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 57, 'num_bytes_recovered': 117489664, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [2147483647, 2, 3, 7, 5, 1], 'acting': [2147483647, 2, 3, 7, 5, 1], 'avail_no_missing': ['2(1)', '1(5)', '3(2)', '5(4)', '7(3)'], 'object_location_counts': [{'shards': '1(5),2(1),3(2),5(4),7(3)', 'objects': 156}], 'blocked_by': [], 'up_primary': 2, 'acting_primary': 2, 'purged_snaps': []}
2023-03-27T07:36:12.705 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ljflores_ceph_5e717292106ca2d310770101bfebb345837be8e1/qa/tasks/ceph_manager.py", line 190, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ljflores_ceph_5e717292106ca2d310770101bfebb345837be8e1/qa/tasks/ceph_manager.py", line 1401, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ljflores_ceph_5e717292106ca2d310770101bfebb345837be8e1/qa/tasks/ceph_manager.py", line 967, in test_pool_min_size
    self.ceph_manager.wait_for_clean(timeout=self.config.get('timeout'))
  File "/home/teuthworker/src/github.com_ljflores_ceph_5e717292106ca2d310770101bfebb345837be8e1/qa/tasks/ceph_manager.py", line 2747, in wait_for_clean
    assert time.time() - start < timeout, \
AssertionError: wait_for_clean: failed before timeout expired

Last attempt to fix was in https://tracker.ceph.com/issues/51904.
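For context, the assertion in the traceback comes from a polling loop of roughly this shape. The sketch below is only illustrative (the callables are hypothetical stand-ins for cluster queries), not the actual qa/tasks/ceph_manager.py code:

import time

def wait_for_clean(get_num_active_clean, get_num_pgs, timeout=1200):
    """Poll until every PG reports active+clean, asserting if `timeout`
    seconds elapse first. `get_num_active_clean` and `get_num_pgs` are
    hypothetical callables that would query the cluster state."""
    start = time.time()
    while get_num_active_clean() < get_num_pgs():
        # Same shape as the assertion in the traceback above: once the
        # elapsed time exceeds the timeout, the run fails.
        assert time.time() - start < timeout, \
            'wait_for_clean: failed before timeout expired'
        time.sleep(3)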


Related issues 6 (4 open, 2 closed)

Related to RADOS - Bug #51904: test_pool_min_size:AssertionError:wait_for_clean:failed before timeout expired due to down PGs - Resolved - Kamoltat (Junior) Sirivadhna
Related to RADOS - Bug #61762: PGs are stucked in creating+peering when starting up OSDs - New - Laura Flores
Related to RADOS - Bug #62400: test_pool_min_size: wait_for_clean passed with 0 PGs - New - Kamoltat (Junior) Sirivadhna
Copied to RADOS - Backport #64671: reef: test_pool_min_size: AssertionError: wait_for_clean: failed before timeout expired due to down PGs - New - Kamoltat (Junior) Sirivadhna
Copied to RADOS - Backport #64672: pacific: test_pool_min_size: AssertionError: wait_for_clean: failed before timeout expired due to down PGs - Rejected - Kamoltat (Junior) Sirivadhna
Copied to RADOS - Backport #64673: quincy: test_pool_min_size: AssertionError: wait_for_clean: failed before timeout expired due to down PGs - In Progress - Kamoltat (Junior) Sirivadhna
Actions #1

Updated by Laura Flores about 1 year ago

  • Related to Bug #51904: test_pool_min_size:AssertionError:wait_for_clean:failed before timeout expired due to down PGs added
Actions #2

Updated by Radoslaw Zarzynski about 1 year ago

  • Assignee set to Kamoltat (Junior) Sirivadhna

Would you mind taking a look as you already have the context?

Actions #3

Updated by Matan Breizman 10 months ago

  • Related to Bug #61762: PGs are stucked in creating+peering when starting up OSDs added
Actions #4

Updated by Matan Breizman 10 months ago

/a/yuriw-2023-07-18_14:02:17-rados-wip-yuri6-testing-2023-07-17-0838-reef-distro-default-smithi/7342927

Added a related tracker with a similar failure.

Actions #5

Updated by Neha Ojha 9 months ago

  • Backport set to reef
Actions #6

Updated by Laura Flores 9 months ago

/a/yuriw-2023-07-28_23:11:59-rados-reef-release-distro-default-smithi/7355972

Actions #7

Updated by Kamoltat (Junior) Sirivadhna 9 months ago

taking a look ...

Actions #8

Updated by Kamoltat (Junior) Sirivadhna 9 months ago

Update:

/a/yuriw-2023-07-28_23:11:59-rados-reef-release-distro-default-smithi/7355972/teuthology.log

Right before we hit the traceback from the timeout in `wait_for_clean`,
I noticed that the cluster was in a state where many PGs were stuck in undersized & degraded.
It turns out that for this 4+2 EC profile, 3 OSDs were thrashed, which exceeds the profile's maximum failure tolerance of 2.
That explains why the PGs never become active+clean.

2023-07-29T02:08:45.229 INFO:teuthology.orchestra.run.smithi052.stdout:{"fsid":"d0a271ca-83f0-4e1e-80a6-938776ac399f","health":{"status":"HEALTH_WARN","checks":{"OSDMAP_FLAGS":{"severity":"HEALTH_WARN","summary":{"message":"noscrub,nodeep-scrub flag(s) set","count":20},"muted":false},"PG_AVAILABILITY":{"severity":"HEALTH_WARN","summary":{"message":"Reduced data availability: 1 pg inactive","count":1},"muted":false},"PG_DEGRADED":{"severity":"HEALTH_WARN","summary":{"message":"Degraded data redundancy: 136/1134 objects degraded (11.993%), 7 pgs degraded, 8 pgs undersized","count":15},"muted":false}},"mutes":[]},"election_epoch":8,"quorum":[0,1,2],"quorum_names":["a","b","c"],"quorum_age":1455,"monmap":{"epoch":1,"min_mon_release_name":"reef","num_mons":3},"osdmap":{"epoch":681,"num_osds":8,"num_up_osds":5,"osd_up_since":1690595295,"num_in_osds":5,"osd_in_since":1690595301,"num_remapped_pgs":8},"pgmap":{"pgs_by_state":[{"state_name":"active+clean+remapped","count":8},{"state_name":"active+undersized+degraded","count":6},{"state_name":"undersized+degraded+peered","count":1},{"state_name":"active+undersized","count":1}],"num_pgs":16,"num_pools":1,"num_objects":189,"data_bytes":301154304,"bytes_used":1339781120,"bytes_avail":481844039680,"bytes_total":483183820800,"inactive_pgs_ratio":0.0625,"degraded_objects":136,"degraded_total":1134,"degraded_ratio":0.11992945326278658,"misplaced_objects":89,"misplaced_total":1134,"misplaced_ratio":0.078483245149911812},"fsmap":{"epoch":1,"by_rank":[],"up:standby":0},"mgrmap":{"available":true,"num_standbys":1,"modules":["iostat","nfs","restful"],"services":{}},"servicemap":{"epoch":16,"modified":"2023-07-29T02:08:10.694637+0000","services":{}},"progress_events":{}}
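A quick back-of-envelope check of that reasoning, using the numbers above (4+2 profile, 8 OSDs, 3 thrashed). The helper is purely illustrative and not part of the QA code:

def ec_pool_headroom(k, m, total_osds, osds_thrashed):
    """Cluster-level arithmetic for a k+m EC pool: a PG needs k+m distinct
    up OSDs to place every shard and reach active+clean, while k+1 up OSDs
    (min_size) is the floor test_pool_min_size drives the cluster towards."""
    osds_up = total_osds - osds_thrashed
    return {
        'osds_up': osds_up,
        'enough_for_clean': osds_up >= k + m,  # all k+m shards placeable
        'at_min_size': osds_up >= k + 1,       # can still serve I/O
    }

print(ec_pool_headroom(k=4, m=2, total_osds=8, osds_thrashed=3))
# {'osds_up': 5, 'enough_for_clean': False, 'at_min_size': True}
# 5 up OSDs can hold only 5 of the 6 shards, which matches the
# undersized+degraded PGs in the status dump above.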
Actions #9

Updated by Kamoltat (Junior) Sirivadhna 9 months ago

There are two failure cases in this tracker:

1. EC pool OSD thrashing with a 4+2 profile on 8 OSDs: we fail 3/8 OSDs, leaving us with k+1 OSDs (min_size). However, the cluster does not seem to actually recover (see the sketch at the end of this note).
/a/lflores-2023-03-27_02:17:31-rados-wip-aclamk-bs-elastic-shared-blob-save-25.03.2023-a-distro-default-smithi/7220990
/a/yuriw-2023-07-28_23:11:59-rados-reef-release-distro-default-smithi/7355972/teuthology.log
2. Normal replicated pool with replication size 3: we start up 8 OSDs and check that all OSDs are up and all PGs are active+clean. Here 1 PG is stuck in creating+peering throughout the 1200 seconds (20 minutes).

I think case 2 is not related to the OSD thrashing test, but rather to how PGs get stuck in `creating+peering` when starting up OSDs.
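For case 1, a minimal sketch of the scenario being exercised, under the assumption that the test takes OSDs down until only min_size = k+1 remain up; the helper name is hypothetical, not the real thrasher code in qa/tasks/ceph_manager.py:

import random

def pick_osds_to_kill(all_osds, k, m):
    """For a k+m EC pool, take down OSDs until only min_size = k+1 remain
    up; recovery to active+clean then requires the killed OSDs (or
    replacements) to come back, since a PG needs k+m OSDs for all shards."""
    min_size = k + 1
    max_to_kill = max(0, len(all_osds) - min_size)
    return random.sample(all_osds, max_to_kill)

# 4+2 profile on 8 OSDs: up to 3 OSDs may be killed, leaving 5 (= k+1) up.
print(len(pick_osds_to_kill(list(range(8)), k=4, m=2)))  # 3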

Actions #11

Updated by Kamoltat (Junior) Sirivadhna 9 months ago

Case 2 is basically related to https://tracker.ceph.com/issues/61762.

Actions #12

Updated by Kamoltat (Junior) Sirivadhna 9 months ago

For case 1, the rerun suggests that we can recover with k+1 OSDs; therefore, I think the problem has more to do with recovery than with over-thrashing the OSDs.

Actions #13

Updated by Radoslaw Zarzynski 9 months ago

  • Status changed from New to In Progress
Actions #14

Updated by Kamoltat (Junior) Sirivadhna 9 months ago

  • Related to Bug #62400: test_pool_min_size: wait_for_clean passed with 0 PGs added
Actions #15

Updated by Kamoltat (Junior) Sirivadhna 7 months ago

  • Status changed from In Progress to Fix Under Review
  • Backport changed from reef to reef, quincy, pacific
  • Pull request ID set to 52891
Actions #16

Updated by Kamoltat (Junior) Sirivadhna 3 months ago

/a/yuriw-2024-01-31_19:20:14-rados-wip-yuri3-testing-2024-01-29-1434-distro-default-smithi/7540755/

Actions #17

Updated by Radoslaw Zarzynski 3 months ago

Review in progress; waiting on Sam.

Actions #18

Updated by Kamoltat (Junior) Sirivadhna about 2 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #19

Updated by Backport Bot about 2 months ago

  • Copied to Backport #64671: reef: test_pool_min_size: AssertionError: wait_for_clean: failed before timeout expired due to down PGs added
Actions #20

Updated by Backport Bot about 2 months ago

  • Copied to Backport #64672: pacific: test_pool_min_size: AssertionError: wait_for_clean: failed before timeout expired due to down PGs added
Actions #21

Updated by Backport Bot about 2 months ago

  • Copied to Backport #64673: quincy: test_pool_min_size: AssertionError: wait_for_clean: failed before timeout expired due to down PGs added
Actions #22

Updated by Backport Bot about 2 months ago

  • Tags set to backport_processed