Bug #57546

Updated by Neha Ojha over 1 year ago

When testing the Quincy RC for 17.2.4, we discovered this failure: 

 Description: rados/thrash-erasure-code/{ceph clusters/{fixed-2 openstack} fast/fast mon_election/classic msgr-failures/osd-delay objectstore/bluestore-comp-zstd rados recovery-overrides/{more-async-partial-recovery} supported-random-distro$/{rhel_8} thrashers/default thrashosds-health workloads/ec-rados-plugin=clay-k=4-m=2} 

 Dead job: http://pulpito.front.sepia.ceph.com/yuriw-2022-09-10_14:05:53-rados-quincy-release-distro-default-smithi/7024541/ 

 PG states right before assertion: 
 <pre><code class="text"> 
 "pgs_by_state": [ 
   { 
     "state_name": "active+clean+remapped", 
     "count": 24 
   }, 
   { 
     "state_name": "active+remapped+backfill_wait", 
     "count": 10 
   }, 
   { 
     "state_name": "active+clean", 
     "count": 9 
   }, 
   { 
     "state_name": "active+clean+remapped+laggy", 
     "count": 4 
   }, 
   { 
     "state_name": "active+remapped+backfill_wait+laggy", 
     "count": 3 
   }, 
   { 
     "state_name": "active+remapped+backfill_wait+backfill_toofull", 
     "count": 3 
   }, 
   { 
     "state_name": "active+remapped+backfilling", 
     "count": 2 
   } 
 ] 
 </code></pre> 
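 For reference, the "pgs_by_state" summary above is the kind of snapshot that can be pulled from the cluster status directly; below is a minimal sketch, assuming a reachable cluster and the ceph CLI on PATH (the teuthology run collects this through its own manager, not a helper like this):
 <pre><code class="python">
import json
import subprocess

def pgs_by_state():
    """Return the pgs_by_state list from `ceph status --format json`.

    Minimal sketch: assumes a reachable cluster and the ceph CLI on PATH.
    """
    out = subprocess.check_output(["ceph", "status", "--format", "json"])
    return json.loads(out)["pgmap"]["pgs_by_state"]

if __name__ == "__main__":
    for entry in pgs_by_state():
        print(f"{entry['state_name']}: {entry['count']}")
 </code></pre>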

 Assertion: 
 <pre><code class="text"> 
 2022-09-14T16:11:25.475 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '3.f', 'version': "720'927", 'reported_seq': 4753, 'reported_epoch': 1260, 'state': 'active+clean+remapped', 'last_fresh': '2022-09-14T16:11:15.739084+0000', 'last_change': '2022-09-14T15:50:58.307958+0000', 'last_active': '2022-09-14T16:11:15.739084+0000', 'last_peered': '2022-09-14T16:11:15.739084+0000', 'last_clean': '2022-09-14T16:11:15.739084+0000', 'last_became_active': '2022-09-14T15:50:58.305711+0000', 'last_became_peered': '2022-09-14T15:50:58.305711+0000', 'last_unstale': '2022-09-14T16:11:15.739084+0000', 'last_undegraded': '2022-09-14T16:11:15.739084+0000', 'last_fullsized': '2022-09-14T16:11:15.739084+0000', 'mapping_epoch': 767, 'log_start': "374'504", 'ondisk_log_start': "374'504", 'created': 20, 'last_epoch_clean': 768, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "622'880", 'last_scrub_stamp': '2022-09-14T15:47:26.745379+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2022-09-14T15:36:55.663066+0000', 'last_clean_scrub_stamp': '2022-09-14T15:47:26.745379+0000', 'objects_scrubbed': 14, 'log_size': 423, 'ondisk_log_size': 423, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'last_scrub_duration': 1, 'scrub_schedule': 'queued for scrub', 'scrub_duration': 0.605125007, 'objects_trimmed': 1, 'snaptrim_duration': 0.073652501, 'stat_sum': {'num_bytes': 16678912, 'num_objects': 9, 'num_object_clones': 5, 'num_object_copies': 54, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 27, 'num_objects_unfound': 0, 'num_objects_dirty': 9, 'num_whiteouts': 1, 'num_read': 10, 'num_read_kb': 34515, 'num_write': 44, 'num_write_kb': 880, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 18, 'num_bytes_recovered': 35438592, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [7, 2147483647, 2147483647, 2147483647, 5, 2], 'acting': [7, 4, 0, 1, 5, 2], 'avail_no_missing': ['7(0)', '0(2)', '1(3)', '2(5)', '4(1)', '5(4)'], 'object_location_counts': [{'shards': '0(2),1(3),2(5),4(1),5(4),7(0)', 'objects': 9}], 'blocked_by': [], 'up_primary': 7, 'acting_primary': 7, 'purged_snaps': [{'start': '1bf', 'length': '1'}, {'start': '1c2', 'length': '1'}, {'start': '1c4', 'length': '1'}, {'start': '1cc', 'length': '1'}, {'start': '1d3', 'length': '2'}, {'start': '1db', 'length': '2'}]} 
 2022-09-14T16:11:25.476 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last): 
   File "/home/teuthworker/src/git.ceph.com_ceph-c_f860b69d9ecbf6a250d17b83c1778c894700357d/qa/tasks/ceph_manager.py", line 198, in wrapper 
     return func(self) 
   File "/home/teuthworker/src/git.ceph.com_ceph-c_f860b69d9ecbf6a250d17b83c1778c894700357d/qa/tasks/ceph_manager.py", line 1423, in _do_thrash 
     timeout=self.config.get('timeout') 
   File "/home/teuthworker/src/git.ceph.com_ceph-c_f860b69d9ecbf6a250d17b83c1778c894700357d/qa/tasks/ceph_manager.py", line 2850, in wait_for_recovery 
     'wait_for_recovery: failed before timeout expired' 
 AssertionError: wait_for_recovery: failed before timeout expired 
 </code></pre> 
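 For context, wait_for_recovery in qa/tasks/ceph_manager.py polls the PG stats in a loop and raises the assertion above if the cluster has not finished recovering within the configured timeout. A minimal sketch of that polling pattern (not the actual implementation; pg_states is a hypothetical callable standing in for the manager's PG dump):
 <pre><code class="python">
import time

def wait_for_recovery(pg_states, timeout=300, poll_interval=5):
    """Block until every PG looks recovered, or raise on timeout.

    Sketch only: pg_states is a hypothetical callable returning the
    current list of PG state strings; the real check in
    qa/tasks/ceph_manager.py works from the full PG dump.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Treat the cluster as recovered once every PG is active and
        # none is still recovering or backfilling.
        if all("active" in s and "recover" not in s and "backfill" not in s
               for s in pg_states()):
            return
        time.sleep(poll_interval)
    raise AssertionError("wait_for_recovery: failed before timeout expired")
 </code></pre>
 In the dead job, several PGs were still in backfill_wait and backfilling states when the timeout expired, which is what tripped the assertion.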

 The same test passed in a rerun. 

 Since the failure is intermittent, running the test 50 times proved to be a reliable reproducer, as seen in this run: 
 http://pulpito.front.sepia.ceph.com/lflores-2022-09-14_15:11:39-rados-quincy-release-distro-default-smithi/ 
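 Such reruns can be scheduled with teuthology-suite's --num option; a hedged sketch of one way to do it (the suite, branch, filter, and machine type below are illustrative assumptions, not the exact command behind the linked run):
 <pre><code class="python">
import subprocess

# Illustrative only: schedule the failing rados job 50 times via
# teuthology-suite --num.  Suite, branch, filter, and machine type
# here are assumptions, not the exact invocation used for this run.
subprocess.run([
    "teuthology-suite",
    "--suite", "rados",
    "--ceph", "quincy-release",
    "--machine-type", "smithi",
    "--num", "50",
    "--filter", "thrash-erasure-code/ec-rados-plugin=clay-k=4-m=2",
], check=True)
 </code></pre>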

 We suspected https://github.com/ceph/ceph/pull/47901 and https://github.com/ceph/ceph/pull/47555 to be the culprits, so we reran the reproducer with the commits from those two PRs reverted: 
 http://pulpito.front.sepia.ceph.com/lflores-2022-09-14_17:02:02-rados-wip-lflores-testing-quincy-release-distro-default-smithi/ 

 All 50 tests passed with the commits from https://github.com/ceph/ceph/pull/47901 and https://github.com/ceph/ceph/pull/47555 reverted. 

 So far, this bug has only been observed in Quincy (and once on Pacific, but the affected test branch included the Pacific backport of https://github.com/ceph/ceph/pull/47555, which hasn't been merged yet). It's possible that this bug also exists in main, since the corresponding PR has been merged there.
