Bug #57546
rados/thrash-erasure-code: wait_for_recovery timeout due to "active+clean+remapped+laggy" pgs
Description
When testing the Quincy RC for 17.2.4, we discovered this failure:
Description: rados/thrash-erasure-code/{ceph clusters/{fixed-2 openstack} fast/fast mon_election/classic msgr-failures/osd-delay objectstore/bluestore-comp-zstd rados recovery-overrides/{more-async-partial-recovery} supported-random-distro$/{rhel_8} thrashers/default thrashosds-health workloads/ec-rados-plugin=clay-k=4-m=2}
PG states right before assertion:
"pgs_by_state": [
{
"state_name": "active+clean+remapped",
"count": 24
},
{
"state_name": "active+remapped+backfill_wait",
"count": 10
},
{
"state_name": "active+clean",
"count": 9
},
{
"state_name": "active+clean+remapped+laggy",
"count": 4
},
{
"state_name": "active+remapped+backfill_wait+laggy",
"count": 3
},
{
"state_name": "active+remapped+backfill_wait+backfill_toofull",
"count": 3
},
{
"state_name": "active+remapped+backfilling",
"count": 2
}
]
Assertion:
2022-09-14T16:11:25.475 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '3.f', 'version': "720'927", 'reported_seq': 4753, 'reported_epoch': 1260, 'state': 'active+clean+remapped', 'last_fresh': '2022-09-14T16:11:15.739084+0000', 'last_change': '2022-09-14T15:50:58.307958+0000', 'last_active': '2022-09-14T16:11:15.739084+0000', 'last_peered': '2022-09-14T16:11:15.739084+0000', 'last_clean': '2022-09-14T16:11:15.739084+0000', 'last_became_active': '2022-09-14T15:50:58.305711+0000', 'last_became_peered': '2022-09-14T15:50:58.305711+0000', 'last_unstale': '2022-09-14T16:11:15.739084+0000', 'last_undegraded': '2022-09-14T16:11:15.739084+0000', 'last_fullsized': '2022-09-14T16:11:15.739084+0000', 'mapping_epoch': 767, 'log_start': "374'504", 'ondisk_log_start': "374'504", 'created': 20, 'last_epoch_clean': 768, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "622'880", 'last_scrub_stamp': '2022-09-14T15:47:26.745379+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2022-09-14T15:36:55.663066+0000', 'last_clean_scrub_stamp': '2022-09-14T15:47:26.745379+0000', 'objects_scrubbed': 14, 'log_size': 423, 'ondisk_log_size': 423, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'last_scrub_duration': 1, 'scrub_schedule': 'queued for scrub', 'scrub_duration': 0.605125007, 'objects_trimmed': 1, 'snaptrim_duration': 0.073652501, 'stat_sum': {'num_bytes': 16678912, 'num_objects': 9, 'num_object_clones': 5, 'num_object_copies': 54, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 27, 'num_objects_unfound': 0, 'num_objects_dirty': 9, 'num_whiteouts': 1, 'num_read': 10, 'num_read_kb': 34515, 'num_write': 44, 'num_write_kb': 880, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 18, 'num_bytes_recovered': 35438592, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [7, 2147483647, 2147483647, 2147483647, 5, 2], 'acting': [7, 4, 0, 1, 5, 2], 'avail_no_missing': ['7(0)', '0(2)', '1(3)', '2(5)', '4(1)', '5(4)'], 'object_location_counts': [{'shards': '0(2),1(3),2(5),4(1),5(4),7(0)', 'objects': 9}], 'blocked_by': [], 'up_primary': 7, 'acting_primary': 7, 'purged_snaps': [{'start': '1bf', 'length': '1'}, {'start': '1c2', 'length': '1'}, {'start': '1c4', 'length': '1'}, {'start': '1cc', 'length': '1'}, {'start': '1d3', 'length': '2'}, {'start': '1db', 'length': '2'}]}
2022-09-14T16:11:25.476 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_ceph-c_f860b69d9ecbf6a250d17b83c1778c894700357d/qa/tasks/ceph_manager.py", line 198, in wrapper
return func(self)
File "/home/teuthworker/src/git.ceph.com_ceph-c_f860b69d9ecbf6a250d17b83c1778c894700357d/qa/tasks/ceph_manager.py", line 1423, in _do_thrash
timeout=self.config.get('timeout')
File "/home/teuthworker/src/git.ceph.com_ceph-c_f860b69d9ecbf6a250d17b83c1778c894700357d/qa/tasks/ceph_manager.py", line 2850, in wait_for_recovery
'wait_for_recovery: failed before timeout expired'
AssertionError: wait_for_recovery: failed before timeout expired
The same test passed in a rerun.
Since the failure is intermittent, we found that running the test 50 times is a reliable reproducer, as seen in:
http://pulpito.front.sepia.ceph.com/lflores-2022-09-14_15:11:39-rados-quincy-release-distro-default-smithi/
We suspected https://github.com/ceph/ceph/pull/47901 to be the culprit, so we reran the reproducer with that PR's two commits reverted:
http://pulpito.front.sepia.ceph.com/lflores-2022-09-14_17:02:02-rados-wip-lflores-testing-quincy-release-distro-default-smithi/
All 50 tests passed with the commits from https://github.com/ceph/ceph/pull/47901 reverted.
So far, this bug has only been observed in Quincy (and once on Pacific, but the affected test branch included the Pacific backport of https://github.com/ceph/ceph/pull/47555, which hasn't been merged yet). It's possible that this bug also exists in main, since the original PR has merged there.
Updated by Laura Flores over 1 year ago
Quincy revert PR https://github.com/ceph/ceph/pull/48104
Not sure if we want to put this as a "fix".
Updated by Laura Flores over 1 year ago
- Related to Bug #53806: unnecessarily long laggy PG state added
Updated by Laura Flores over 1 year ago
Running the reproducer to see whether this bug also occurs on main:
http://pulpito.front.sepia.ceph.com/lflores-2022-09-15_14:32:14-rados:thrash-erasure-code-main-distro-default-smithi/ (20/50 failed)
Updated by Laura Flores over 1 year ago
/a/yuriw-2022-09-14_13:16:11-rados-wip-yuri6-testing-2022-09-13-1352-distro-default-smithi/7032356
Updated by Aishwarya Mathuria over 1 year ago
Thanks for the reproducer, Laura. I'm looking into the failures.
Updated by Laura Flores over 1 year ago
/a/yuriw-2022-09-29_16:44:24-rados-wip-lflores-testing-distro-default-smithi/7048304
/a/lflores-2022-09-30_21:47:41-rados-wip-lflores-testing-distro-default-smithi/7050796
Updated by Radoslaw Zarzynski over 1 year ago
- Assignee set to Aishwarya Mathuria
I think this is the fix for https://tracker.ceph.com/issues/53806 that got reverted in quincy, but it's still in main. We should at least revert it there too.
Updated by Aishwarya Mathuria over 1 year ago
I was able to reproduce this using the test Laura mentioned above - http://pulpito.front.sepia.ceph.com/amathuri-2022-10-16_17:40:13-rados-testing-laggy-distro-default-smithi/ (15/50 dead jobs)
I suspect one of the commits in the PR is causing this issue: https://github.com/ceph/ceph/pull/47901/commits/0d07b406dc2f854363f7ae9b970e980400f4f03e
Tried a test without this commit and all tests passed - http://pulpito.front.sepia.ceph.com/amathuri-2022-10-18_07:49:44-rados-testing-laggy-distro-default-smithi/
Updated by Radoslaw Zarzynski over 1 year ago
If https://github.com/ceph/ceph/pull/47901/commits/0d07b406dc2f854363f7ae9b970e980400f4f03e is the actual culprit, this means we may have a far more severe issue: duplicated entries in the acting set. How about placing an assert there to verify the items of the acting set are unique? We should backport it (just to main).
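For illustration, a minimal sketch of such a check. This is not the actual Ceph code; assert_acting_unique is a hypothetical helper, though CRUSH_ITEM_NONE (0x7fffffff) is the real placeholder for an empty slot, visible as 2147483647 in the 'up' set logged above.

#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

constexpr int32_t CRUSH_ITEM_NONE = 0x7fffffff;  // 2147483647, "no OSD in this slot"

// Hypothetical helper: assert that no real OSD id occurs twice in the acting set.
inline void assert_acting_unique(const std::vector<int32_t>& acting) {
  std::set<int32_t> seen;
  for (int32_t osd : acting) {
    if (osd == CRUSH_ITEM_NONE)
      continue;  // empty slots may legitimately repeat, especially for EC PGs
    // set::insert returns {iterator, inserted}; a false 'inserted' means a duplicate
    assert(seen.insert(osd).second && "duplicate OSD id in acting set");
  }
}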
Updated by Radoslaw Zarzynski over 1 year ago
As this is about EC: can the acting set's items be duplicated?
Updated by Aishwarya Mathuria over 1 year ago
Thanks for taking a look, Radek! That's a good point, since we are seeing this issue with the rados/thrash-erasure-code tests only. I'll look into it.
Updated by Radoslaw Zarzynski over 1 year ago
Yeah, worth looking into; the msgr encode issue has the priority.
Updated by Joshua Baergen over 1 year ago
If there are indeed duplicated entries in the acting set, should there be a 'break' at all in this loop? It seems like breaking out might leave acting_readable_until_ub too low for one of the duplicated entries, i.e., require multiple lease renewals in order to update all of the acting_readable_until_ub entries.
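To illustrate the concern, here is a minimal sketch modeled on the lease-ack loop being discussed (simplified, not the actual PeeringState code): acting_readable_until_ub holds one lower bound per acting-set slot, updated when the OSD at that slot acks a lease, and an early break would skip any later duplicate slot.

#include <cstdint>
#include <vector>

using osd_id_t = int32_t;

struct LeaseLoopSketch {
  std::vector<osd_id_t> acting;                  // acting set, one OSD id per shard slot
  std::vector<double> acting_readable_until_ub;  // readable-until upper bound per slot

  // Simplified stand-in for the lease-ack handling under discussion.
  void proc_lease_ack(osd_id_t from, double readable_until_ub) {
    for (size_t i = 0; i < acting.size(); ++i) {
      if (acting[i] == from) {
        acting_readable_until_ub[i] = readable_until_ub;
        break;  // if 'from' occupied two slots, the second slot would keep its
                // stale (lower) bound until a later lease renewal
      }
    }
  }
};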
Updated by Radoslaw Zarzynski over 1 year ago
I think the invariant here is that the acting container should not have duplicates. If it is broken, we have a more severe issue.
Updated by Radoslaw Zarzynski over 1 year ago
Sent a PR for quincy: https://github.com/ceph/ceph/pull/49304.
Updated by Laura Flores over 1 year ago
/a/yuriw-2022-12-07_15:48:38-rados-wip-yuri3-testing-2022-12-06-1211-distro-default-smithi/7106890
Updated by Laura Flores over 1 year ago
@Radoslaw Smigielski it's also still in main
Updated by Radoslaw Zarzynski over 1 year ago
I guess in main we should revert the opposite commit, as both of them are there.
Updated by Radoslaw Zarzynski over 1 year ago
- Status changed from New to Fix Under Review
- Pull request ID set to 49537