Bug #57546

rados/thrash-erasure-code: wait_for_recovery timeout due to "active+clean+remapped+laggy" pgs

Added by Laura Flores over 1 year ago. Updated over 1 year ago.

Status: Fix Under Review
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When testing the Quincy RC for 17.2.4, we discovered this failure:

Description: rados/thrash-erasure-code/{ceph clusters/{fixed-2 openstack} fast/fast mon_election/classic msgr-failures/osd-delay objectstore/bluestore-comp-zstd rados recovery-overrides/{more-async-partial-recovery} supported-random-distro$/{rhel_8} thrashers/default thrashosds-health workloads/ec-rados-plugin=clay-k=4-m=2}

Dead job: http://pulpito.front.sepia.ceph.com/yuriw-2022-09-10_14:05:53-rados-quincy-release-distro-default-smithi/7024541/

PG states right before assertion:

"pgs_by_state": [
  {
    "state_name": "active+clean+remapped",
    "count": 24
  },
  {
    "state_name": "active+remapped+backfill_wait",
    "count": 10
  },
  {
    "state_name": "active+clean",
    "count": 9
  },
  {
    "state_name": "active+clean+remapped+laggy",
    "count": 4
  },
  {
    "state_name": "active+remapped+backfill_wait+laggy",
    "count": 3
  },
  {
    "state_name": "active+remapped+backfill_wait+backfill_toofull",
    "count": 3
  },
  {
    "state_name": "active+remapped+backfilling",
    "count": 2
  }
]

Assertion:

2022-09-14T16:11:25.475 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '3.f', 'version': "720'927", 'reported_seq': 4753, 'reported_epoch': 1260, 'state': 'active+clean+remapped', 'last_fresh': '2022-09-14T16:11:15.739084+0000', 'last_change': '2022-09-14T15:50:58.307958+0000', 'last_active': '2022-09-14T16:11:15.739084+0000', 'last_peered': '2022-09-14T16:11:15.739084+0000', 'last_clean': '2022-09-14T16:11:15.739084+0000', 'last_became_active': '2022-09-14T15:50:58.305711+0000', 'last_became_peered': '2022-09-14T15:50:58.305711+0000', 'last_unstale': '2022-09-14T16:11:15.739084+0000', 'last_undegraded': '2022-09-14T16:11:15.739084+0000', 'last_fullsized': '2022-09-14T16:11:15.739084+0000', 'mapping_epoch': 767, 'log_start': "374'504", 'ondisk_log_start': "374'504", 'created': 20, 'last_epoch_clean': 768, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "622'880", 'last_scrub_stamp': '2022-09-14T15:47:26.745379+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2022-09-14T15:36:55.663066+0000', 'last_clean_scrub_stamp': '2022-09-14T15:47:26.745379+0000', 'objects_scrubbed': 14, 'log_size': 423, 'ondisk_log_size': 423, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'last_scrub_duration': 1, 'scrub_schedule': 'queued for scrub', 'scrub_duration': 0.605125007, 'objects_trimmed': 1, 'snaptrim_duration': 0.073652501, 'stat_sum': {'num_bytes': 16678912, 'num_objects': 9, 'num_object_clones': 5, 'num_object_copies': 54, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 27, 'num_objects_unfound': 0, 'num_objects_dirty': 9, 'num_whiteouts': 1, 'num_read': 10, 'num_read_kb': 34515, 'num_write': 44, 'num_write_kb': 880, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 18, 'num_bytes_recovered': 35438592, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [7, 2147483647, 2147483647, 2147483647, 5, 2], 'acting': [7, 4, 0, 1, 5, 2], 'avail_no_missing': ['7(0)', '0(2)', '1(3)', '2(5)', '4(1)', '5(4)'], 'object_location_counts': [{'shards': '0(2),1(3),2(5),4(1),5(4),7(0)', 'objects': 9}], 'blocked_by': [], 'up_primary': 7, 'acting_primary': 7, 'purged_snaps': [{'start': '1bf', 'length': '1'}, {'start': '1c2', 'length': '1'}, {'start': '1c4', 'length': '1'}, {'start': '1cc', 'length': '1'}, {'start': '1d3', 'length': '2'}, {'start': '1db', 'length': '2'}]}
2022-09-14T16:11:25.476 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_f860b69d9ecbf6a250d17b83c1778c894700357d/qa/tasks/ceph_manager.py", line 198, in wrapper
    return func(self)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_f860b69d9ecbf6a250d17b83c1778c894700357d/qa/tasks/ceph_manager.py", line 1423, in _do_thrash
    timeout=self.config.get('timeout')
  File "/home/teuthworker/src/git.ceph.com_ceph-c_f860b69d9ecbf6a250d17b83c1778c894700357d/qa/tasks/ceph_manager.py", line 2850, in wait_for_recovery
    'wait_for_recovery: failed before timeout expired'
AssertionError: wait_for_recovery: failed before timeout expired

The same test passed in a rerun.

Since the failure is intermittent, we found that running the test 50 times is a reliable reproducer, as seen in this run:
http://pulpito.front.sepia.ceph.com/lflores-2022-09-14_15:11:39-rados-quincy-release-distro-default-smithi/

We suspected https://github.com/ceph/ceph/pull/47901 to be the culprit, so we reran the reproducer with that PR's two commits reverted:
http://pulpito.front.sepia.ceph.com/lflores-2022-09-14_17:02:02-rados-wip-lflores-testing-quincy-release-distro-default-smithi/

All 50 tests passed with the commits from https://github.com/ceph/ceph/pull/47901 reverted.

So far, this bug has only been observed in Quincy (and once on Pacific, but the affected test branch included the Pacific backport of https://github.com/ceph/ceph/pull/47555, which hasn't been merged yet). It's possible that this bug also exists in main, since the original PR has been merged there.


Related issues: 1 (0 open, 1 closed)

Related to RADOS - Bug #53806: unessesarily long laggy PG state (Resolved)

Actions #1

Updated by Neha Ojha over 1 year ago

  • Description updated (diff)
Actions #2

Updated by Laura Flores over 1 year ago

Quincy revert PR https://github.com/ceph/ceph/pull/48104

Not sure if we want to mark this as a "fix".

Actions #3

Updated by Laura Flores over 1 year ago

  • Related to Bug #53806: unessesarily long laggy PG state added
Actions #4

Updated by Laura Flores over 1 year ago

Running the reproducer to see whether this bug also occurs on main:
http://pulpito.front.sepia.ceph.com/lflores-2022-09-15_14:32:14-rados:thrash-erasure-code-main-distro-default-smithi/ (20/50 failed)

Actions #5

Updated by Laura Flores over 1 year ago

/a/yuriw-2022-09-14_13:16:11-rados-wip-yuri6-testing-2022-09-13-1352-distro-default-smithi/7032356

Actions #6

Updated by Aishwarya Mathuria over 1 year ago

Thanks for the reproducer, Laura. I'm looking into the failures.

Actions #7

Updated by Laura Flores over 1 year ago

/a/yuriw-2022-09-29_16:44:24-rados-wip-lflores-testing-distro-default-smithi/7048304
/a/lflores-2022-09-30_21:47:41-rados-wip-lflores-testing-distro-default-smithi/7050796

Actions #8

Updated by Radoslaw Zarzynski over 1 year ago

  • Assignee set to Aishwarya Mathuria

I think this is the fix (for https://tracker.ceph.com/issues/53806) that got reverted in Quincy but is still in main. We should at least revert it there too.

Actions #9

Updated by Aishwarya Mathuria over 1 year ago

I was able to reproduce this using the test Laura mentioned above: http://pulpito.front.sepia.ceph.com/amathuri-2022-10-16_17:40:13-rados-testing-laggy-distro-default-smithi/ (15/50 dead jobs).
I suspect one of the commits in the PR is causing this issue: https://github.com/ceph/ceph/pull/47901/commits/0d07b406dc2f854363f7ae9b970e980400f4f03e
I tried a run without this commit and all tests passed: http://pulpito.front.sepia.ceph.com/amathuri-2022-10-18_07:49:44-rados-testing-laggy-distro-default-smithi/

Actions #10

Updated by Radoslaw Zarzynski over 1 year ago

If https://github.com/ceph/ceph/pull/47901/commits/0d07b406dc2f854363f7ae9b970e980400f4f03e is the actual culprit, this means we could have a far more severe issue: duplicated entries in the acting set. How about placing an assert there to verify that the items of the acting set are unique? We should backport it (just to main).
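
For illustration, here is a minimal sketch of the kind of uniqueness check suggested above, assuming the acting set is exposed as a vector of OSD ids and that absent EC shards are marked with a placeholder id (2147483647, as seen in the 'up' set of the PG dump in the description). The function and constant names are hypothetical, not actual Ceph code.

// Illustrative sketch only; not the actual Ceph PeeringState code.
#include <cassert>
#include <set>
#include <vector>

// Placeholder id used for an absent shard in this sketch; the PG dump in the
// description shows 2147483647 in the 'up' set for missing shards.
constexpr int ABSENT_SHARD = 0x7fffffff;

// Assert that no real OSD id appears more than once in the acting set.
void assert_acting_unique(const std::vector<int>& acting) {
  std::set<int> seen;
  for (int osd : acting) {
    if (osd == ABSENT_SHARD) {
      continue;  // absent shards may legitimately repeat the placeholder
    }
    // insert().second is false if the id was already present,
    // i.e. the acting set contains a duplicate OSD id.
    assert(seen.insert(osd).second);
  }
}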

Actions #11

Updated by Radoslaw Zarzynski over 1 year ago

As this is about EC: can the acting set's items be duplicated?

Actions #12

Updated by Aishwarya Mathuria over 1 year ago

Thanks for taking a look, Radek! That's a good point, since we are only seeing this issue with the rados/thrash-erasure-code test. I'll look into it.

Actions #13

Updated by Radoslaw Zarzynski over 1 year ago

Yeah, worth looking into, though the msgr encode issue has the priority.

Actions #14

Updated by Joshua Baergen over 1 year ago

If there are indeed duplicated entries in the acting set, should there be a 'break' at all in this loop? It seems like breaking out might leave acting_readable_until_ub too low for one of the duplicated entries, i.e., require multiple lease renewals in order to update all of the acting_readable_until_ub entries.
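
To make the concern concrete, here is a simplified, hypothetical rendering of such a loop; it is not the actual lease-handling code. Only `acting` and `acting_readable_until_ub` are names taken from the discussion above; everything else is invented for illustration.

// Illustrative sketch only; a simplified stand-in for the lease-ack loop
// discussed above, not the actual Ceph code.
#include <vector>

struct PeerLease {
  int osd;                   // OSD id of an acting-set member
  double readable_until_ub;  // stand-in for the acting_readable_until_ub entry
};

// On receiving a lease ack from `from_osd`, refresh its upper bound.
void on_lease_ack(std::vector<PeerLease>& acting, int from_osd, double new_ub) {
  for (auto& p : acting) {
    if (p.osd == from_osd) {
      p.readable_until_ub = new_ub;
      break;  // if from_osd appeared twice in acting, the second entry would
              // keep its old (too low) upper bound until a later renewal
    }
  }
}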

Actions #15

Updated by Radoslaw Zarzynski over 1 year ago

I think the invariant here is that the acting container should not have duplicates. If it is broken, we have a more severe issue.

Actions #17

Updated by Laura Flores over 1 year ago

/a/yuriw-2022-12-07_15:48:38-rados-wip-yuri3-testing-2022-12-06-1211-distro-default-smithi/7106890

Actions #18

Updated by Laura Flores over 1 year ago

@Radoslaw Smigielski it's also still in main

Actions #19

Updated by Radoslaw Zarzynski over 1 year ago

I guess in main we should revert the opposite commit, as both are there.

Actions #20

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 49537