Project

General

Profile

Actions

Bug #57599

open

thrash-erasure-code: AssertionError: wait_for_recovery timeout due to "recovering+forced_recovery+undersized+remapped+peered" pg

Added by Laura Flores over 1 year ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2022-06-23_21:29:45-rados-wip-yuri4-testing-2022-06-22-1415-pacific-distro-default-smithi/6895209

Description: rados/thrash-erasure-code-big/{ceph cluster/{12-osds openstack} mon_election/connectivity msgr-failures/osd-dispatch-delay objectstore/filestore-xfs rados recovery-overrides/{more-async-partial-recovery} supported-random-distro$/{ubuntu_latest} thrashers/morepggrow thrashosds-health workloads/ec-rados-plugin=jerasure-k=4-m=2}

Looks like this one hit the "wait for recovery" assertion because a pg went into `forced_recovery`:

{
  "pgs_by_state": [
    {
      "state_name": "active+clean",
      "count": 407
    },
    {
      "state_name": "recovering+undersized+remapped+peered",
      "count": 6
    },
    {
      "state_name": "active+clean+snaptrim",
      "count": 1
    },
    {
      "state_name": "recovering+forced_recovery+undersized+remapped+peered",
      "count": 1
    }
  ],
  "num_pgs": 415,
  "num_pools": 3,
  "num_objects": 278,
  "data_bytes": 546573824,
  "bytes_used": 8648445952,
  "bytes_avail": 854885040128,
  "bytes_total": 863533486080,
  "inactive_pgs_ratio": 0.016867469996213913,
  "misplaced_objects": 117,
  "misplaced_total": 1668,
  "misplaced_ratio": 0.07014388489208633
}

Later in the teuthology log..

2022-06-23T22:55:42.575 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '3.f6', 'version': "1050'1105", 'reported_seq': 2239, 'reported_epoch': 2542, 'state': 'recovering+undersized+remapped+peered', 'last_fresh': '2022-06-23T22:55:36.783346+0000', 'last_change': '2022-06-23T22:13:23.574428+0000', 'last_active': '2022-06-23T22:11:26.264154+0000', 'last_peered': '2022-06-23T22:55:36.783346+0000', 'last_clean': '2022-06-23T22:08:24.299325+0000', 'last_became_active': '2022-06-23T22:07:20.871370+0000', 'last_became_peered': '2022-06-23T22:12:36.928103+0000', 'last_unstale': '2022-06-23T22:55:36.783346+0000', 'last_undegraded': '2022-06-23T22:55:36.783346+0000', 'last_fullsized': '2022-06-23T22:12:36.695006+0000', 'mapping_epoch': 1288, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 800, 'last_epoch_clean': 998, 'parent': '0.0', 'parent_split_bits': 8, 'last_scrub': "244'423", 'last_scrub_stamp': '2022-06-23T21:54:14.774033+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2022-06-23T21:50:29.365864+0000', 'last_clean_scrub_stamp': '2022-06-23T21:54:14.774033+0000', 'log_size': 1105, 'ondisk_log_size': 1105, 'stats_invalid': True, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 40, 'stat_sum': {'num_bytes': 18419712, 'num_objects': 4, 'num_object_clones': 3, 'num_object_copies': 24, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 40, 'num_objects_unfound': 0, 'num_objects_dirty': 4, 'num_whiteouts': 0, 'num_read': 57, 'num_read_kb': 52748, 'num_write': 117, 'num_write_kb': 31772, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 182, 'num_bytes_recovered': 43966464, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [10, 3, 9, 4, 5, 2], 'acting': [2147483647, 3, 2147483647, 4, 5, 2], 'avail_no_missing': ['3(1)', '1(2)', '4(4)', '7(0)', '8(5)', '9(3)'], 'object_location_counts': [{'shards': '1(2),3(1),4(4),7(0),8(5),9(3)', 'objects': 8}, {'shards': '2(5),3(1),4(3),5(4)', 'objects': -4}], 'blocked_by': [], 'up_primary': 10, 'acting_primary': 3, 'purged_snaps': []}
2022-06-23T22:55:42.579 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_b8c7a7fe6f1f3e950cd5e719a5cb21ef1c2a0433/qa/tasks/ceph_manager.py", line 188, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_b8c7a7fe6f1f3e950cd5e719a5cb21ef1c2a0433/qa/tasks/ceph_manager.py", line 1404, in _do_thrash
    timeout=self.config.get('timeout')
  File "/home/teuthworker/src/github.com_ceph_ceph-c_b8c7a7fe6f1f3e950cd5e719a5cb21ef1c2a0433/qa/tasks/ceph_manager.py", line 2809, in wait_for_recovery
    'wait_for_recovery: failed before timeout expired'
AssertionError: wait_for_recovery: failed before timeout expired

PG 3.ae went into `forced_recovery` state:

2022-06-23T22:55:42.572 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '3.ae', 'version': "1081'540", 'reported_seq': 2812, 'reported_epoch': 2542, 'state': 'recovering+forced_recovery+undersized+remapped+peered', 'last_fresh': '2022-06-23T22:55:36.378940+0000', 'last_change': '2022-06-23T22:17:16.147493+0000', 'last_active': '2022-06-23T22:09:22.746237+0000', 'last_peered': '2022-06-23T22:55:36.378940+0000', 'last_clean': '2022-06-23T22:09:01.856350+0000', 'last_became_active': '2022-06-23T22:09:13.359637+0000', 'last_became_peered': '2022-06-23T22:14:42.238691+0000', 'last_unstale': '2022-06-23T22:55:36.378940+0000', 'last_undegraded': '2022-06-23T22:55:36.378940+0000', 'last_fullsized': '2022-06-23T22:12:32.059228+0000', 'mapping_epoch': 1253, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 579, 'last_epoch_clean': 1005, 'parent': '0.0', 'parent_split_bits': 8, 'last_scrub': "171'179", 'last_scrub_stamp': '2022-06-23T21:53:00.560586+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2022-06-23T21:50:29.365864+0000', 'last_clean_scrub_stamp': '2022-06-23T21:53:00.560586+0000', 'log_size': 540, 'ondisk_log_size': 540, 'stats_invalid': True, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 25, 'stat_sum': {'num_bytes': 3735552, 'num_objects': 4, 'num_object_clones': 4, 'num_object_copies': 24, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 10, 'num_objects_unfound': 0, 'num_objects_dirty': 4, 'num_whiteouts': 0, 'num_read': 65, 'num_read_kb': 327557, 'num_write': 121, 'num_write_kb': 42260, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 53, 'num_bytes_recovered': 26013696, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [3, 4, 6, 7, 2, 9], 'acting': [3, 4, 2147483647, 7, 2147483647, 9], 'avail_no_missing': ['3(0)', '0(2)', '4(1)', '7(3)', '8(4)', '9(5)'], 'object_location_counts': [{'shards': '0(2),3(0),4(1),7(3),8(4),9(5)', 'objects': 5}, {'shards': '3(0),4(1),7(3),9(5)', 'objects': -1}], 'blocked_by': [], 'up_primary': 3, 'acting_primary': 3, 'purged_snaps': [{'start': '26f', 'length': '1'}, {'start': '271', 'length': '1'}, {'start': '276', 'length': '1'}, {'start': '279', 'length': '2'}, {'start': '27c', 'length': '2'}, {'start': '282', 'length': '1'}, {'start': '284', 'length': '1'}, {'start': '286', 'length': '1'}, {'start': '28c', 'length': '2'}, {'start': '28f', 'length': '1'}, {'start': '291', 'length': '1'}, {'start': '294', 'length': '1'}]}


Related issues 2 (2 open0 closed)

Related to RADOS - Bug #58289: "AssertionError: wait_for_recovery: failed before timeout expired" from down pg in pacific-p2p-pacificNew

Actions
Copied from RADOS - Bug #56149: thrash-erasure-code: AssertionError: wait_for_recovery timeout due to "active+recovering+undersized" pgNewLaura Flores

Actions
Actions #1

Updated by Laura Flores over 1 year ago

  • Copied from Bug #56149: thrash-erasure-code: AssertionError: wait_for_recovery timeout due to "active+recovering+undersized" pg added
Actions #2

Updated by Laura Flores over 1 year ago

  • Priority changed from Low to Normal
Actions #3

Updated by Laura Flores over 1 year ago

  • Related to Bug #58289: "AssertionError: wait_for_recovery: failed before timeout expired" from down pg in pacific-p2p-pacific added
Actions

Also available in: Atom PDF