Bug #51904
Closed
test_pool_min_size:AssertionError:wait_for_clean:failed before timeout expired due to down PGs
Description
2021-07-25T15:18:21.688 INFO:tasks.ceph.ceph_manager.ceph:PG 1.b is not active+clean 2021-07-25T15:18:21.688 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '1.b', 'version': "22'3", 'reported_seq': 475, 'reported_epoch': 478, 'state': 'down', 'last_fresh': '2021-07-25T15:18:15.848533+0000', 'last_change': '2021-07-25T14:58:09.960725+0000', 'last_active': '2021-07-25T14:58:09.960471+0000', 'last_peered': '2021-07-25T14:58:01.054827+0000', 'last_clean': '2021-07-25T14:58:01.054827+0000', 'last_became_active': '2021-07-25T14:58:00.955866+0000', 'last_became_peered': '2021-07-25T14:58:00.955866+0000', 'last_unstale': '2021-07-25T15:18:15.848533+0000', 'last_undegraded': '2021-07-25T15:18:15.848533+0000', 'last_fullsized': '2021-07-25T15:18:15.848533+0000', 'mapping_epoch': 31, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 18, 'last_epoch_clean': 22, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2021-07-25T14:57:55.882472+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2021-07-25T14:57:55.882472+0000', 'last_clean_scrub_stamp': '2021-07-25T14:57:55.882472+0000', 'log_size': 3, 'ondisk_log_size': 3, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'stat_sum': {'num_bytes': 2554631, 'num_objects': 1, 'num_object_clones': 0, 'num_object_copies': 3, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 1, 'num_whiteouts': 0, 'num_read': 1, 'num_read_kb': 0, 'num_write': 4, 'num_write_kb': 1011, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 0, 'num_bytes_recovered': 0, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 
'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [2, 0, 5], 'acting': [2, 0, 5], 'avail_no_missing': [], 'object_location_counts': [], 'blocked_by': [1], 'up_primary': 0, 'acting_primary': 0, 'purged_snaps': [{'start': '1', 'length': '1'}]} 2021-07-25T15:18:21.688 INFO:tasks.ceph.ceph_manager.ceph:PG 1.3 is not active+clean 2021-07-25T15:18:21.689 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '1.3', 'version': "0'0", 'reported_seq': 449, 'reported_epoch': 478, 'state': 'down', 'last_fresh': '2021-07-25T15:18:16.232816+0000', 'last_change': '2021-07-25T14:58:09.960717+0000', 'last_active': '0.000000', 'last_peered': '0.000000', 'last_clean': '0.000000', 'last_became_active': '0.000000', 'last_became_peered': '0.000000', 'last_unstale': '2021-07-25T15:18:16.232816+0000', 'last_undegraded': '2021-07-25T15:18:16.232816+0000', 'last_fullsized': '2021-07-25T15:18:16.232816+0000', 'mapping_epoch': 31, 'log_start': "0'0", 'ondisk_log_start': "0'0", 'created': 18, 'last_epoch_clean': 22, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2021-07-25T14:57:55.882472+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2021-07-25T14:57:55.882472+0000', 'last_clean_scrub_stamp': '2021-07-25T14:57:55.882472+0000', 'log_size': 0, 'ondisk_log_size': 0, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'stat_sum': {'num_bytes': 0, 'num_objects': 0, 'num_object_clones': 0, 'num_object_copies': 0, 
'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 0, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 0, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 0, 'num_write_kb': 0, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 0, 'num_bytes_recovered': 0, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [4, 6, 5], 'acting': [4, 6, 5], 'avail_no_missing': [], 'object_location_counts': [], 'blocked_by': [1], 'up_primary': 4, 'acting_primary': 4, 'purged_snaps': []} 2021-07-25T15:18:21.689 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last): File "/home/teuthworker/src/github.com_ceph_ceph-c_3df2a2c35be348281dd870cd64ba15314e6776a9/qa/tasks/ceph_manager.py", line 188, in wrapper return func(self) File "/home/teuthworker/src/github.com_ceph_ceph-c_3df2a2c35be348281dd870cd64ba15314e6776a9/qa/tasks/ceph_manager.py", line 1411, in _do_thrash self.choose_action()() File "/home/teuthworker/src/github.com_ceph_ceph-c_3df2a2c35be348281dd870cd64ba15314e6776a9/qa/tasks/ceph_manager.py", line 979, in test_pool_min_size self.ceph_manager.wait_for_clean(timeout=self.config.get('timeout')) File "/home/teuthworker/src/github.com_ceph_ceph-c_3df2a2c35be348281dd870cd64ba15314e6776a9/qa/tasks/ceph_manager.py", line 2719, in wait_for_clean 'wait_for_clean: failed before timeout expired' AssertionError: wait_for_clean: failed before timeout expired
rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/normal mon_election/connectivity msgr-failures/few rados recovery-overrides/{more-async-partial-recovery} supported-random-distro$/{rhel_8} thrashers/minsize_recovery thrashosds-health workloads/ec-pool-snaps-few-objects-overwrites}
/a/yuriw-2021-07-25_14:33:55-rados-wip-yuri7-testing-master-7.23.21-distro-basic-smithi/6292358 - no logs
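For reference, the fields in the pg dumps above that matter for this failure are `state`, `up`, `acting`, and `blocked_by` (here both stuck PGs report `blocked_by: [1]`, i.e. osd.1 is blocking peering). A small standalone helper, hypothetical and not part of the qa suite, that extracts them from such a dict:

```python
# Hypothetical helper: summarize why a PG from a `ceph pg dump`-style dict
# is not active+clean. Field names taken from the log excerpt above.

def summarize_pg(pg: dict) -> str:
    """Return a one-line diagnosis for a PG stats dict."""
    flags = pg.get('state', 'unknown').split('+')
    if 'active' in flags and 'clean' in flags:
        return f"{pg['pgid']}: healthy ({pg['state']})"
    line = f"{pg['pgid']}: {pg['state']}, up={pg.get('up')}, acting={pg.get('acting')}"
    blocked_by = pg.get('blocked_by', [])
    if blocked_by:
        # Peering is waiting on these OSDs; in this tracker, osd.1 blocks both PGs.
        line += f", blocked by osd(s) {blocked_by}"
    return line

pg = {'pgid': '1.b', 'state': 'down', 'up': [2, 0, 5],
      'acting': [2, 0, 5], 'blocked_by': [1]}
print(summarize_pg(pg))
```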
Updated by Neha Ojha over 2 years ago
/a/yuriw-2021-12-07_16:04:59-rados-wip-yuri5-testing-2021-12-06-1619-distro-default-smithi/6551120
pg map right before Traceback:
{
    "pgs_by_state": [
        {
            "state_name": "active+clean",
            "count": 15
        },
        {
            "state_name": "down",
            "count": 1
        }
    ],
    "num_pgs": 16,
    "num_pools": 1,
    "num_objects": 78,
    "data_bytes": 150511044,
    "bytes_used": 385671168,
    "bytes_avail": 676071677952,
    "bytes_total": 676457349120,
    "inactive_pgs_ratio": 0.0625
}
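Each pgmap in these updates also reports an `inactive_pgs_ratio`. A sketch of how that figure relates to `pgs_by_state`, under the assumption (consistent with the numbers in this thread) that a PG counts as inactive when `active` is absent from its state string:

```python
# Sketch, not Ceph source: derive "inactive_pgs_ratio" from "pgs_by_state".
# Assumption: a PG is inactive when the "active" flag is missing from its
# state string (e.g. "down", "recovery_wait+...+peered").

def inactive_pgs_ratio(pgs_by_state):
    total = sum(e['count'] for e in pgs_by_state)
    inactive = sum(e['count'] for e in pgs_by_state
                   if 'active' not in e['state_name'].split('+'))
    return inactive / total if total else 0.0

pgmap = [
    {'state_name': 'active+clean', 'count': 15},
    {'state_name': 'down', 'count': 1},
]
print(inactive_pgs_ratio(pgmap))  # 1 of 16 PGs inactive -> 0.0625
```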
Updated by Laura Flores over 2 years ago
/a/yuriw-2022-01-12_21:37:22-rados-wip-yuri6-testing-2022-01-12-1131-distro-default-smithi/6611439
last pg map before Traceback:
{
    "pgs_by_state": [
        {
            "state_name": "active+clean",
            "count": 15
        },
        {
            "state_name": "down+remapped",
            "count": 1
        }
    ],
    "num_pgs": 16,
    "num_pools": 1,
    "num_objects": 44,
    "data_bytes": 117580353,
    "bytes_used": 245002240,
    "bytes_avail": 676212346880,
    "bytes_total": 676457349120,
    "inactive_pgs_ratio": 0.0625
}
Updated by Laura Flores about 2 years ago
/a/yuriw-2022-02-05_22:51:11-rados-wip-yuri2-testing-2022-02-04-1646-pacific-distro-default-smithi/6663906
last pg map before Traceback:
{
    "pgs_by_state": [
        {
            "state_name": "active+clean",
            "count": 25
        },
        {
            "state_name": "active+recovery_wait+undersized+degraded+remapped",
            "count": 7
        },
        {
            "state_name": "active+recovering+undersized+degraded+remapped",
            "count": 1
        },
        {
            "state_name": "recovery_wait+undersized+degraded+remapped+peered",
            "count": 1
        }
    ],
    "num_pgs": 34,
    "num_pools": 2,
    "num_objects": 615,
    "data_bytes": 1190117376,
    "bytes_used": 5568552960,
    "bytes_avail": 1057435852800,
    "bytes_total": 1063004405760,
    "inactive_pgs_ratio": 0.029411764815449715,
    "degraded_objects": 297,
    "degraded_total": 3690,
    "degraded_ratio": 0.08048780487804878
}
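The degraded figures in this pgmap are related by a simple division, `degraded_objects / degraded_total` (names from the JSON above; the relationship is an observation about the numbers in this thread, not taken from Ceph source):

```python
# Observation: degraded_ratio in these pgmaps equals
# degraded_objects / degraded_total.

def degraded_ratio(degraded_objects: int, degraded_total: int) -> float:
    return degraded_objects / degraded_total if degraded_total else 0.0

print(degraded_ratio(297, 3690))  # matches the 0.0804... in this pgmap
```

Note the ratio can exceed 1.0, as in the 2022-03-19 update further down where 1081/730 comes out to about 1.48.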
Updated by Laura Flores about 2 years ago
/a/yuriw-2022-02-08_17:00:23-rados-wip-yuri5-testing-2022-02-08-0733-pacific-distro-default-smithi/6670539
last pg map before Traceback:
{
    "pgs_by_state": [
        {
            "state_name": "active+clean",
            "count": 91
        },
        {
            "state_name": "active+recovery_wait+undersized+degraded+remapped",
            "count": 7
        },
        {
            "state_name": "active+undersized+degraded+remapped+backfill_wait",
            "count": 4
        },
        {
            "state_name": "recovery_wait+undersized+degraded+remapped+peered",
            "count": 2
        },
        {
            "state_name": "active+recovering+undersized+degraded+remapped",
            "count": 1
        }
    ],
    "num_pgs": 105,
    "num_pools": 3,
    "num_objects": 351,
    "data_bytes": 812849152,
    "bytes_used": 5592006656,
    "bytes_avail": 1057412399104,
    "bytes_total": 1063004405760,
    "inactive_pgs_ratio": 0.01904761977493763,
    "degraded_objects": 470,
    "degraded_total": 2106,
    "degraded_ratio": 0.22317188983855651
}
Updated by Laura Flores about 2 years ago
/a/yuriw-2022-03-19_14:37:23-rados-wip-yuri6-testing-2022-03-18-1104-distro-default-smithi/6746893
{
    "pgs_by_state": [
        {
            "state_name": "active+clean",
            "count": 40
        },
        {
            "state_name": "recovery_wait+undersized+degraded+remapped+peered",
            "count": 11
        },
        {
            "state_name": "active+recovery_wait+undersized+degraded+remapped",
            "count": 10
        },
        {
            "state_name": "active+remapped+backfill_wait",
            "count": 2
        },
        {
            "state_name": "active+recovery_wait",
            "count": 1
        },
        {
            "state_name": "active+recovering+undersized+remapped",
            "count": 1
        }
    ],
    "num_pgs": 65,
    "num_pools": 3,
    "num_objects": 123,
    "data_bytes": 380551712,
    "bytes_used": 3641274368,
    "bytes_avail": 866089603072,
    "bytes_total": 869730877440,
    "inactive_pgs_ratio": 0.16923077404499054,
    "degraded_objects": 1081,
    "degraded_total": 730,
    "degraded_ratio": 1.4808219178082191,
    "misplaced_objects": 30,
    "misplaced_total": 730,
    "misplaced_ratio": 0.0410958904109589
}
Updated by Laura Flores about 2 years ago
/a/yuriw-2022-03-24_16:44:32-rados-wip-yuri-testing-2022-03-24-0726-distro-default-smithi/6758238
Updated by Laura Flores about 2 years ago
/a/yuriw-2022-03-25_18:42:52-rados-wip-yuri7-testing-2022-03-24-1341-pacific-distro-default-smithi/6761700
{
    "pgs_by_state": [
        {
            "state_name": "active+clean",
            "count": 29
        },
        {
            "state_name": "active+recovery_wait+undersized+degraded+remapped",
            "count": 3
        },
        {
            "state_name": "recovery_wait+undersized+degraded+remapped+peered",
            "count": 2
        },
        {
            "state_name": "active+recovering+undersized+degraded+remapped",
            "count": 1
        }
    ],
    "num_pgs": 35,
    "num_pools": 3,
    "num_objects": 198,
    "data_bytes": 162054144,
    "bytes_used": 1095106560,
    "bytes_avail": 1061909299200,
    "bytes_total": 1063004405760,
    "inactive_pgs_ratio": 0.05714285746216774,
    "degraded_objects": 124,
    "degraded_total": 1188,
    "degraded_ratio": 0.10437710437710437
}
Updated by Neha Ojha about 2 years ago
Laura Flores wrote:
/a/yuriw-2022-03-25_18:42:52-rados-wip-yuri7-testing-2022-03-24-1341-pacific-distro-default-smithi/6761700
[...]
This does not have down PGs, some other bug maybe?
Updated by Laura Flores almost 2 years ago
Neha Ojha wrote:
Laura Flores wrote:
/a/yuriw-2022-03-25_18:42:52-rados-wip-yuri7-testing-2022-03-24-1341-pacific-distro-default-smithi/6761700
[...]
This does not have down PGs, some other bug maybe?
Yeah, that one looks very similar pg-map-wise to #51076. It checks out too, since that test uses the "mapgap" thrasher.
Description: rados/thrash-erasure-code-big/{ceph cluster/{12-osds openstack} mon_election/connectivity msgr-failures/few objectstore/bluestore-hybrid rados recovery-overrides/{default} supported-random-distro$/{ubuntu_latest} thrashers/mapgap thrashosds-health workloads/ec-rados-plugin=jerasure-k=4-m=2}
Updated by Laura Flores almost 2 years ago
- Backport changed from quincy, pacific to quincy, pacific, octopus
Updated by Radoslaw Zarzynski almost 2 years ago
Not urgent, perhaps not low-hanging-fruit but still good as a training issue.
Updated by Laura Flores almost 2 years ago
Other reported instances of this `wait_for_clean` assertion failure where the pgmap has a pg stuck in recovery have likely been fixed by #51076. The following, however, seems to be an example of what this tracker is reporting: the assertion failure happened in a `rados/thrash-erasure-code-overwrites` test, and there were down pgs:
/a/yuriw-2022-05-27_21:59:17-rados-wip-yuri-testing-2022-05-27-0934-distro-default-smithi/6851025
Description: rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/normal mon_election/connectivity msgr-failures/few rados recovery-overrides/{more-async-recovery} supported-random-distro$/{rhel_8} thrashers/minsize_recovery thrashosds-health workloads/ec-pool-snaps-few-objects-overwrites}
Last pgmap before the assertion failure:
{
    "pgs_by_state": [
        {
            "state_name": "active+clean",
            "count": 12
        },
        {
            "state_name": "down",
            "count": 3
        },
        {
            "state_name": "active+clean+scrubbing",
            "count": 1
        }
    ],
    "num_pgs": 16,
    "num_pools": 1,
    "num_objects": 43,
    "data_bytes": 112235760,
    "bytes_used": 221577216,
    "bytes_avail": 676235771904,
    "bytes_total": 676457349120,
    "inactive_pgs_ratio": 0.1875
}
Updated by Laura Flores almost 2 years ago
This `wait_for_clean` assertion failure is happening with the minsize_recovery thrasher, which is used by the rados/thrash-erasure-code and rados/thrash-erasure-code-overwrites tests. This suggests that the minsize_recovery thrasher is somehow involved. See https://tracker.ceph.com/issues/49777#note-10 for more recorded instances.
Updated by Laura Flores almost 2 years ago
- Related to Bug #49777: test_pool_min_size: 'check for active or peered' reached maximum tries (5) after waiting for 25 seconds added
Updated by Laura Flores almost 2 years ago
- Related to Bug #54511: test_pool_min_size: AssertionError: not clean before minsize thrashing starts added
Updated by Nitzan Mordechai almost 2 years ago
Talked with Ronen Friedman regarding this issue; it may be related to another bug he is working on, where a scrub did not complete.
Updated by Laura Flores almost 2 years ago
Oh interesting. Is there a reason that bug would only be affecting minsize_recovery?
Updated by Laura Flores almost 2 years ago
Found an instance of this failure that did not use minsize_recovery. It's possible that it has a different root cause, but here we're also seeing down pgs in the pgmap:
Description: rados/thrash-erasure-code-isa/{arch/x86_64 ceph clusters/{fixed-2 openstack} mon_election/connectivity msgr-failures/fastclose objectstore/bluestore-comp-zstd rados recovery-overrides/{more-async-recovery} supported-random-distro$/{rhel_8} thrashers/mapgap thrashosds-health workloads/ec-rados-plugin=isa-k=2-m=1}
/a/yuriw-2022-06-22_22:13:20-rados-wip-yuri3-testing-2022-06-22-1121-pacific-distro-default-smithi/6892939
{
    "pgs_by_state": [
        {
            "state_name": "active+clean",
            "count": 34
        },
        {
            "state_name": "down+remapped",
            "count": 1
        }
    ],
    "num_pgs": 35,
    "num_pools": 3,
    "num_objects": 112,
    "data_bytes": 239910912,
    "bytes_used": 2221826048,
    "bytes_avail": 577598758912,
    "bytes_total": 579820584960,
    "inactive_pgs_ratio": 0.02857142873108387
}
Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago
/a/ksirivad-2022-07-01_21:00:49-rados:thrash-erasure-code-main-distro-default-smithi/6910169/
Updated by Ronen Friedman almost 2 years ago
Kamoltat Sirivadhna wrote:
/a/ksirivad-2022-07-01_21:00:49-rados:thrash-erasure-code-main-distro-default-smithi/6910169/
That one failed in ~Mon(), asserting on session_map.sessions.empty()
Updated by Neha Ojha almost 2 years ago
/a/ksirivad-2022-07-01_21:00:49-rados:thrash-erasure-code-main-distro-default-smithi/6910169/ first timed out, and then we see the shutdown crash:
2022-07-01T23:42:43.073 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
...
2022-07-01T23:42:50.766 INFO:tasks.ceph.mon.c.smithi105.stderr:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-13311-g57487367/rpm/el8/BUILD/ceph-17.0.0-13311-g57487367/src/mon/Monitor.cc: 285: FAILED ceph_assert(session_map.sessions.empty())
Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago
- Subject changed from AssertionError: wait_for_clean: failed before timeout expired due to down PGs to test_pool_min_size:AssertionError:wait_for_clean:failed before timeout expired due to down PGs
- Assignee changed from Nitzan Mordechai to Kamoltat (Junior) Sirivadhna
- Pull request ID set to 47138
Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago
- Status changed from New to Fix Under Review
Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago
I was able to reproduce the problem after modifying qa/tasks/ceph_manager.py: https://github.com/ceph/ceph/pull/46931/commits/1f6bcbb3d680d8589e498b993d2cf566480e2c3e.
Here are the runs that I reproduced with the modified test PR:
ksirivad-2022-07-09_05:39:52-rados:thrash-erasure-code-main-distro-default-smithi/6921376
ksirivad-2022-07-14_03:50:33-rados:thrash-erasure-code-main-distro-default-smithi/6929814
ksirivad-2022-07-14_03:50:33-rados:thrash-erasure-code-main-distro-default-smithi/6929818
ksirivad-2022-07-14_03:50:33-rados:thrash-erasure-code-main-distro-default-smithi/6929819
ksirivad-2022-07-14_03:50:33-rados:thrash-erasure-code-main-distro-default-smithi/6929855
The Problem:
The problem was that the ceph_manager.wait_for_clean() call in test_pool_min_size() did not have a long enough timeout. According to my analysis (https://docs.google.com/document/d/1YQeMF2Rc_unQvS2GAtLRXTJEb9a-D66McUQbBdXWZAE/edit?usp=sharing), the PGs were temporarily in states like `snaptrim` and `backfill_full`, but eventually left those states and became active+clean; sometimes 60 seconds is simply not enough.
The Solution:
Increase the timeout of the ceph_manager.wait_for_clean() call in test_pool_min_size() from 60s to 180s.
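The effect of the fix can be illustrated with a self-contained sketch. `wait_for_clean` below is a simplified, hypothetical stand-in for the helper in qa/tasks/ceph_manager.py, polling once per simulated second:

```python
import itertools

def wait_for_clean(get_num_unclean, timeout):
    """Simplified sketch of the qa-suite poll (hypothetical, not the real
    helper): wait until no unclean PGs remain, else fail like the tracker."""
    for elapsed in itertools.count():
        if get_num_unclean(elapsed) == 0:
            return elapsed
        assert elapsed < timeout, \
            'wait_for_clean: failed before timeout expired'

# PGs that sit in snaptrim/backfill_full for 90 simulated seconds
# before going active+clean:
slow_recovery = lambda elapsed: 0 if elapsed >= 90 else 1

try:
    wait_for_clean(slow_recovery, timeout=60)
except AssertionError as e:
    print(e)                                       # the failure from this tracker
print(wait_for_clean(slow_recovery, timeout=180))  # 90 -> passes
```

With the old 60s timeout the simulated cluster trips the assertion even though it would have gone clean at 90s; with 180s it settles well within the limit.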
Updated by Kamoltat (Junior) Sirivadhna over 1 year ago
- Status changed from Fix Under Review to Pending Backport
- Backport changed from quincy, pacific, octopus to quincy, pacific
Updated by Backport Bot over 1 year ago
- Copied to Backport #57025: quincy: test_pool_min_size:AssertionError:wait_for_clean:failed before timeout expired due to down PGs added
Updated by Backport Bot over 1 year ago
- Copied to Backport #57026: pacific: test_pool_min_size:AssertionError:wait_for_clean:failed before timeout expired due to down PGs added
Updated by Neha Ojha about 1 year ago
- Status changed from Pending Backport to Resolved
Updated by Laura Flores about 1 year ago
- Related to Bug #59172: test_pool_min_size: AssertionError: wait_for_clean: failed before timeout expired due to down PGs added