Bug #57043

Snaptrimmer can ignore osd_snap_trim_sleep

Added by Kellen Renshaw over 1 year ago. Updated 7 months ago.

Status: New
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Observed on a cluster upgraded from Ceph 16.2.6 to 16.2.9. The cluster experiences delays in the async messenger processing OSD heartbeats, resulting in OSDs flapping and eventually hitting the suicide timeout. Mitigated by setting the nosnaptrim flag on the cluster, which allows the cluster to stabilize.

Traced to https://tracker.ceph.com/issues/52026, which caused snap trims to become significantly backed up for several PGs in the cluster. On repeer/OSD restart, snaptrim processing swamps the async messenger with pg_info2 messages, delaying heartbeats and leading to flapping. This happens even with osd_heartbeat_grace set to 100 seconds.

Increasing osd_snap_trim_sleep to 5.0 had no noticeable impact on the flapping behavior. The issue was traced to logic in boost::statechart::result PrimaryLogPG::AwaitAsyncWork::react(const DoSnapWork&) in PrimaryLogPG.cc (git tag v16.2.9), specifically line 15316, where the sleep is skipped if get_next_objects_to_trim() returns -ENOENT.
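
For reference, the shape of that branch is roughly as follows. This is a condensed paraphrase of the cited code, not the verbatim source: error handling, logging, and the snap_trimq_repeat bookkeeping are omitted, and the comments reflect my reading of the flow.

    // Condensed paraphrase of PrimaryLogPG::AwaitAsyncWork::react(const DoSnapWork&)
    // from PrimaryLogPG.cc around git tag v16.2.9; abbreviated, not verbatim.
    boost::statechart::result PrimaryLogPG::AwaitAsyncWork::react(const DoSnapWork&)
    {
      ...
      vector<hobject_t> to_trim;
      int r = pg->snap_mapper.get_next_objects_to_trim(snap_to_trim, max, &to_trim);
      if (r == -ENOENT) {
        // Nothing left to trim for this snap: finalize it and restart immediately.
        pg->snap_trimq.erase(snap_to_trim);
        pg->info.purged_snaps.insert(snap_to_trim);
        pg->share_pg_info();             // sends pg_info2 to all peers
        post_event(KickTrim());          // kicks off the next snap right away
        return transit< NotTrimming >(); // never passes through WaitTrimTimer,
                                         // so osd_snap_trim_sleep is not applied
      }
      // Otherwise trims are issued for to_trim; once those repops complete, the
      // state machine enters WaitTrimTimer, which does honor osd_snap_trim_sleep.
      ...
    }

With a long backlog of fully-trimmed snaps, each iteration hits the -ENOENT branch, so the state machine loops through snaps back-to-back with no sleep between them.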

On -ENOENT, the snaptrim process removes the snap from the trim queue, marks it purged, runs share_pg_info(), and restarts the state machine, ignoring osd_snap_trim_sleep. On this cluster, the result is the snaptrim process swamping the async messenger with large pg_info2 messages.
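
For contrast, the sleep is normally applied by the WaitTrimTimer state, which the -ENOENT branch above bypasses entirely. Roughly, again as a condensed paraphrase of the pacific-era state in PrimaryLogPG.h (the accessor and timer member paths are my reading of that code, not a verbatim excerpt):

    // Condensed paraphrase of the WaitTrimTimer state (PrimaryLogPG.h): entered
    // after a batch of object trims completes, it arms a timer for
    // osd_snap_trim_sleep seconds before allowing the next DoSnapWork.
    explicit WaitTrimTimer(my_context ctx)
      : my_base(ctx),
        NamedState(nullptr, "Trimming/WaitTrimTimer")
    {
      auto *pg = context< SnapTrimmer >().pg;
      float sleep = pg->osd->osd->get_osd_snap_trim_sleep();
      if (sleep > 0) {
        // Schedule a wakeup after the sleep interval; the resulting
        // SnapTrimTimerReady event re-enters AwaitAsyncWork.
        wakeup = pg->osd->sleep_timer.add_event_after(sleep, new OnTimer{...});
      } else {
        post_event(SnapTrimTimerReady());
      }
    }

Since the -ENOENT path transits straight to NotTrimming and posts KickTrim(), this timer is never armed between successive already-trimmed snaps, regardless of the configured osd_snap_trim_sleep value.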

History

#1 Updated by Pawel Stefanski 7 months ago

Hello, I think I've hit a very similar or identical situation. Have you found any parameter that helps work around this, or can we just wait for a fix of the logic there?

Btw, do you maybe have an OSD debug 20 log from the affected OSD?
