Bug #54396: Setting osd_pg_max_concurrent_snap_trims to 0 prematurely clears the snaptrim queue

Status: Closed
% Done: 0%
Backport: octopus, pacific, quincy
Regression: No
Severity: 3 - minor
Description
See https://www.spinics.net/lists/ceph-users/msg71061.html
This time around, after a few hours of snaptrimming, users complained of high IO latency, and indeed Ceph reported "slow ops" on a number of OSDs and on the active MDS. I attributed this to the snaptrimming and decided to reduce it, initially setting osd_pg_max_concurrent_snap_trims to 1, which didn't seem to help much. I then set it to 0, which had the surprising effect of transitioning all PGs back to active+clean (is this intended?). I also restarted the MDS, which seemed to be struggling. IO latency went back to normal immediately.
In the code, when osd_pg_max_concurrent_snap_trims is 0, PrimaryLogPG::AwaitAsyncWork::react(const DoSnapWork&) calls pg->snap_mapper.get_next_objects_to_trim asking for at most 0 objects to trim. get_next_objects_to_trim returns -ENOENT in that case, which the handler interprets as "nothing left to trim for this snap", so it erases snap_to_trim from the trim queue even though the snap has not actually been trimmed.
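Below is a minimal, self-contained C++ sketch of that control flow. It is not the actual Ceph source: the names mirror the description above, but the types and function bodies are simplified stand-ins for illustration only.

    // Simplified illustration of the bug described above (not Ceph code).
    #include <cerrno>
    #include <cstdio>
    #include <set>
    #include <vector>

    struct Object { int id; };

    // Stand-in for SnapMapper::get_next_objects_to_trim(): with max == 0 the
    // output stays empty, so the function reports -ENOENT ("nothing left for
    // this snap") even though objects may still remain to be trimmed.
    int get_next_objects_to_trim(unsigned max, std::vector<Object>* out) {
      static std::vector<Object> remaining = {{1}, {2}, {3}};
      for (unsigned i = 0; i < max && !remaining.empty(); ++i) {
        out->push_back(remaining.back());
        remaining.pop_back();
      }
      return out->empty() ? -ENOENT : 0;
    }

    int main() {
      unsigned osd_pg_max_concurrent_snap_trims = 0;  // the problematic setting
      std::set<int> snap_trimq = {42};  // a snap still queued for trimming
      int snap_to_trim = *snap_trimq.begin();

      std::vector<Object> to_trim;
      int r = get_next_objects_to_trim(osd_pg_max_concurrent_snap_trims, &to_trim);
      if (r == -ENOENT) {
        // The handler treats -ENOENT as "snap fully trimmed" and drops it from
        // the queue, even though nothing was actually trimmed.
        snap_trimq.erase(snap_to_trim);
        std::printf("snap %d removed from snap_trimq without trimming anything\n",
                    snap_to_trim);
      }
      return 0;
    }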