Fix #6278
closedosd: throttle snap trimming
0%
Description
Qemu guests on our cluster experience high I/O latency, stalls or complete halts when spindle contention is created by Ceph performing maintenance work (scrub, deep-scrub, peering, recovery, or backfilling).
Under non-scrub conditions, a graph of util from 'iostat -x' shows consistent 10-15% load across all OSDs. Client i/o performance is good.
When scrub or deep-scrub starts, util on several or most OSDs approach 100 indicating spindle contention. Client i/o tends to have significant read latency and some guest becomes quite sluggish. Some applications experience multi-second pauses.
In the case of peering, recovery, and/or backfilling we see situations where client i/o will completely stall on some instances for the entire duration of the recovery, or can take the form of i/o starting/stopping seemingly randomly during the recovery. On other instances we see, i/o proceed at a fraction of the expected norm. Or we see a combination of these conditions.
Ceph should prioritize client i/o more effectively.