osd: throttle snap trimming
QEMU guests on our cluster experience high I/O latency, stalls, or complete halts when spindle contention is created by Ceph maintenance work (scrub, deep-scrub, peering, recovery, or backfilling).
Under non-scrub conditions, a graph of %util from 'iostat -x' shows a consistent 10-15% load across all OSDs, and client I/O performance is good.
When scrub or deep-scrub starts, %util on several or most OSDs approaches 100%, indicating spindle contention. Client I/O tends to see significant read latency, some guests become quite sluggish, and some applications experience multi-second pauses.
In the case of peering, recovery, and/or backfilling, we see situations where client I/O stalls completely on some instances for the entire duration of the recovery, or starts and stops seemingly at random during the recovery. On other instances, I/O proceeds at a fraction of the expected norm. Or we see a combination of these conditions.
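The utilization numbers described above can be watched live with iostat; a minimal sketch (the 5-second interval is arbitrary):

```shell
# Report extended per-device statistics every 5 seconds;
# a %util column sustained near 100% suggests spindle saturation.
iostat -x 5
```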
Ceph should prioritize client i/o more effectively.
#2 Updated by David Zafman about 6 years ago
The fix for bug 6291 resolves an issue with recovery using more resources than it should.
A workaround is to disable deep-scrubbing, which runs once a week by default. You can also change the osd_deep_scrub_interval configuration value to adjust this frequency.
To disable deep-scrub:
$ ceph osd set nodeep-scrub
Even with the nodeep-scrub flag set, you can still manually initiate a deep-scrub during off-hours using:
$ ceph osd deep-scrub '*'
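A sketch of the complementary commands, for checking the flag and undoing the workaround later (exact status output varies by Ceph version):

```shell
# Verify the flag is set (it appears among the cluster flags in the status output):
ceph status | grep nodeep-scrub

# Clear the flag to re-enable automatic deep-scrubbing:
ceph osd unset nodeep-scrub
```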
#3 Updated by Mike Dawson about 6 years ago
For a few weeks, I've had to run with noscrub and nodeep-scrub, because both can cause 100% spindle contention resulting in client i/o suffering periodic stalls and/or limited throughput.
I'm hoping for a patch that prioritizes client i/o over maintenance work which would allow me to re-enable scrub and deep-scrub.
I'll test recovery with fix in #6291 as soon as I can.
#4 Updated by Andrei Mikhailovsky about 6 years ago
It seems that I am also affected by this issue, with 3-5 virtual machines crashing on a daily basis and a lot of hung tasks showing up in dmesg output. In particular, servers with java and mysqld processes seem to be more prone to crashes.
Disabling scrubbing is not ideal, as it might cause long-term data corruption to go unnoticed.
Could you please prioritize this problem? It makes maintaining services running on a Ceph cluster a nightmare.
#5 Updated by Wade Rencsok about 6 years ago
It's possible that SQL-type workloads and Ceph have an issue. I am going to investigate that on my own pre-production clusters, where our SQL-type customer workloads suffer significantly on the performance side compared to the stellar performance of other workloads. My suspicion is that, depending on the database size, locking a table can lock far more drives (PGs/chunks) than anticipated.
#6 Updated by Andrei Mikhailovsky about 6 years ago
I think this issue is not related to SQL locking, at least not in my case. I've done as Mike suggested and disabled scrubbing and deep-scrubbing. So far, so good: the crashing servers survived overnight without any crashes or hung tasks. Let's keep it that way.
Regarding the poor SQL performance: I've noticed that my small Ceph cluster has a bunch of issues handling a large number of random 4k reads/writes. During my benchmarking tests (tried with iozone, fio, and simple dd with iflag/oflag=direct) I am seeing extremely poor performance compared to the NFS filesystem I previously had on the same hardware. While working on a small file (2-6GB in size, which fits in memory on the storage servers) I was getting 60-70k 4k random read IOPS with the above benchmarks on NFS. On Ceph I can barely get past 2-2.5k, which is 25-30 times slower. Both NFS and Ceph are serving that data from RAM, as I do not see any disk activity at all during the tests. So I had to conclude that, in its current state, Ceph is not designed for working with databases or reading/writing a large number of very small files. Perhaps a new ticket should be opened for this, as it doesn't seem related to the current bug.
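For reference, a fio invocation approximating the 4k random read test described above might look like this (the file path, size, runtime, and queue depth are assumptions, not the exact parameters used):

```shell
# Hypothetical 4k random read benchmark with direct I/O, similar to the tests above.
fio --name=randread --filename=/mnt/test/fio.dat --size=2G \
    --rw=randread --bs=4k --direct=1 --ioengine=libaio \
    --iodepth=32 --runtime=60 --time_based --group_reporting
```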
#8 Updated by Andrei Mikhailovsky almost 6 years ago
Wade, could you please share with us what kernel tuning you've done? I would like to try it as I am still having some issues with the cluster after disabling scrubbing and deep scrubbing.
By the way, have you just disabled deep scrubbing or normal scrubbing as well?
#12 Updated by Olivier Bonvalet over 5 years ago
Hi, I see that a patch for this is present:
- in dumpling (0.67.9) :
osd: allow snap trim throttling with simple delay (#6278, Sage Weil)
- and firefly (0.80.1) :
osd: add simple throttle for snap trimming (Sage Weil)
I upgraded the cluster to 0.67.9 yesterday (from 0.67.8), but don't see any difference in the I/O monitoring. How can we enable/tune this "simple delay"?
#13 Updated by Olivier Bonvalet over 5 years ago
Ok, I found the commit : https://github.com/ceph/ceph/commit/4e5e41deeaf91c885773d90e6f94da60f6d4efd3
We have to set "osd_snap_trim_sleep", which defaults to 0 (no change in behavior). I will try to tune that, thanks a lot!
#15 Updated by Sage Weil over 5 years ago
Olivier Bonvalet wrote:
Sorry for the noise; one clarification, please: if I want to throttle to N IOs per OSD, is the correct formula 1 * osd_disk_threads / N?
osd_snap_trim_sleep is actually just a time value, in seconds, that the OSD will sleep before submitting the next snap trim transaction. It's a very coarse knob, but it does do the trick to reduce the snap trim load. I would start with something like .005 and go up or down from there. You can inject this value into running ceph-osd daemons to modify behavior on the fly (ceph daemon osd.NNN config set osd_snap_trim_sleep .005)
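A sketch of applying the setting both persistently and at runtime (osd.0 and the 0.005 value are illustrative placeholders):

```shell
# Persist in ceph.conf so the value survives daemon restarts:
#   [osd]
#   osd snap trim sleep = 0.005

# Apply to one running OSD via its local admin socket:
ceph daemon osd.0 config set osd_snap_trim_sleep 0.005

# Or push to every OSD from a node with cluster access:
ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.005'
```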