Fix #6278

osd: throttle snap trimming

Added by Mike Dawson over 7 years ago. Updated over 6 years ago.


Community (user)

Qemu guests on our cluster experience high I/O latency, stalls or complete halts when spindle contention is created by Ceph performing maintenance work (scrub, deep-scrub, peering, recovery, or backfilling).

Under non-scrub conditions, a graph of util from 'iostat -x' shows consistent 10-15% load across all OSDs. Client i/o performance is good.

When scrub or deep-scrub starts, util on several or most OSDs approaches 100%, indicating spindle contention. Client i/o tends to have significant read latency and some guests become quite sluggish. Some applications experience multi-second pauses.

In the case of peering, recovery, and/or backfilling we see situations where client i/o completely stalls on some instances for the entire duration of the recovery, or starts and stops seemingly at random during the recovery. On other instances, we see i/o proceed at a fraction of the expected norm. Or we see a combination of these conditions.

Ceph should prioritize client i/o more effectively.

Related issues

Duplicated by Ceph - Bug #6826: Non-equal performance of 'freshly joined' OSDs Duplicate 11/20/2013


#1 Updated by Mike Dawson over 7 years ago

We are collecting perf dump metrics from all OSD and RBD admin sockets. What metrics would be the most useful to graph/analyze in tracking down these RBD client i/o stalls and halts?

#2 Updated by David Zafman over 7 years ago

The fix for bug 6291 resolves an issue with recovery using more resources than it should.

A workaround is to disable deep-scrubbing, which runs once a week by default. You could also change the osd_deep_scrub_interval configuration value to change this frequency.

To disable deep-scrub:
$ ceph osd set nodeep-scrub

Even with the nodeep-scrub flag still set you can manually initiate deep-scrub during off-hours using:
$ ceph osd deep-scrub '*'
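
The two commands above could be wrapped in a small off-hours job. The following is only a sketch: the function names and the dry-run default are illustrative assumptions, and the only Ceph commands used are the two quoted in this comment.

```python
import subprocess
from typing import List


def deep_scrub_commands(disable_scheduled: bool = True) -> List[List[str]]:
    """Build the argv lists for a manual, off-hours deep-scrub.

    Hypothetical helper: it only assembles the two commands quoted above.
    """
    cmds: List[List[str]] = []
    if disable_scheduled:
        # Keep scheduled deep-scrubs off so they do not start during peak hours.
        cmds.append(["ceph", "osd", "set", "nodeep-scrub"])
    # '*' is passed as a literal argument; no shell is involved, so no quoting.
    cmds.append(["ceph", "osd", "deep-scrub", "*"])
    return cmds


def run(cmds: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for argv in cmds:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run(deep_scrub_commands())  # dry run: prints the commands only
```

Run from cron during a quiet window with dry_run=False, this would trigger deep-scrubs manually while leaving the nodeep-scrub flag in place.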

#3 Updated by Mike Dawson over 7 years ago

For a few weeks, I've had to run with noscrub and nodeep-scrub, because both can cause 100% spindle contention resulting in client i/o suffering periodic stalls and/or limited throughput.

I'm hoping for a patch that prioritizes client i/o over maintenance work which would allow me to re-enable scrub and deep-scrub.

I'll test recovery with the fix in #6291 as soon as I can.

#4 Updated by Andrei Mikhailovsky over 7 years ago

It seems to me that I am also affected by this issue, with 3-5 virtual machines crashing on a daily basis and a lot of hung tasks showing up in dmesg output. In particular, servers with java and mysqld processes seem to be more prone to crashes.

Disabling scrubbing is not ideal as it might cause long-term data corruption.

Could you please prioritize this problem, as it makes it a nightmare to maintain services running on a ceph cluster.


#5 Updated by Wade Rencsok over 7 years ago

It's possible that sql-type workflows and ceph have an issue. I am going to investigate that on my own pre-production clusters, where our SQL-type customer workflows suffer significantly on the performance side compared to the stellar performance of other workflows. My suspicion is that, depending on the database size, locking a table can lock far more drives (PGs/chunks) than anticipated.

#6 Updated by Andrei Mikhailovsky over 7 years ago


I think this issue is not related to the SQL locking. At least not in my case. I've done as suggested by Mike and disabled scrubbing and deep scrubbing. So far, so good: the crashing servers survived overnight without any crashes or hung tasks. Let's keep it that way.

Regarding the poor sql performance: I've noticed that my small ceph cluster has trouble handling a large number of random 4k reads/writes. During my benchmarking tests (tried with iozone, fio and simple dd with iflag/oflag=direct) I am seeing extremely poor performance compared to the nfs filesystem which I previously had on the same hardware. While working on a small file (2-6GB in size, which fits in memory on the storage servers) I was getting between 60-70k 4k random read iops with the above benchmarks on nfs. On ceph I can barely get past 2-2.5k, which is 25-30 times slower. Both nfs and ceph are serving that data from ram, as I do not see any disk activity at all during the tests. So I had to conclude that in its current state ceph is not designed for working with databases or reading/writing a large number of very small files. Perhaps you need to open a new ticket for this as it doesn't seem to relate to the current bug.

#7 Updated by Wade Rencsok over 7 years ago

We seemed to hit this issue (not the sql red herring) when deep scrubs kicked off. Disabling all scrubbing and some kernel tuning has kept the cluster stable so far. If we have the issue again, I'll update this bug.

#8 Updated by Andrei Mikhailovsky over 7 years ago

Wade, could you please share with us what kernel tuning you've done? I would like to try it as I am still having some issues with the cluster after disabling scrubbing and deep scrubbing.

By the way, have you just disabled deep scrubbing or normal scrubbing as well?

#9 Updated by Sage Weil over 7 years ago

  • Tracker changed from Bug to Fix
  • Project changed from rbd to Ceph
  • Subject changed from QEMU/RBD Client I/O Stalls or Halts Due to Spindle Contention from Ceph Maintainance to osd: throttle snap trimming
  • Target version set to v0.73

#10 Updated by Sage Weil over 7 years ago

  • Target version deleted (v0.73)

#11 Updated by geraint jones about 7 years ago

We are seeing the very same issue. Is there an ETA or can we sponsor someone to fix it :D ?

#12 Updated by Olivier Bonvalet almost 7 years ago


I see that a patch for this is present
  • in dumpling (0.67.9) :

osd: allow snap trim throttling with simple delay (#6278, Sage Weil)

  • and firefly (0.80.1) :

osd: add simple throttle for snap trimming (Sage Weil)

I upgraded the cluster to 0.67.9 yesterday (from 0.67.8), but don't see any difference in IO monitoring. How can we enable/tune this «simple delay»?

#13 Updated by Olivier Bonvalet almost 7 years ago

Ok, I found the commit :

We have to set «osd_snap_trim_sleep» which by default is 0 (no change in behavior). I will try to tune that, thanks a lot !

#14 Updated by Olivier Bonvalet almost 7 years ago

Sorry for the noise, one clarification please: if I want to throttle to N io per OSD, the correct formula is: 1 * osd_disk_threads / N, right?

#15 Updated by Sage Weil almost 7 years ago

Olivier Bonvalet wrote:

Sorry for the noise, one clarification please: if I want to throttle to N io per OSD, the correct formula is: 1 * osd_disk_threads / N, right?

osd_snap_trim_sleep is actually just a time value, in seconds, that the OSD will sleep before submitting the next snap trim transaction. It's a very coarse knob, but it does do the trick to reduce the snap trim load. I would start with something like .005 and go up or down from there. You can inject this value into running ceph-osd daemons to modify behavior on the fly (ceph daemon osd.NNN config set osd_snap_trim_sleep .005).
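
As a rough illustration of why a plain sleep acts as a throttle, the sketch below models a single trimming thread that sleeps for osd_snap_trim_sleep seconds before each transaction. This is a simplified assumption, not Ceph's actual code path. The function names and the per-transaction time parameter are hypothetical.

```python
def max_trim_rate(sleep_s: float, txn_time_s: float = 0.0) -> float:
    """Upper bound on snap trim transactions per second for one modelled thread.

    Assumed model: every transaction costs `sleep_s` seconds of sleep plus
    `txn_time_s` seconds of actual work.
    """
    period = sleep_s + txn_time_s
    if period <= 0:
        return float("inf")  # a sleep of 0 (the default) means no throttling
    return 1.0 / period


def sleep_for_target_rate(target_rate: float, txn_time_s: float = 0.0) -> float:
    """Sleep value in seconds that caps the modelled rate at `target_rate`."""
    if target_rate <= 0:
        raise ValueError("target_rate must be positive")
    # If the work alone is already slower than the target, no sleep is needed.
    return max(0.0, 1.0 / target_rate - txn_time_s)


if __name__ == "__main__":
    for sleep_s in (0.005, 0.05, 0.5):
        rate = max_trim_rate(sleep_s)
        print(f"osd_snap_trim_sleep={sleep_s}: at most {rate:.0f} trims/s")
```

In this model, the suggested starting value of .005 caps one thread at about 200 trim transactions per second. Real throughput will be lower and depends on the disks, so the value still has to be tuned by observation.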

#16 Updated by Sage Weil over 6 years ago

  • Target version set to 0.85

#17 Updated by Sage Weil over 6 years ago

  • Status changed from New to Resolved
