osd: throttle snap trimming
QEMU guests on our cluster experience high I/O latency, stalls, or complete halts when spindle contention is created by Ceph maintenance work (scrub, deep-scrub, peering, recovery, or backfilling).
Under non-scrub conditions, a graph of %util from 'iostat -x' shows a consistent 10-15% load across all OSDs, and client I/O performance is good.
When scrub or deep-scrub starts, %util on several or most OSDs approaches 100%, indicating spindle contention. Client I/O tends to see significant read latency, some guests become quite sluggish, and some applications experience multi-second pauses.
In the case of peering, recovery, and/or backfilling, we see situations where client I/O stalls completely on some instances for the entire duration of the recovery, or starts and stops seemingly at random during the recovery. On other instances, I/O proceeds at a fraction of the expected norm. Or we see a combination of these conditions.
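The utilization numbers described above can be watched live with iostat; a minimal sketch (the 5-second interval is arbitrary):

```shell
# Report extended per-device statistics every 5 seconds;
# a %util column sustained near 100% suggests spindle saturation.
iostat -x 5
```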
Ceph should prioritize client i/o more effectively.
#2 Updated by David Zafman about 6 years ago
The fix for bug 6291 resolves an issue with recovery using more resources than it should.
A workaround is to disable deep-scrubbing, which runs once a week by default. You can also change the osd_deep_scrub_interval configuration value to adjust this frequency.
To disable deep-scrub:
$ ceph osd set nodeep-scrub
Even with the nodeep-scrub flag set, you can still manually initiate a deep-scrub during off-hours using:
$ ceph osd deep-scrub '*'
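A sketch of the complementary commands, for checking the flag and undoing the workaround later (exact status output varies by Ceph version):

```shell
# Verify the flag is set (it appears among the cluster flags in the status output):
ceph status | grep nodeep-scrub

# Clear the flag to re-enable automatic deep-scrubbing:
ceph osd unset nodeep-scrub
```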
#3 Updated by Mike Dawson about 6 years ago
For a few weeks, I've had to run with noscrub and nodeep-scrub, because both can cause 100% spindle contention resulting in client i/o suffering periodic stalls and/or limited throughput.
I'm hoping for a patch that prioritizes client i/o over maintenance work which would allow me to re-enable scrub and deep-scrub.
I'll test recovery with fix in #6291 as soon as I can.
#4 Updated by Andrei Mikhailovsky about 6 years ago
It seems that I am also affected by this issue, with 3-5 virtual machines crashing on a daily basis and a lot of hung tasks showing up in dmesg output. In particular, servers with java and mysqld processes seem to be more prone to crashes.
Disabling scrubbing is not ideal, as it might cause long-term data corruption to go unnoticed.
Could you please prioritize this problem? It makes maintaining services running on a Ceph cluster a nightmare.
#5 Updated by Wade Rencsok about 6 years ago
It's possible that SQL-type workloads and Ceph have an issue. I am going to investigate that on my own pre-production clusters, where our SQL-type customer workloads suffer significantly on the performance side compared to the stellar performance of other workloads. My suspicion is that, depending on the database size, locking a table can lock far more drives (PGs/chunks) than anticipated.
#6 Updated by Andrei Mikhailovsky about 6 years ago
I think this issue is not related to SQL locking, at least not in my case. I've done as Mike suggested and disabled scrubbing and deep-scrubbing. So far, so good: the crashing servers survived overnight without any crashes or hung tasks. Let's keep it that way.
Regarding the poor SQL performance: I've noticed that my small Ceph cluster has a bunch of issues handling a large number of random 4k reads/writes. During my benchmarking tests (tried with iozone, fio, and simple dd with iflag/oflag=direct) I am seeing extremely poor performance compared to the NFS filesystem I previously had on the same hardware. While working on a small file (2-6GB in size, which fits in memory on the storage servers) I was getting 60-70k 4k random read IOPS with the above benchmarks on NFS. On Ceph I can barely get past 2-2.5k, which is 25-30 times slower. Both NFS and Ceph are serving that data from RAM, as I do not see any disk activity at all during the tests. So I had to conclude that, in its current state, Ceph is not designed for working with databases or reading/writing a large number of very small files. Perhaps a new ticket should be opened for this, as it doesn't seem related to the current bug.
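For reference, a fio invocation approximating the 4k random read test described above might look like this (the file path, size, runtime, and queue depth are assumptions, not the exact parameters used):

```shell
# Hypothetical 4k random read benchmark with direct I/O, similar to the tests above.
fio --name=randread --filename=/mnt/test/fio.dat --size=2G \
    --rw=randread --bs=4k --direct=1 --ioengine=libaio \
    --iodepth=32 --runtime=60 --time_based --group_reporting
```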
#8 Updated by Andrei Mikhailovsky almost 6 years ago
Wade, could you please share with us what kernel tuning you've done? I would like to try it as I am still having some issues with the cluster after disabling scrubbing and deep scrubbing.
By the way, have you just disabled deep scrubbing or normal scrubbing as well?
#12 Updated by Olivier Bonvalet over 5 years ago
Hi, I see that a patch for this is present:
- in dumpling (0.67.9) :
osd: allow snap trim throttling with simple delay (#6278, Sage Weil)
- and firefly (0.80.1) :
osd: add simple throttle for snap trimming (Sage Weil)
I upgraded the cluster to 0.67.9 yesterday (from 0.67.8), but don't see any difference in the I/O monitoring. How can we enable/tune this "simple delay"?
#13 Updated by Olivier Bonvalet over 5 years ago
Ok, I found the commit : https://github.com/ceph/ceph/commit/4e5e41deeaf91c885773d90e6f94da60f6d4efd3
We have to set "osd_snap_trim_sleep", which defaults to 0 (no change in behavior). I will try to tune that, thanks a lot!
#15 Updated by Sage Weil over 5 years ago
Olivier Bonvalet wrote:
Sorry for the noise; one clarification, please: if I want to throttle to N IOs per OSD, is the correct formula 1 * osd_disk_threads / N?
osd_snap_trim_sleep is actually just a time value, in seconds, that the OSD will sleep before submitting the next snap trim transaction. It's a very coarse knob, but it does do the trick to reduce the snap trim load. I would start with something like .005 and go up or down from there. You can inject this value into running ceph-osd daemons to modify behavior on the fly (ceph daemon osd.NNN config set osd_snap_trim_sleep .005)
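A sketch of applying the setting both persistently and at runtime (osd.0 and the 0.005 value are illustrative placeholders):

```shell
# Persist in ceph.conf so the value survives daemon restarts:
#   [osd]
#   osd snap trim sleep = 0.005

# Apply to one running OSD via its local admin socket:
ceph daemon osd.0 config set osd_snap_trim_sleep 0.005

# Or push to every OSD from a node with cluster access:
ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.005'
```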