Bug #5692 (closed)

RADOS performance regression in 0.65

Added by Mark Nelson over 10 years ago. Updated over 10 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After some narrowing down, it looks like we have what appears to be a pretty serious write performance regression starting in 0.65. It mostly affects XFS and EXT4. RADOS bench performance is impacted across all IO sizes, though it is worse at smaller IO sizes: small writes are about 3x slower, while large writes are about 1/3 slower.

According to Sage's release notes, here's what we changed in the OSD in 0.65:

- osd: do not use fadvise(DONTNEED) on XFS (data corruption on power cycle)
- osd: recovery and peering performance improvements
- osd: new writeback throttling (for less bursty write performance) (Sam Just)
- osd: ping/heartbeat on public and private interfaces
- osd: avoid osd flapping from asymmetric network failure
- osd: re-use partially deleted PG contents when present (Sam Just)
- osd: break blacklisted client watches (David Zafman)

See more at: http://ceph.com/releases/v0-65-released/
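For context, a hypothetical reproduction sketch (pool name, runtime, and IO sizes are assumptions, not taken from the report): sweep rados bench over a few write sizes against a test pool on 0.64 and again on 0.65, then compare the reported bandwidth.

# sweep write sizes (4 KB, 128 KB, 4 MB), 300 s per run, 32 ops in flight
for bs in 4096 131072 4194304; do
    rados -p testpool bench 300 write -t 32 -b $bs
done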

Actions #1

Updated by Sage Weil over 10 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by Greg Farnum over 10 years ago

I'm placing my bets on this being the writeback throttle. Tunables:

+OPTION(filestore_wbthrottle_btrfs_bytes_start_flusher, OPT_U64, 10<<20)
+OPTION(filestore_wbthrottle_btrfs_bytes_hard_limit, OPT_U64, 100<<20)
+OPTION(filestore_wbthrottle_btrfs_ios_start_flusher, OPT_U64, 100)
+OPTION(filestore_wbthrottle_btrfs_ios_hard_limit, OPT_U64, 1000)
+OPTION(filestore_wbthrottle_btrfs_inodes_start_flusher, OPT_U64, 100)
+OPTION(filestore_wbthrottle_btrfs_inodes_hard_limit, OPT_U64, 1000)
+OPTION(filestore_wbthrottle_xfs_bytes_start_flusher, OPT_U64, 10<<20)
+OPTION(filestore_wbthrottle_xfs_bytes_hard_limit, OPT_U64, 100<<20)
+OPTION(filestore_wbthrottle_xfs_ios_start_flusher, OPT_U64, 10)
+OPTION(filestore_wbthrottle_xfs_ios_hard_limit, OPT_U64, 100)
+OPTION(filestore_wbthrottle_xfs_inodes_start_flusher, OPT_U64, 10)
+OPTION(filestore_wbthrottle_xfs_inodes_hard_limit, OPT_U64, 100)

Is it too much trouble to run some tests with these way up and see how it changes things? If they're the issue, then either things aren't behaving the way Sam thinks or our testing has been pure-journal (no backing disk) and our performance isn't where we think it is. :/
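For reference, one hedged way to bump these at runtime for a quick test; the values below are arbitrary ~10x multiples of the XFS defaults, not recommendations (note that 10<<20 is 10 MiB and 100<<20 is 100 MiB). Loop over individual OSD ids if the wildcard form isn't available on this build.

# raise the XFS wbthrottle limits roughly 10x on all OSDs
ceph tell osd.\* injectargs '--filestore_wbthrottle_xfs_bytes_start_flusher 104857600 --filestore_wbthrottle_xfs_bytes_hard_limit 1048576000'
ceph tell osd.\* injectargs '--filestore_wbthrottle_xfs_ios_start_flusher 100 --filestore_wbthrottle_xfs_ios_hard_limit 1000'
ceph tell osd.\* injectargs '--filestore_wbthrottle_xfs_inodes_start_flusher 100 --filestore_wbthrottle_xfs_inodes_hard_limit 1000'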

Actions #3

Updated by Mark Nelson over 10 years ago

Sage pushed wip-5692, which changes the fsyncs to fdatasyncs in the wb throttle. This doesn't appear to have helped though.

The tests we are running are 4 concurrent rados bench instances with 128 total IOs in flight for 5 minutes. Journal partitions are 10GB each (yes, quite big) on 24 OSDs, so 240GB total. Write throughput with 4MB IOs in 0.64 is consistently pegged at 2GB/s during the whole test. In 0.65, performance quickly (within 10-20 seconds) drops from 2GB/s to around 1.4GB/s according to rados bench.
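A rough sketch of the client side of that setup (pool name is an assumption): four rados bench writers against the same pool, 32 ops in flight each (128 total), 4MB objects, 5 minutes.

for i in 1 2 3 4; do
    rados -p testpool bench 300 write -t 32 -b 4194304 &
done
wait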

To see if the journals could be hiding underlying data store throughput, I went back and looked at the collectl data and used an awk script to sum the total write throughput to all 24 spinning disks every second. In 0.64, aggregate throughput to the disks was nearly always between 2.0GB/s and 2.1GB/s with only a couple of brief drops. Throughput never fell below 1.7GB/s except for the last two seconds of the test as writes were completing. With wip-5692, aggregate throughput to the disks ranged from 635MB/s to 1.85GB/s with significant variability across the duration of the test.
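A hypothetical sketch of that per-second summation, assuming the collectl disk data has first been reduced to lines of "<timestamp> <device> <write_KBps>"; collectl's actual column layout depends on version and flags, so the field positions and the sdX device-name filter below are assumptions.

# disk-writes.txt: one line per disk per second, "<timestamp> <device> <write_KBps>"
awk '$2 ~ /^sd/ { wr[$1] += $3 }
     END { for (t in wr) printf "%s %.2f GB/s\n", t, wr[t] / 1048576 }' disk-writes.txt | sort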

Actions #4

Updated by Mark Nelson over 10 years ago

I should mention journals are on separate SSD drives (8 SSDs total, 3 journals per). Each SSD is capable of about 450MB/s sequential writes.

Actions #5

Updated by Greg Farnum over 10 years ago

That probably means the limits are set too low, if the backing disks are providing about the same throughput as the measured performance (yay!).
If we're lucky then it's just the bytes hard limits, which I notice are at 100MB (that seems awfully low, Sam!) and which are much less likely to cause writeback timeout issues than racking up random IO is.

Actions #6

Updated by Mark Nelson over 10 years ago

I suspect it's more than just the byte limit. While large writes are degraded, small writes appear to be hurt even more.

Actions #7

Updated by Sage Weil over 10 years ago

Pushed wip-before, which is just prior to the wbthrottle merge, to confirm this is the source of the trouble.

Actions #8

Updated by Mark Nelson over 10 years ago

Glad you did that, Sage. I only have the 4MB results so far, but it's looking like the performance regression is still present in wip-before.

Actions #9

Updated by Sage Weil over 10 years ago

Hmm, there is the hashpspool option addition... I wouldn't expect that to matter for a rados bench workload, though. Try setting osd_pool_default_flag_hashpspool = false on wip-before?

Not much else looks suspicious...
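A hedged sketch of one way to test that: osd_pool_default_flag_hashpspool is read at pool-creation time (by the mons), so it has to be set, and the mons restarted or injected, before the bench pool is recreated. The config path, pool name, and pg count below are assumptions.

cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
    osd pool default flag hashpspool = false
EOF
# restart or injectargs on the mons, then recreate the bench pool
rados rmpool testpool testpool --yes-i-really-really-mean-it
ceph osd pool create testpool 4096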

Actions #10

Updated by Mark Nelson over 10 years ago

Tweaking osd_pool_default_flag_hashpspool doesn't seem to have had an effect on XFS. Interestingly, it looks like wip-before is helping btrfs somewhat, though btrfs performance is relatively consistent across all of the tests compared to XFS and EXT4.

Actions #11

Updated by Sage Weil over 10 years ago

  • Assignee set to Mark Nelson
Actions #12

Updated by Mark Nelson over 10 years ago

  • Assignee deleted (Mark Nelson)

<hat eating commencing>

Looks like I jumped the gun last night and must have tried to install the debs for wip-before before gitbuilder was finished. The hashes were wrong and I was still on wip-5692. A retest of wip-before is showing high performance, so we're back to wbthrottle as the main culprit.
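A couple of quick sanity checks for next time (standard commands, nothing cluster-specific): confirm the installed debs and the running binaries actually carry the expected branch sha before benchmarking.

dpkg -l | grep ceph    # versions of the debs actually installed
ceph --version         # version string includes the git sha the build came from
ceph-osd --version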

Actions #13

Updated by Greg Farnum over 10 years ago

I don't remember our small-IO performance numbers — is that disk activity you mentioned above accurate for small IO? If it is then yeah, we need to increase the IO and inode limits too.
I think Sam and I have had disputes about the amount of writeback that is appropriate to leave in-memory in the past, so I'll leave it there for now. ;)

Actions #14

Updated by Mark Nelson over 10 years ago

Ok, after a ton of testing, here are the values on our supermicro node where I stop seeing benefits across the different IO sizes:

"filestore_wbthrottle_xfs_bytes_start_flusher": "41943040",
"filestore_wbthrottle_xfs_bytes_hard_limit": "419430400",
"filestore_wbthrottle_xfs_ios_start_flusher": "500",
"filestore_wbthrottle_xfs_ios_hard_limit": "5000",
"filestore_wbthrottle_xfs_inodes_start_flusher": "500",
"filestore_wbthrottle_xfs_inodes_hard_limit": "5000",
"filestore_wbthrottle_btrfs_bytes_start_flusher": "41943040",
"filestore_wbthrottle_btrfs_bytes_hard_limit": "419430400",
"filestore_wbthrottle_btrfs_ios_start_flusher": "500",
"filestore_wbthrottle_btrfs_ios_hard_limit": "5000",
"filestore_wbthrottle_btrfs_inodes_start_flusher": "500",
"filestore_wbthrottle_btrfs_inodes_hard_limit": "5000",

XFS performance is still probably about 5%-8% slower than it was previously, but EXT4 and BTRFS still seem to be maxing (or close to maxing) out the bonded 10GbE link. If we want to relax these a bit, we don't lose too much performance by reducing the inode and IO limits by 20%.
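For completeness, a sketch of persisting the XFS subset of those values in ceph.conf (the btrfs options follow the same pattern); this just mirrors the dump above and assumes the stock /etc/ceph/ceph.conf path.

# append on each OSD host, then restart the OSDs
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
    filestore wbthrottle xfs bytes start flusher  = 41943040
    filestore wbthrottle xfs bytes hard limit     = 419430400
    filestore wbthrottle xfs ios start flusher    = 500
    filestore wbthrottle xfs ios hard limit       = 5000
    filestore wbthrottle xfs inodes start flusher = 500
    filestore wbthrottle xfs inodes hard limit    = 5000
EOF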

Actions #15

Updated by Greg Farnum over 10 years ago

Yikes. Those limits take an awful lot of writeback time in order to flush to disk (5000 IOs/inodes is going to be ~50 seconds, right?). I'm not sure we want to go that far...
Were the increases in the start_flusher values necessary or just something you did? I'm somewhat surprised by them.

Actions #16

Updated by Mark Nelson over 10 years ago

Hi Greg,

Here's a run down of the XFS tests:

https://docs.google.com/a/inktank.com/spreadsheet/ccc?key=0AnmmfpoQ1_94dGhneXNKWmV2QWNsTXFTRDI1YlZqRnc#gid=0

It may be that the start_flusher limits are what really matter. I definitely seem to see higher numbers correlated with higher values, though.

Mark

Actions #17

Updated by Samuel Just over 10 years ago

  • Status changed from New to Resolved