Bug #17213

As the cluster is filling up, write performance decreases

Added by Anonymous over 7 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Our setup consists of 5 nodes (HP Apollo 4510) with:
  • 2 x 16-core CPU sockets and more than 256 GB of RAM per node.
  • 12 x 4 TB 7200 rpm SATA disks per node for Ceph.
  • Reproduced with the RHCS 2.0, Jewel, and Infernalis releases.
  • Every node hosts 1 monitor and 12 OSDs.
  • Journals are co-located with the data on each OSD disk.
  • 2 bonded 10 Gbps NICs for both the Ceph private and public networks.
  • XFS for the filestore back-end.
  • Running RHEL 7.2.
  • The pool under test uses the EC ISA plugin with a k=4, m=1 profile (see the sketch just below this list).
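
A minimal sketch, assuming the standard 'ceph osd erasure-code-profile' commands on Jewel, of how such an ISA k=4/m=1 profile is typically defined; the profile name is a placeholder, not taken from this report:

# define an erasure-code profile using the ISA plugin, 4 data chunks and 1 coding chunk
ceph osd erasure-code-profile set isa_k4m1 plugin=isa k=4 m=1
# inspect the resulting profile
ceph osd erasure-code-profile get isa_k4m1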

We are currently running 30-minute rados bench tests to measure the write speed of our Ceph cluster on RHCS 2.0, Jewel, and Infernalis.

When we do so, we see a write performance degradation as the cluster fills up:
  • While the cluster is below 7% capacity the throughput is 13 Gbps per node, but at 28% capacity it is below 7 Gbps.
  • Write performance drops quickly until disk usage reaches 20%.
  • It then decreases more slowly until disk usage reaches 40%,
  • And we eventually settle at a stable write speed of 5.5 Gbps once disk usage is above 40%.

"iostat" shows that when the pool is empty, IO workload is driven by writes. However, as soon as the pool occupation increases (1 hour later, at constant write rate), more and more reads start to happen. Reads start, when the pool is empty, at a rate of 1 rd/s per HDD (vs. 150-200 write/s) , and then evolve until reaching 50-100 rd/s, and counting half of the total iops.

While there are several points to improve in our setup (SSDs instead of HDDs, journals on SSDs), we are mostly concerned that the actual performance varies depending on cluster occupancy.

In order to avoid issues with directory merges, we already set up our pool with an expected number of objects of 1000000000 and a negative filestore_merge_threshold, to allow pre-allocation of the directory structure, but this did not help either (a sketch of this kind of setup follows).
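
A minimal sketch, assuming the documented Jewel 'ceph osd pool create ... [expected_num_objects]' syntax for erasure-coded pools; the PG counts, profile/ruleset names and merge threshold are placeholders, not the reporter's actual values (those are in the attached ceph.conf):

# ceph.conf, [osd] section (illustrative values):
#   filestore merge threshold = -10    # negative: never merge, allow pre-splitting at pool creation
#   filestore split multiple  = 2

# create the EC pool with an expected object count so the directory tree is split up front
ceph osd pool create test_pool 2048 2048 erasure isa_k4m1 isa_k4m1_ruleset 1000000000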


Files

ceph.conf (2.18 KB) - Anonymous, 09/05/2016 06:24 PM
Actions #1

Updated by huang jun over 7 years ago

Can you paste your rados bench command? Do you have any overwrite operations?
It's weird that there are so many reads once the OSD used capacity grows.
Are you testing the raw EC pool, with no cache tier?

You have 256 GB of RAM; maybe the 13 Gbps is mainly due to it.

Actions #2

Updated by Anonymous over 7 years ago

This is the script we are using:

#!/bin/bash

pool_name=test_pool
time=1800          # Seconds to run
ios=512            # Concurrent operations
type="write"       # Type of test (read/write)
sleeptime=120      # Seconds to wait between tests

for test in {000..999}
do
  for i in 03500000          # Object size in bytes
  do
    resultfile="bench_oa_${i}_${time}_${type}_${ios}_${test}"
    echo "Testing filesize $i bytes, results in file $resultfile"
    rados bench -b $i -p $pool_name $time $type --no-cleanup --concurrent-ios=$ios --run-name $resultfile >> $resultfile 2>&1
    sleep $sleeptime
  done
done

No overwrites, no cache tier, no reads at all. Scrub and deep-scrub are disabled.
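
For completeness, a minimal sketch of how scrub and deep-scrub are typically disabled cluster-wide; the ticket does not state which exact method was used:

# set the cluster-wide flags that prevent (deep-)scrubbing during the benchmark
ceph osd set noscrub
ceph osd set nodeep-scrub
# re-enable afterwards with: ceph osd unset noscrub && ceph osd unset nodeep-scrub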

Actions #3

Updated by Anonymous over 7 years ago

Any news on this issue? Have you had the chance to check this? Thanks!

Actions #4

Updated by Samuel Just over 7 years ago

  • Status changed from New to Closed

On the face of it, it's not really surprising that an empty XFS filesystem is faster than a full one. With filestore, we split the directories containing the files backing the objects as they fill up. I expect that the reads are simply due to having to do some directory lookups before you can do the write. You might be able to improve matters by fiddling with the filestore split values or by changing the Linux caching behavior to favor directories.
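
A hedged sketch of the kind of tuning this refers to; the specific values are illustrative, not recommendations from this ticket:

# filestore split/merge tunables (ceph.conf, [osd] section). Per the filestore docs,
# a directory is split once it holds about
# filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects.
#   filestore split multiple = 8
#   filestore merge threshold = -40

# bias the Linux VFS toward retaining dentry/inode caches (default is 100; lower favors directories)
sysctl -w vm.vfs_cache_pressure=10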

Bluestore should significantly improve this behavior (and eliminate the double write!)

Without more information, this appears to be expected behavior for filestore, so I'm closing the bug.

Actions #5

Updated by Anonymous over 7 years ago

The whole directory tree structure should have been pre-allocated at pool creation, since we used the 'ceph osd pool create' syntax (and the corresponding filestore split/merge settings) that allows this pre-allocation to happen (http://docs.ceph.com/docs/jewel/rados/operations/pools/#create-a-pool). No new directories would be expected to be created as long as the number of objects stays under the expected count set at pool creation.

Regarding your comment on XFS behaviour, is that something that only happens with XFS (vs. btrfs/ext4)? If so, could you please point me to the bug description for this issue (I haven't been able to find it as a well-known XFS issue)?

Thanks!
