Bug #57941

closed

Severe performance drop after writing 100 GB of data to RBD volume, dependent on RAM on client, with 100% reproducible test case

Added by Guillaume Pothier over 1 year ago. Updated over 1 year ago.

Status:
Rejected
Priority:
Normal
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Write throughput to a mapped RBD volume drops dramatically after the volume reaches a certain usage size. The amount of data that can be written before the drop, as well as the write throughput after the drop, both depend on the RAM of the Ceph client host.

Before the drop, the throughput is stable at around 100-130 MB/s. Here are the drop characteristics observed so far, depending on client RAM size (the server has 16 GB):

512 MB RAM: drop occurs at 100 GB, throughput drops to 3.8 MB/s
1 GB RAM: drop occurs at 180 GB, throughput drops to 7.5 MB/s
2 GB RAM: drop occurs at 300 GB, throughput drops to 15 MB/s

It seems strange that the amount of RAM on the client affects the throughput in this way. If there is a minimum RAM requirement for Ceph clients, it does not seem to be documented (this page covers the RAM requirements for the servers: https://docs.ceph.com/en/quincy/start/hardware-recommendations/#ram)

This is the setup:
- Server: Proxmox VE on a t3a.xlarge AWS EC2 instance (16 GB RAM), with a 1 TB EBS volume for Ceph; Ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
- Client: Debian Bullseye on different EC2 instance types with varying RAM sizes; Ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)

This repository contains an easy-to-run test case:
https://gitlab.com/guillaume.pothier/ceph-perf-testcase
(it has one command to create the infrastructure in AWS with Terraform, one command to set up Proxmox and Ceph, and one command to run the test case)

These are the steps of the test case (a rough command-line sketch follows below):
- Create a thin-provisioned RBD volume of 900 GB
- Map the volume
- Dump up to 800 GB of data from /dev/urandom into the volume
- Observe a drop in write throughput after a certain amount of data is transferred
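For reference, this is roughly what those steps boil down to on the client; the pool and image names are illustrative, not necessarily the ones used by the repository:

# Create a 900 GB image (RBD images are thin provisioned by default)
rbd create rbd/perftest --size 900G
# Map the image on the client; the command prints the device name, e.g. /dev/rbd0
sudo rbd map rbd/perftest
# Stream pseudo-random data into the device and watch the throughput
sudo dd if=/dev/urandom of=/dev/rbd0 bs=4M count=204800 status=progress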

Also, read throughput suffers in a similar way, but I am still trying to create a 100% reproducible test case.

This thread might be related: https://forum.proxmox.com/threads/ceph-rbd-slow-down-read-write.55055/

Actions #1

Updated by Christopher Hoffman over 1 year ago

  • Assignee set to Christopher Hoffman

I'm not familiar with PVE and how it sets up Ceph. I took a look at your test case and it appears to be a sequential workload. With the information I have, everything appears to be working correctly. There are multiple factors at play before data reaches persistent storage.

Here's my theory:
There's caching that happens on the client (https://docs.ceph.com/en/latest/rbd/rbd-config-ref/#cache-settings). When the cache reaches a certain level of fullness, it sends data to the Ceph servers. As stated in the link, it coalesces contiguous requests for better throughput. As you increase the available cache size or RAM, the writes sent to the Ceph servers become bigger and more efficient.

On the server side, Ceph collects these writes and may group them together before writing to the OSDs/persistent storage. Over time, server-side RAM will fill up as the data it receives cannot be written to persistent storage fast enough. At that point there will be a slowdown, since there is cache pressure and no additional place for the client I/O to go.
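For reference, the client-side cache referred to above is controlled by a handful of librbd options; a minimal ceph.conf sketch with the documented defaults is shown below. Note that a kernel-mapped (krbd) device goes through the page cache rather than the librbd cache, so these values are illustrative only:

[client]
rbd cache = true                    # enable the librbd write-back cache
rbd cache size = 33554432           # 32 MiB per-image cache (default)
rbd cache max dirty = 25165824      # 24 MiB of dirty data before writes block on writeback (default)
rbd cache target dirty = 16777216   # 16 MiB dirty threshold at which writeback begins (default)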

Actions #2

Updated by Christopher Hoffman over 1 year ago

  • Status changed from New to In Progress
Actions #3

Updated by Guillaume Pothier over 1 year ago

Thanks for looking into this, Christopher. You are right, this is a 100% sequential workload, just filling a volume with random data. I don't think caching has anything to do with the throughput drop, or at least not caching of the data that is being written to the volume: the amount of data we write is orders of magnitude larger than the amount of RAM on either the client or the server, so the cache should be full after only a few seconds. However, we observe a high throughput for dozens of minutes or even hours before the drop occurs. And that high throughput (100-130 MB/s) is actually more or less in line with what can be expected of an SSD drive; it is probably even higher during the first few seconds, until the caches are full.

Without knowing anything about the internals of Ceph or RBD, I was thinking that maybe the Ceph client needs to maintain a kind of mapping of all the blocks that constitute the volume. If so, the bigger the volume becomes, the larger that mapping needs to be, which would explain why the drop occurs once the volume reaches a certain size that depends on the available RAM on the client (assuming that if the mapping cannot fit in client RAM, it has to be retrieved from the server, killing the throughput). Of course that is just wild speculation on my part.
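For what it's worth, the image-to-object mapping in RBD is computed from the image's object size and object-name prefix rather than stored as a table, which can be seen with something like the following (pool and image names illustrative):

# Shows the image size, object size (e.g. 4 MiB) and the object-name prefix for this image
rbd info rbd/perftest
# Backing objects only appear once the corresponding region of the image has been written to
rados -p rbd ls | grep rbd_data | head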

Actions #4

Updated by Christopher Hoffman over 1 year ago

In order to rule out client-side caching as a factor, can you run a few tests?

Run tests without client side caching:
1. Write to the device using the kernel client (krbd) with direct=1:
fio --name=write --filename=/dev/rbd0 --ioengine=libaio --iodepth=1 --rw=write --bs=1M --direct=1 --size=800G --group_reporting

2. Write to the image in user space using librbd with direct=1:
fio --name=write --ioengine=rbd --clientname=admin --pool=rbd --rbdname=fio_test --invalidate=0 --rw=write --bs=1M --iodepth=1 --direct=1 --size=800G --group_reporting

Run tests with client-side caching:
3. Run tests 1 and 2 with direct=0 (a sketch of the kernel-device variant follows below).
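For the kernel-device case that would be something along these lines (same as test 1, but buffered):

fio --name=write --filename=/dev/rbd0 --ioengine=libaio --iodepth=1 --rw=write --bs=1M --direct=0 --size=800G --group_reporting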

Actions #5

Updated by Guillaume Pothier over 1 year ago

:facepalm: the issue has nothing to do with Ceph and everything to do with smaller EC2 instances having burstable network bandwidth, proportional to the number of vCPUs, which is incidentally also proportional to the available RAM. I was not aware of that. Such instances are allowed to use the maximum bandwidth for some time, after which they are throttled to a much lower bandwidth, hence the drop in throughput I was observing. Sorry for the noise.

For reference, this is documented here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/general-purpose-instances.html#general-purpose-network-performance
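For anyone who runs into the same symptom: on instances that use the ENA network driver, the allowance-exceeded counters reported by ethtool can confirm this kind of throttling (the interface name may differ, e.g. ens5 instead of eth0):

# Non-zero counters indicate the instance has exceeded its network allowances and is being throttled
ethtool -S eth0 | grep -E 'bw_(in|out)_allowance_exceeded|pps_allowance_exceeded'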

Actions #6

Updated by Ilya Dryomov over 1 year ago

  • Status changed from In Progress to Rejected

Thanks for following up, Guillaume!
