Bug #57941


Severe performance drop after writing 100 GB of data to RBD volume, dependent on RAM on client, with 100% reproducible test case

Added by Guillaume Pothier over 1 year ago. Updated over 1 year ago.

Status:
Rejected
Priority:
Normal
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Write throughput to a mapped RBD volume drops dramatically after the volume reaches a certain usage size. The amount of data that can be written before the drop, as well as the write throughput after the drop, both depend on the RAM of the Ceph client host.

Before the drop, throughput is stable at around 100-130 MB/s. Here are the drop characteristics observed so far, depending on client RAM size (the server has 16 GB):

- 512 MB RAM: drop occurs at 100 GB, throughput drops to 3.8 MB/s
- 1 GB RAM: drop occurs at 180 GB, throughput drops to 7.5 MB/s
- 2 GB RAM: drop occurs at 300 GB, throughput drops to 15 MB/s
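A drop like this can also be confirmed independently of the test-case scripts by watching device-level throughput on the client while the write is running. This is only a minimal sketch; the device name rbd0 is an assumption and may differ on your system:

    # Extended device stats in MB/s every 5 seconds, limited to the mapped RBD device
    iostat -xm 5 rbd0

    # Since the drop point scales with client RAM, the page-cache / writeback state
    # on the client may also be worth watching (purely a diagnostic suggestion)
    grep -E 'Dirty|Writeback' /proc/meminfo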

It seems strange that the amount of RAM on the client affects the throughput in this way. If there is a minimum RAM requirement for Ceph clients, it does not seem to be documented (this page documents the RAM requirements for the servers: https://docs.ceph.com/en/quincy/start/hardware-recommendations/#ram).

This is the setup:
- Server: Proxmox VE on a t3a.xlarge AWS EC2 instance (16 GB RAM), with a 1 TB EBS volume for Ceph; Ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
- Client: Debian Bullseye on different EC2 instance types with varying RAM sizes; Ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)

This repository contains an easy-to-run test case:
https://gitlab.com/guillaume.pothier/ceph-perf-testcase
(it has one command to create the infrastructure in AWS with Terraform, one command to set up Proxmox and Ceph, and one command to run the test case)

These are the steps of the test case (a rough command sketch follows the list):
- Create a thin provisioned RBD volume of 900 GB
- Map the volume
- Dump up to 800 GB of data from /dev/urandom into the volume
- Observe a drop in write throughput after a certain amount of data is transferred
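As a rough illustration of those steps (a minimal sketch only; the exact commands are in the repository scripts, and the pool name "rbd", the image name "perftest", and the device path below are assumptions):

    # Create a 900 GB image (RBD images are thin provisioned by default)
    rbd create rbd/perftest --size 900G

    # Map it on the client; prints the device path, e.g. /dev/rbd0
    rbd map rbd/perftest

    # Stream roughly 800 GiB of pseudo-random data into the device,
    # reporting throughput as it goes
    dd if=/dev/urandom of=/dev/rbd0 bs=4M count=204800 status=progress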

Also, read throughput suffers in a similar way, but I am still trying to create a 100% reproducible test case.

This thread might be related: https://forum.proxmox.com/threads/ceph-rbd-slow-down-read-write.55055/
