Bug #57898

ceph client extremely slow with kernel versions between 5.15 and 6.0

Added by Minjong Kim over 1 year ago. Updated over 1 year ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
fs/ceph
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
performance
Crash signature (v1):
Crash signature (v2):

Description

Hello. I am very new to Ceph, so thank you for taking that into consideration while reading.

I recently changed the kernel in order to explore Ceph's client code. At first I loaded a newly built kernel into a VM for isolation. (The same problem later occurred with Ubuntu's official kernel, so I will skip my rudimentary kernel build process.) However, compared to running on the host (in Docker), there was a huge performance drop, and I initially suspected my rudimentary VM network setup.

So I gave up on the VM and replaced the host kernel of another machine with my custom kernel, but I got the same performance degradation. Still suspicious of my own build, I checked whether the same phenomenon occurs when I build the 5.15 kernel (the same version as the Ubuntu 22.04 kernel) in the same way; with 5.15 everything worked fine. To rule out my build process entirely, I then downloaded and installed Ubuntu's mainline kernel (https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.0/), and it was equally slow.
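
For reference, installing the mainline kernel simply meant downloading the amd64 .deb packages from that page and installing them; a rough sketch (exact package file names omitted):

    # Sketch only: .deb packages downloaded from the mainline page above.
    sudo dpkg -i linux-image-*_amd64.deb linux-modules-*_amd64.deb
    sudo reboot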

To present the performance drop quantitatively, I narrowed it down to a single use case. An ImageNet dataset is already stored on the filesystem (1000 directories containing 1.2 million files in total, each file around 100 KB on average). Below I present the average of the logged ceph_mds_request measurements while deleting it (rm /mnt/ceph/imagenet -r).

Kernel 6.0
mean of ceph_mds_requests: 6053
Kernel 5.15
mean of ceph_mds_requests: 131 (not a typo)
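
For anyone trying to reproduce this, the test boils down to a bulk delete on a kernel CephFS mount; a rough sketch follows (the monitor address, secret, and debugfs paths are placeholders and vary by setup and kernel version):

    # Kernel mount; mount source and secret are placeholders.
    mount -t ceph mon1:6789:/ /mnt/ceph -o name=admin,secret=<key>

    # Bulk delete that is slow on 6.0 but fast on 5.15.
    time rm -r /mnt/ceph/imagenet

    # In-flight MDS requests of the kernel client (debugfs layout differs by kernel).
    cat /sys/kernel/debug/ceph/*/mdsc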

I also ran the same test mounted via ceph-fuse, to follow the standard troubleshooting steps suggested in various issues. This led to further confusion: performance is restored by performing the deletion in the following order.

1. Kernel-mount CephFS (mount -t ceph ... /mnt/ceph)
2. FUSE-mount CephFS (ceph-fuse ... /mnt/ceph-fuse)
3. Delete via the FUSE mount (rm /mnt/ceph-fuse/imagenet -r)
4. Interrupt it after a while
5. Delete via the kernel mount (rm /mnt/ceph/imagenet -r)

This was reproduced across several repeated tests using a Docker container.
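
In command form, the sequence above looks roughly like this (mount sources and credentials are placeholders):

    mount -t ceph mon1:6789:/ /mnt/ceph -o name=admin,secret=<key>   # 1. kernel mount
    ceph-fuse -n client.admin /mnt/ceph-fuse                         # 2. fuse mount
    rm -r /mnt/ceph-fuse/imagenet &                                  # 3. delete via fuse mount
    RM_PID=$!
    sleep 30 && kill $RM_PID                                         # 4. interrupt after a while
    rm -r /mnt/ceph/imagenet                                         # 5. delete via kernel mount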

To summarize, the main points of this issue are:

- Performance degradation occurs for kernel versions 5.15 < x <= 6.0 (roughly a 40x slowdown).
- This was tested several times on two hosts.
- The ceph-fuse-related procedure was also repeated several times in a container.

I'm new to both Ceph and the kernel, so this may be a mistake of mine that I simply cannot see. However, I tried my best to control the variables.

Here is the experimental environment.

- 3 nodes and 4 OSDs per node
- 1 MDS globally
- Each OSD's block and metadata devices are allocated on ramdisks
- No special options except bluefs_buffered_io = false (see the sketch after this list)
- Replication is not configured (for testing only)
- ceph version v17.2.0
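
For completeness, the non-default settings above correspond to commands along these lines (a sketch; the pool names are placeholders for whatever the CephFS data/metadata pools are called):

    # Disable buffered I/O in BlueFS (the only non-default option used).
    ceph config set osd bluefs_buffered_io false

    # Single-copy pools, for testing only; requires mon_allow_pool_size_one.
    ceph config set global mon_allow_pool_size_one true
    ceph osd pool set cephfs_data size 1 --yes-i-really-mean-it
    ceph osd pool set cephfs_metadata size 1 --yes-i-really-mean-it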
