Bug #57898
Ceph client extremely slow on kernel versions between 5.15 and 6.0
Description
Hello. I am very new to Ceph; thank you for taking that into consideration as you read.
I recently switched kernels to try out newer Ceph client code. At first, for isolation, I installed a freshly built kernel in a VM. (The same problem later occurred with Ubuntu's official kernel, so I will skip over my rudimentary kernel build process.) Compared with running on the host (Docker), there was a huge performance drop, which I initially suspected was due to my rudimentary VM network setup.
So I gave up on the VM and replaced the kernel on the host of another machine with my custom kernel, but I got the same performance degradation. Still doubting myself, I checked whether the same thing happens when I build the 5.15 kernel (the same version as the Ubuntu 22.04 kernel) in the same way; with kernel 5.15 it worked fine. Since my suspicion was still not resolved, I downloaded and installed Ubuntu's mainline kernel (https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.0/), but it was equally slow.
To present the performance drop quantitatively, I will narrow it down to one use case. I have an already-saved ImageNet dataset (1000 directories with a total of 1.2 million files, each file about 100KB on average). Below is the average of the logged ceph_mds_request measurements while deleting it (rm /mnt/ceph/imagenet -r).
Kernel 6.0
mean of ceph_mds_requests: 6053
Kernel 5.15
mean of ceph_mds_requests: 131 (not a typo)
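For anyone wanting to reproduce a comparison like the one above, here is a minimal sketch of a userspace timing harness. This is illustrative only, not the tooling behind the numbers above: it times the recursive delete itself rather than reading the kernel client's internal counters, and the mount path is whatever you pass in.

```python
import os
import shutil
import sys
import time

def delete_rate(path):
    """Time a recursive delete (like `rm -r`) and report files removed per second."""
    # Count regular files first so we can compute a rate afterwards.
    total = sum(len(files) for _, _, files in os.walk(path))
    start = time.monotonic()
    shutil.rmtree(path)
    elapsed = time.monotonic() - start
    return total, elapsed

if __name__ == "__main__":
    # e.g. python3 delete_bench.py /mnt/ceph/imagenet  (path is an example)
    n, secs = delete_rate(sys.argv[1])
    print(f"removed {n} files in {secs:.1f}s ({n / max(secs, 1e-9):.0f} files/s)")
```

Running this against the same tree on a 5.15 kernel and a 6.0 kernel would give a directly comparable files-per-second figure, independent of the MDS metric used above.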
I ran the same test mounted via ceph-fuse, to follow the standard troubleshooting steps suggested in various issues. But this led to further confusion for me: performance is restored when the deletion is performed in the following order.
1. kernel mount ceph (mount -t ceph ... /mnt/ceph)
2. fuse mount ceph (ceph-fuse ... /mnt/ceph-fuse)
3. delete via fuse mount (rm /mnt/ceph-fuse/imagenet -r)
4. Interrupt it after a while
5. delete via kernel mount (rm /mnt/ceph/imagenet -r)
This was reproduced across several repeated tests in a Docker container.
To summarize, the points of my issue are:
- Performance degradation (roughly 40x) occurs for kernel versions 5.15 < x <= 6.0.
- This was tested several times on two hosts.
- The ceph-fuse procedure was also repeated several times in a container.
I'm new to both Ceph and the kernel, so this may be a mistake I simply cannot see. However, I tried my best to control the variables.
Here is the experimental environment.
- 3 nodes with 4 OSDs per node
- 1 MDS globally
- OSDs place both block and metadata devices on ramdisks
- No special options except bluefs_buffered_io = False
- Replication is disabled (for testing)
- Ceph version v17.2.0
History
#1 Updated by Minjong Kim over 1 year ago
Even with the ceph-fuse method described in the body, it becomes slow again over time.
#2 Updated by Minjong Kim over 1 year ago
Hello again
I don't know if anyone is interested, but when I tested with prebuilt kernels (https://kernel.ubuntu.com/~kernel-ppa/mainline/), the slowdown does not occur in v5.19.17 but does occur in v6.0.0-rc3. If I have time later, I'll build and test commit by commit, but I can't promise anything.
Thanks
#3 Updated by Xiubo Li over 1 year ago
Could you upload your test script?
And you mean you can also reproduce this by using the ceph-fuse mount, right?
#4 Updated by Minjong Kim over 1 year ago
I used the Ceph kernel mount. With the fuse mount it works fine.
The test script is nothing special; I just ran the cp and rm commands on the ImageNet dataset. It's very fast on kernel 5.19.17 and very slow on 6.0.0-rc3.
The ImageNet dataset has 1,200,000 images in 1000 directories, about 127GB in total, but that doesn't seem to matter (it's slow from the start).
It seems it can be reproduced by copying or deleting several files of less than 100KB in one directory. Should I write a simple script?
#5 Updated by Xiubo Li over 1 year ago
Minjong Kim wrote:
I used the Ceph kernel mount. With the fuse mount it works fine.
The test script is nothing special; I just ran the cp and rm commands on the ImageNet dataset. It's very fast on kernel 5.19.17 and very slow on 6.0.0-rc3.
The ImageNet dataset has 1,200,000 images in 1000 directories, about 127GB in total, but that doesn't seem to matter (it's slow from the start).
It seems it can be reproduced by copying or deleting several files of less than 100KB in one directory. Should I write a simple script?
Yeah, please.
BTW, have you ever tested the latest Ceph code from the testing branch of the ceph-client [1] repo?
#6 Updated by Minjong Kim over 1 year ago
Xiubo Li wrote:
Minjong Kim wrote:
I used the Ceph kernel mount. With the fuse mount it works fine.
The test script is nothing special; I just ran the cp and rm commands on the ImageNet dataset. It's very fast on kernel 5.19.17 and very slow on 6.0.0-rc3.
The ImageNet dataset has 1,200,000 images in 1000 directories, about 127GB in total, but that doesn't seem to matter (it's slow from the start).
It seems it can be reproduced by copying or deleting several files of less than 100KB in one directory. Should I write a simple script?
Yeah, please.
BTW, have you ever tested the latest Ceph code from the testing branch of the ceph-client [1] repo?
I've never built and tested the most recent ceph-client repo myself. However, I have tried 6.1.0-rc1 from the Ubuntu mainline kernels, and the performance was the same. As far as I can tell, the 6.1.0-rc1 tag of torvalds/linux and the for-linus branch of ceph/ceph-client are the same version.
#7 Updated by Minjong Kim over 1 year ago
But I haven't checked the testing branch yet. (I'll check.)
#8 Updated by Minjong Kim over 1 year ago
https://gist.github.com/caffeinism/dbfd974374d620911a6c0c3dd1daadfb
I am not good at writing files from a shell script (in my testing it didn't seem to write blocks properly...), so I wrote it as a Python script; sorry. By default, the script creates a folder structure similar to ImageNet. I've run it on 6.0.3 and 5.19.17 and it reproduces the issue similarly (although it runs faster than cp did).
python3 test.py /mnt/cephfs/some-directory
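The actual script is in the gist linked above; as a rough sketch of what such a generator might look like (the directory counts, file names, and sizes here are illustrative defaults, not necessarily the gist's exact values):

```python
import os
import sys

def make_imagenet_like(root, num_dirs=1000, files_per_dir=1200, file_size=100 * 1024):
    """Create num_dirs class directories, each holding files_per_dir small files,
    mimicking the ImageNet layout (1000 classes, ~100KB images)."""
    payload = os.urandom(file_size)  # reuse one buffer; content is irrelevant here
    for d in range(num_dirs):
        class_dir = os.path.join(root, f"class{d:04d}")
        os.makedirs(class_dir, exist_ok=True)
        for f in range(files_per_dir):
            with open(os.path.join(class_dir, f"img{f:05d}.jpg"), "wb") as fh:
                fh.write(payload)

if __name__ == "__main__":
    # e.g. python3 test.py /mnt/cephfs/some-directory
    # Small defaults here so a trial run finishes quickly.
    make_imagenet_like(sys.argv[1], num_dirs=10, files_per_dir=100)
```

Creating and then removing such a tree on the CephFS mount exercises the same metadata-heavy pattern (many small creates/unlinks) that shows the regression.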
#9 Updated by Minjong Kim over 1 year ago
Xiubo Li wrote:
Minjong Kim wrote:
I used the Ceph kernel mount. With the fuse mount it works fine.
The test script is nothing special; I just ran the cp and rm commands on the ImageNet dataset. It's very fast on kernel 5.19.17 and very slow on 6.0.0-rc3.
The ImageNet dataset has 1,200,000 images in 1000 directories, about 127GB in total, but that doesn't seem to matter (it's slow from the start).
It seems it can be reproduced by copying or deleting several files of less than 100KB in one directory. Should I write a simple script?
Yeah, please.
BTW, have you ever tested the latest Ceph code from the testing branch of the ceph-client [1] repo?
I've tried building and running the testing branch, and it seems equally slow.
#10 Updated by Minjong Kim over 1 year ago
Also, it seems that requests to the MDS are much slower than block writes. When I run the rm command, the client sends an average of 5,000-6,000 requests on the 5.15 kernel, an average of about 10,000 on the 5.19 kernel, and an average of about 120 on the 6.0 kernel. Copying, however, does not show nearly as much degradation.
And a question unrelated to this issue: what improvements did the Ceph client receive between 5.15 and 5.19? Perceptually, I see a 1.5-2x performance improvement. It seems the writeback mechanism changed along with the speedup of the MDS client. This no longer saturates the maximum bandwidth of the network... I spent a lot of time trying to solve this.
#11 Updated by Xiubo Li over 1 year ago
- Assignee set to Xiubo Li
#12 Updated by Xiubo Li over 1 year ago
Minjong Kim wrote:
https://gist.github.com/caffeinism/dbfd974374d620911a6c0c3dd1daadfb
I am not good at writing files in a shell script (testing it doesn't seem to write blocks properly...), so I wrote it in a Python script. sorry. This script is designed to create a folder structure similar to imagenet by default. I've tried running it on 6.0.3 and 5.19.17 and it reproduces similarly. (although faster than cp).
python3 test.py /mnt/cephfs/some-directory
Thanks MinJong, I will take a look.
#13 Updated by Xiubo Li over 1 year ago
I saw the same issue when testing the `6.1.0-rc1` upstream code:
[xfstests-dev]# sudo ./check -g quick -E ./ceph.exclude
FSTYP -- ceph
PLATFORM -- Linux/x86_64 lxbceph1 6.1.0-rc1+ #174 SMP PREEMPT_DYNAMIC Mon Nov 14 09:56:19 CST 2022
MKFS_OPTIONS -- 10.72.47.117:40864:/testB
MOUNT_OPTIONS -- -o name=admin,nowsync,copyfrom,rasize=4096 -o context=system_u:object_r:root_t:s0 10.72.47.117:40864:/testB /mnt/kcephfs.B
ceph/001 [expunged]
ceph/002 62s ... 169s
ceph/003 75s ... 184s
ceph/004 82s ... 171s
ceph/005 63s ... 166s
generic/001 98s ... 327s
generic/002 60s ... 174s
generic/003 [expunged]
generic/004 [not run] O_TMPFILE is not supported
generic/005 64s ... 167s
generic/006 79s ... 211s
generic/007 114s ... 280s
generic/008 [not run] xfs_io fzero failed (old kernel/wrong fs?)
generic/009 [not run] xfs_io fzero failed (old kernel/wrong fs?)
generic/011 83s ... 212s
generic/012 [not run] xfs_io falloc failed (old kernel/wrong fs?)
generic/013 242s ... 421s
generic/014 442s ... 764s
generic/015 [not run] Filesystem ceph not supported in _scratch_mkfs_sized
generic/016 [not run] xfs_io falloc failed (old kernel/wrong fs?)
generic/018 [not run] defragmentation not supported for fstype "ceph"
generic/020 97s ... 271s
generic/021 [not run] xfs_io falloc failed (old kernel/wrong fs?)
generic/022 [not run] xfs_io falloc failed (old kernel/wrong fs?)
generic/023 66s ... 225s
generic/024 [not run] kernel doesn't support renameat2 syscall
generic/025 [not run] kernel doesn't support renameat2 syscall
generic/026 [not run] ceph does not define maximum ACL count
....
Then I switched back to `5.19.0`, still the same but a little better:
[xfstests-dev]# sudo ./check -g quick -E ./ceph.exclude
FSTYP -- ceph
PLATFORM -- Linux/x86_64 lxbceph1 5.19.0+ #159 SMP PREEMPT_DYNAMIC Wed Aug 3 17:09:18 CST 2022
MKFS_OPTIONS -- 10.72.47.117:40643:/testB
MOUNT_OPTIONS -- -o name=admin,nowsync,copyfrom,rasize=4096 -o context=system_u:object_r:root_t:s0 10.72.47.117:40643:/testB /mnt/kcephfs.B
ceph/001 [expunged]
ceph/002 169s ... 152s
ceph/003 184s ... 167s
ceph/004 171s ... 148s
ceph/005 166s ... 148s
generic/001 327s ... 263s
generic/002 174s ... 152s
generic/003 [expunged]
generic/004 [not run] O_TMPFILE is not supported
generic/005 167s ... 150s
#14 Updated by Xiubo Li over 1 year ago
- Status changed from New to In Progress
#15 Updated by Xiubo Li over 1 year ago
[root@lxbceph1 xfstests-dev]# ./check -g quick -E ./ceph.exclude ceph
FSTYP -- ceph
PLATFORM -- Linux/x86_64 lxbceph1 6.1.0+ #186 SMP PREEMPT_DYNAMIC Tue Dec 13 10:45:30 CST 2022
MKFS_OPTIONS -- 10.72.47.117:40986:/testB
MOUNT_OPTIONS -- -o name=admin,nowsync,copyfrom,rasize=4096 -o context=system_u:object_r:root_t:s0 10.72.47.117:40986:/testB /mnt/kcephfs.B
ceph/001 [expunged]
ceph/002 180s ... 9s
ceph/003 198s ... 23s
ceph/004 169s ... 7s
ceph/005 170s ... 7s
generic/001 332s ... 157s
generic/002 178s ... 12s
generic/003 [expunged]
generic/004 [not run] O_TMPFILE is not supported
generic/005 178s ... 9s
generic/006 218s ... 49s
generic/007 318s ... 130s
generic/008 [not run] xfs_io fzero failed (old kernel/wrong fs?)
generic/009 [not run] xfs_io fzero failed (old kernel/wrong fs?)
generic/011 231s ... 148s
generic/012 [not run] xfs_io falloc failed (old kernel/wrong fs?)
generic/013 449s ...
After the testing branch was rebased onto Linux 6.1, I see it became faster, without any Ceph changes compared with the last test on Linux 6.1.0-rc1.
So this is most likely not a kceph issue.
#16 Updated by Minjong Kim over 1 year ago
Xiubo Li wrote:
[root@lxbceph1 xfstests-dev]# ./check -g quick -E ./ceph.exclude ceph
FSTYP -- ceph
PLATFORM -- Linux/x86_64 lxbceph1 6.1.0+ #186 SMP PREEMPT_DYNAMIC Tue Dec 13 10:45:30 CST 2022
MKFS_OPTIONS -- 10.72.47.117:40986:/testB
MOUNT_OPTIONS -- -o name=admin,nowsync,copyfrom,rasize=4096 -o context=system_u:object_r:root_t:s0 10.72.47.117:40986:/testB /mnt/kcephfs.B
ceph/001 [expunged]
ceph/002 180s ... 9s
ceph/003 198s ... 23s
ceph/004 169s ... 7s
ceph/005 170s ... 7s
generic/001 332s ... 157s
generic/002 178s ... 12s
generic/003 [expunged]
generic/004 [not run] O_TMPFILE is not supported
generic/005 178s ... 9s
generic/006 218s ... 49s
generic/007 318s ... 130s
generic/008 [not run] xfs_io fzero failed (old kernel/wrong fs?)
generic/009 [not run] xfs_io fzero failed (old kernel/wrong fs?)
generic/011 231s ... 148s
generic/012 [not run] xfs_io falloc failed (old kernel/wrong fs?)
generic/013 449s ...
After the testing branch was rebased onto Linux 6.1, I see it became faster, without any Ceph changes compared with the last test on Linux 6.1.0-rc1.
So this is most likely not a kceph issue.
It seems so to me, too.