Bug #45532
cephfs kernel client rmdir degradation/regression in kernels 5.1 and higher
Description
Hi,
a simple test script (testdirs.sh) demonstrating the problem described below; it creates 10k files in ~1000 directories:
------
mkdir test
for i in `seq 1 10` ; do
  mkdir test/$i
  for j in `seq 1 10` ; do
    mkdir test/$i/$j
    for k in `seq 1 10` ; do
      mkdir test/$i/$j/$k
      for l in `seq 1 10` ; do
        echo "foo" > test/$i/$j/$k/$l
      done
    done
  done
done
--------
The problem: running the script above on one host and "rm -rf test" on another host is extremely slow if the directories and files were created on Linux kernels >= 5.1.*, including 5.7.0-rc5; with many directories it takes hours.
Some timings, create and rm always on different hosts:
1) create on 4.19.115, rm on 5.7.0-rc5 (nowsync mounted):
$ time bash testdirs.sh
real 0m18.682s
user 0m1.557s
sys 0m4.099s
$ time rm -rf test
real 1m15.084s
user 0m0.072s
sys 0m1.297s
2) create on 4.9.115, rm on 4.9.115
$ time bash testdirs.sh
real 0m21.444s
user 0m1.571s
sys 0m4.122s
$ time rm -rf test
real 0m56.503s
user 0m0.014s
sys 0m0.166s
3) create on 5.7.0-rc5, rm on 4.9.115
$ time bash testdirs.sh
real 0m7.870s
user 0m1.204s
sys 0m2.307s
$ time rm -rf test
hours....
interrupted; ~250 files removed in real 2m12.912s,
which extrapolates to ~5300 seconds to fully remove the tree.
Removing the rest on the same host that created the files is fast: real 0m38.193s
4) create on 5.7.0-rc5, rm on 5.7.0-rc5
same as 3)
The same test was repeated with other kernels: up to 5.0.* everything is fine; from 5.1.* onward, all clients show the rmdir issue.
The culprit seems to be the very long time (up to 5 seconds each) needed to obtain an exclusive lock on each directory that was created on a 5.1 or newer kernel, as seen in the MDS dump_ops_in_flight output and in the kernel client's mdsc debugfs file:
...
{
"time": "2020-05-13T13:20:46.687464+0200",
"event": "failed to xlock, waiting"
}
...
$ cat /sys/kernel/debug/ceph/*/mdsc
46013 mds0 rmdir #10036bf77be/4 (tmp/k/test/8/3/4)
There seems to be no degradation in unlinking files:
on 5.7.0-rc5
$ mkdir test; time for i in `seq 1 10000` ; do echo "foo" > test/$i ; done
real 0m1.657s
user 0m0.158s
sys 0m0.205s
on 4.19.115:
$ time rm -rf test
real 0m46.566s
user 0m0.000s
sys 0m0.123s
This problem heavily affects central services doing data management and cleanup of scratch space used by computing and data-processing nodes. The only workarounds for now are to stick with the 4.19 or 5.0 kernels, or to run rm in parallel on each subdirectory.
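The parallel-rm workaround looks roughly like this (a minimal sketch; the parallel_rm helper name and the -P 8 fan-out are arbitrary choices, not something from the report):

```shell
# Sketch of the parallel-rm workaround: remove each top-level subdirectory
# in its own rm process, so the per-directory lock stalls overlap instead
# of serializing. -P 8 (number of parallel workers) is an arbitrary choice.
parallel_rm() {
    target=$1
    # one rm -rf per immediate child, up to 8 at a time
    find "$target" -mindepth 1 -maxdepth 1 -print0 |
        xargs -0 -P 8 -n 1 rm -rf
    rmdir "$target"
}
```

On the reproducer tree above this would be `parallel_rm test`.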
The problem was observed on mimic, nautilus and lately on octopus.
History
#1 Updated by Andrej Filipcic almost 4 years ago
Much simpler test case:
$ cat /ceph/tmp/k/testmkdirrmdir.sh
for i in `seq 1 10`; do
  #ssh f9pc45 mkdir /ceph/tmp/k/a
  ssh f9pc23 mkdir /ceph/tmp/k/a
  time rmdir /ceph/tmp/k/a
done

# ssh to 4.19 kernel host
$ bash /ceph/tmp/k/testmkdirrmdir.sh 2>&1 | grep real
real 0m0.013s
real 0m0.017s
real 0m0.009s
real 0m0.009s
real 0m0.008s
real 0m0.222s
real 0m0.010s
real 0m0.012s
real 0m0.012s
real 0m0.011s

# ssh to 5.7-rc5 kernel host
$ bash /ceph/tmp/k/testmkdirrmdir.sh 2>&1 | grep real
real 0m4.885s
real 0m6.011s
real 0m5.514s
real 0m5.004s
real 0m5.003s
real 0m6.030s
real 0m5.512s
real 0m5.000s
real 0m5.008s
real 0m6.024s
Looks like the mkdir implementation in newer kernels leaves problematic locks around.
#2 Updated by Patrick Donnelly almost 4 years ago
- Status changed from New to Triaged
- Assignee set to Jeff Layton
#3 Updated by Jeff Layton almost 4 years ago
Ok, I reproduced this today and it seems like it's just that when an inode is created, we generally give a full set of caps to that client. When you go to remove them from another client, the MDS has to individually revoke them.
AFAICT, the creating client releases the caps in a timely fashion, but the revoke requests come in batches from the MDS in ~5s intervals. To me, that suggests that these are being batched up and sent out with the MDS tick (as the tick interval defaults to 5s).
I haven't tested the earlier clients, but I suspect they were just proactively releasing caps that they didn't need? That needs to be confirmed, however.
#4 Updated by Andrej Filipcic almost 4 years ago
To me it seems it's only related to directories. This file test is fast enough, and it shows no difference between old and recent kernels (nowsync is disabled):
for i in `seq 1 10`; do
  ssh f9pc45 "echo foo > /ceph/tmp/k/afile"
  time rm /ceph/tmp/k/afile
done
f9pc08 /ceph/tmp/k # bash testfilermfile.sh 2>&1 | grep real
real 0m0.055s
real 0m0.043s
real 0m0.282s
real 0m0.251s
real 0m0.246s
real 0m0.262s
real 0m0.317s
real 0m0.070s
real 0m0.252s
real 0m0.031s
The main issue is that this 5s delay persists forever, e.g. hours or more, even if the client that created the dirs unmounts cephfs.
The comparison with creating files:
# time ( for i in `seq 1 1000` ; do echo $i > $i ; done )
real 0m2.271s
user 0m0.015s
sys 0m0.019s
remove on the same client:
# time rm -f *
real 0m1.803s
user 0m0.000s
sys 0m0.013s
remove on a different client:
# time rm -f *
real 0m33.248s
user 0m0.010s
sys 0m0.132s
It could be faster (and probably is in async mode), but there is no real difference between the kernels here.
So, there must be some caps that are never released on the dentries by the new kernels. I enabled libceph/ceph module debugging, but I got lost in the many calls and messages. If you can give me a hint on which caps I should check, I might be able to pin down the difference between the old and new client behavior when handling the dirs.
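For reference, the kernel client exposes the caps it currently holds through debugfs, which is one place to compare the creating client before and after the slow rm (a sketch; it assumes debugfs is mounted at /sys/kernel/debug, and the base path is parameterized only so it can be exercised on copied files):

```shell
# Sketch: print the "caps" debugfs file for every mounted cephfs client.
# Assumes debugfs at /sys/kernel/debug; pass another base dir for testing.
dump_ceph_caps() {
    base=${1:-/sys/kernel/debug/ceph}
    for d in "$base"/*/; do
        [ -r "${d}caps" ] || continue
        echo "== ${d}caps"
        cat "${d}caps"
    done
}
```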
#5 Updated by Jeff Layton almost 4 years ago
Andrej Filipcic wrote:
The main issue is that this 5s delay persists forever, e.g. hours or more, even if the client that created the dirs unmounts cephfs.
That sounds really odd and I didn't see that in my testing. When the creating client unmounted the cephfs, its state was revoked and the rm proceeded much more quickly.
In any case, now that I'm looking again, I suspect the client is delaying the cap releases until its workqueue job runs. This patch went in between v5.0 and v5.1:
commit e3ec8d6898f71636a067dae683174ef9bf81bc96
Author: Yan, Zheng <zyan@redhat.com>
Date:   Mon Jan 14 17:21:19 2019 +0800

    ceph: send cap releases more aggressively

    When pending cap releases fill up one message, start a work to send
    cap release message. (old way is sending cap releases every 5 seconds)

    Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
    Reviewed-by: Jeff Layton <jlayton@redhat.com>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Though it seems a lot like we're still getting the "old way".
#6 Updated by Andrej Filipcic almost 4 years ago
I have tried commit e3ec8d6898f71636a067dae683174ef9bf81bc96 on the 5.0.21 kernel, where it applies cleanly, and 5.0.21 works fine with it. No slowdown observed, so it must be something else.
I was wrong on umount, it releases the caps and rm is then fast.
There is something more strange with dirs:
f9pc45 /ceph/tmp/k # mkdir a
f9pc45 /ceph/tmp/k # cd a
f9pc45 /ceph/tmp/k/a # for i in `seq 1 10` ; do mkdir $i ; done
f9pc08 /ceph/tmp/k # time rm -rf a/*
real 0m0.517s
user 0m0.001s
sys 0m0.000s
f9pc45 /ceph/tmp/k/a # for i in `seq 1 10` ; do mkdir $i ; done
f9pc08 /ceph/tmp/k # time rm -rf a/*
real 0m51.210s
user 0m0.000s
sys 0m0.002s
Creating the parent dir and then creating the subdirs, the subdirs are removed fast. Creating the subdirs once more under the same parent, they are removed very slowly. This is consistent across retries.
#7 Updated by Jeff Layton almost 4 years ago
Ok, if you feel ambitious, there is a function in fs/ceph/mds_client.c called schedule_delayed(). That requeues the delayed workqueue job every 5s. Try reducing the delay to 1s instead. Does that change the behavior significantly?
Another idea (a little more time-consuming) would be to run a git bisect between the v5.0 and v5.1 kernels. That should help nail down the change that regressed things. You should be able to limit it to commits that touch the fs/ceph directory in the kernel sources.
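The path-limited bisect can be driven mechanically with `git bisect run`. Here is a toy demonstration of the workflow on a throwaway repo (a sketch only: in the real kernel tree the start command would be `git bisect start v5.1 v5.0 -- fs/ceph`, and the run script would build, boot, and exercise the rmdir test instead of grepping a file):

```shell
# Toy demo of a path-limited automated bisect. The commits, tags, and the
# "broken" marker string are all fabricated for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect
mkdir -p fs/ceph
echo 'lease check: ok' > fs/ceph/dir.c
git add -A && git commit -qm 'good baseline'
git tag good-tag
echo 'lease check: ok (refactor)' > fs/ceph/dir.c
git commit -qam 'harmless refactor'
echo 'lease check: broken' > fs/ceph/dir.c
git commit -qam 'introduce regression'
git tag regression
echo 'lease check: broken (later churn)' > fs/ceph/dir.c
git commit -qam 'later commit'
# bisect only over commits touching fs/ceph; a commit counts as "good"
# when the marker string is absent
git bisect start HEAD good-tag -- fs/ceph >/dev/null
git bisect run sh -c '! grep -q broken fs/ceph/dir.c' >/dev/null
first_bad=$(git rev-parse refs/bisect/bad)
```

After the run, `refs/bisect/bad` points at the first commit that introduced the failure.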
#8 Updated by Andrej Filipcic almost 4 years ago
This one is problematic:
commit 1e9c2eb6811e8e017b589b483f6ff2b7c065eef5
Author: Yan, Zheng <zyan@redhat.com>
Date:   Mon Jan 28 20:43:55 2019 +0800

    ceph: delete stale dentry when last reference is dropped

    introduce ceph_d_delete(), which checks if dentry has valid lease.

    Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
    Reviewed-by: Jeff Layton <jlayton@redhat.com>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

 fs/ceph/dir.c        | 130 +++++++++++++++++++++++++++++++++++++++------------
 fs/ceph/inode.c      |   2 +-
 fs/ceph/mds_client.c |   2 +-
 fs/ceph/super.h      |   2 +-
 4 files changed, 104 insertions(+), 32 deletions(-)
#9 Updated by Jeff Layton almost 4 years ago
Thanks for running the bisect. I can confirm that if I set the d_delete op to NULL, the problem goes away. That said, it's not immediately obvious to me why this is occurring, so I'll need to do a bit more investigation.
#10 Updated by Jeff Layton almost 4 years ago
Ok, I threw in a little printk debugging, and it looks like the lease generations are not matching up as I'd expect. Small debug patch:
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 39f5311404b0..1ef77f68fbf5 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1616,10 +1616,18 @@ static int __dir_lease_try_check(const struct dentry *dentry)
 
 	ci = ceph_inode(dir);
 	if (spin_trylock(&ci->i_ceph_lock)) {
-		if (atomic_read(&ci->i_shared_gen) == di->lease_shared_gen &&
+		int shared_gen = atomic_read(&ci->i_shared_gen);
+		int issued = __ceph_caps_issued(ci, NULL);
+
+		if (shared_gen == di->lease_shared_gen &&
 		    __ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 0))
 			valid = 1;
 		spin_unlock(&ci->i_ceph_lock);
+		if (!valid) {
+			printk("%s: invalid shared_gen=%d lease_gen=%d mask=%s mode=0%o\n",
+				__func__, shared_gen, di->lease_shared_gen, ceph_cap_string(issued), d_inode(dentry)->i_mode);
+			dump_stack();
+		}
 	} else {
 		valid = -EBUSY;
 	}
...and that generates messages in the logs like this:
[ 2888.470033] __dir_lease_try_check: invalid shared_gen=6 lease_gen=2 mask=pAsLsXs mode=040775
[ 2888.471799] CPU: 5 PID: 20419 Comm: kworker/5:1 Kdump: loaded Tainted: G OE 5.7.0-rc5+ #44
[ 2888.473647] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
[ 2888.475319] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[ 2888.476392] Call Trace:
[ 2888.477110]  dump_stack+0x64/0x88
[ 2888.477952]  __dir_lease_try_check+0xbf/0x100 [ceph]
[ 2888.478997]  ceph_d_delete+0x49/0x60 [ceph]
[ 2888.479937]  dput+0x156/0x350
[ 2888.480721]  d_prune_aliases+0x57/0x120
[ 2888.481641]  handle_cap_grant+0x392/0xbf0 [ceph]
[ 2888.482647]  ceph_handle_caps+0xab0/0x1840 [ceph]
[ 2888.483693]  dispatch+0x285/0x1430 [ceph]
[ 2888.484630]  ? inet_recvmsg+0x4d/0xd0
[ 2888.485532]  ? ceph_tcp_recvmsg+0x62/0x80 [libceph]
[ 2888.486577]  ? calc_signature+0xe4/0x110 [libceph]
[ 2888.487614]  ceph_con_workfn+0x19f7/0x2800 [libceph]
[ 2888.488660]  ? __switch_to_asm+0x40/0x70
[ 2888.489566]  ? __switch_to_asm+0x34/0x70
[ 2888.490450]  ? __switch_to_asm+0x40/0x70
[ 2888.491314]  ? __switch_to_asm+0x34/0x70
[ 2888.492182]  ? __switch_to_asm+0x40/0x70
[ 2888.493047]  ? __switch_to_asm+0x40/0x70
[ 2888.493909]  ? __switch_to+0x152/0x420
[ 2888.494733]  ? __switch_to_asm+0x34/0x70
[ 2888.495584]  process_one_work+0x1b4/0x380
[ 2888.496433]  worker_thread+0x53/0x3e0
[ 2888.497232]  ? process_one_work+0x380/0x380
[ 2888.498132]  kthread+0x115/0x140
[ 2888.498942]  ? __kthread_bind_mask+0x60/0x60
[ 2888.499905]  ret_from_fork+0x22/0x40
#11 Updated by Jeff Layton almost 4 years ago
Ok, this patch (suggested by Zheng) seems to fix it.
Andrej, can you test it out and confirm whether it fixes it for you as well? Also, let me know if I got your email address wrong. If this looks good, I'll post it later today and we'll get it into the testing branch (and hopefully to Linus soon).
#12 Updated by Andrej Filipcic almost 4 years ago
Yes, it fixes the issue for me; it's fast now. I have tested it on a 5.6.13 kernel. You got the email right.
Many thanks for the help and the quick fix.
#13 Updated by Andrej Filipcic almost 4 years ago
Hi,
I have checked the 5.6.15 kernel patches, but this fix does not seem to be there yet?
Cheers,
Andrej
#14 Updated by Jeff Layton almost 4 years ago
Andrej Filipcic wrote:
I have checked 5.6.15 kernel patches, but this fix does not seem to be there yet?
Correct. There'll be one more ceph pull request for v5.7 and the patch for this should be in it. Once that goes in, the stable series kernels should pick it up soon afterward.
#15 Updated by Jeff Layton almost 4 years ago
- Status changed from Triaged to Resolved
The patch for this went into v5.7 and should trickle out to v5.4 and v5.6 stable series kernels soon.