Bug #45532

cephfs kernel client rmdir degradation/regression in kernels 5.1 and higher

Added by Andrej Filipcic almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Performance/Resource Usage
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

a simple test script (testdirs.sh) that demonstrates the problem described below; it creates 10k files spread over ~1000 directories:
------

mkdir test
for i in `seq 1 10` ; do
  mkdir test/$i
  for j in `seq 1 10` ; do
    mkdir test/$i/$j
    for k in `seq 1 10` ; do
      mkdir test/$i/$j/$k 
      for l in `seq 1 10` ; do
        echo "foo" > test/$i/$j/$k/$l
      done
    done
  done
done

--------

The problem: running the script above on one host and then "rm -rf test" on another host is extremely slow if the directories and files were created on Linux kernels >= 5.1.* (including 5.7.0-rc5); with many directories it takes hours.

Some timings, create and rm always on different hosts:

1) create on 4.19.115, rm on 5.7.0-rc5 (nowsync mounted):

$ time bash testdirs.sh
real 0m18.682s
user 0m1.557s
sys 0m4.099s

$ time rm -rf test
real 1m15.084s
user 0m0.072s
sys 0m1.297s

2) create on 4.19.115, rm on 4.19.115

$ time bash testdirs.sh
real 0m21.444s
user 0m1.571s
sys 0m4.122s

$ time rm -rf test
real 0m56.503s
user 0m0.014s
sys 0m0.166s

3) create on 5.7.0-rc5, rm on 4.19.115

$ time bash testdirs.sh
real 0m7.870s
user 0m1.204s
sys 0m2.307s

$ time rm -rf test
takes hours... interrupted; only ~250 files removed in real 2m12.912s,
which extrapolates to roughly 5300 seconds (10000/250 x ~133 s) to fully remove the tree.
Removing the rest on the same host that created the files is fast: real 0m38.193s

4) create on 5.7.0-rc5, rm on 5.7.0-rc5
same as 3)

The same test was run with other kernels: up to 5.0.* everything is fine, while from 5.1.* onwards all clients show the rmdir issue.

The culprit seems to be the very long time needed to obtain an exclusive lock on each directory that was created on a 5.1 or higher kernel, up to 5 seconds per directory, as seen both in the MDS dump_ops_in_flight output and in the kernel client's mdsc debugfs file:
... {
"time": "2020-05-13T13:20:46.687464+0200",
"event": "failed to xlock, waiting"
}
...
$ cat /sys/kernel/debug/ceph/*/mdsc
46013 mds0 rmdir #10036bf77be/4 (tmp/k/test/8/3/4)
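
For reference, a sketch of how the MDS side of this can be collected via the admin socket; "mds.a" is a placeholder for the actual MDS daemon name:

# on an MDS host: list MDS operations currently in flight, including the "failed to xlock, waiting" events
$ ceph daemon mds.a dump_ops_in_flight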

There seems to be no degradation in unlinking files:

on 5.7.0-rc5
$ mkdir test; time for i in `seq 1 10000` ; do echo "foo" > test/$i ; done
real 0m1.657s
user 0m0.158s
sys 0m0.205s

on 4.19.115:
$ time rm -rf test
real 0m46.566s
user 0m0.000s
sys 0m0.123s

This problem heavily affects central services that do data management and cleanup of scratch space used by computing and data-processing nodes. The only workaround for now is to stick with 4.19 or 5.0 kernels, or to perform a parallel rm on each subdirectory.
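
For example, a rough sketch of the parallel-rm workaround; the path and the degree of parallelism are only illustrative:

# remove the second-level subdirectories of "test" with 8 rm processes in parallel,
# then clean up the mostly empty remainder of the tree
$ find test -mindepth 2 -maxdepth 2 -type d -print0 | xargs -0 -P 8 -n 1 rm -rf
$ rm -rf test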

The problem was observed on mimic, nautilus and lately on octopus.

0001-ceph-flush-release-queue-when-handling-caps-for-unkn.patch - proposed patch (1.42 KB), Jeff Layton, 05/20/2020 02:46 PM

History

#1 Updated by Andrej Filipcic almost 4 years ago

Much simpler test case:

$ cat /ceph/tmp/k/testmkdirrmdir.sh
for i in `seq 1 10`; do
   #ssh f9pc45 mkdir /ceph/tmp/k/a
   ssh f9pc23 mkdir /ceph/tmp/k/a
   time rmdir /ceph/tmp/k/a
done

# ssh to 4.19 kernel host
$ bash /ceph/tmp/k/testmkdirrmdir.sh 2>&1 |grep real
real    0m0.013s
real    0m0.017s
real    0m0.009s
real    0m0.009s
real    0m0.008s
real    0m0.222s
real    0m0.010s
real    0m0.012s
real    0m0.012s
real    0m0.011s

# ssh to 5.7-rc5 kernel host
$ bash /ceph/tmp/k/testmkdirrmdir.sh 2>&1 |grep real
real    0m4.885s
real    0m6.011s
real    0m5.514s
real    0m5.004s
real    0m5.003s
real    0m6.030s
real    0m5.512s
real    0m5.000s
real    0m5.008s
real    0m6.024s

It looks like the mkdir implementation on newer kernels leaves problematic locks around.

#2 Updated by Patrick Donnelly almost 4 years ago

  • Status changed from New to Triaged
  • Assignee set to Jeff Layton

#3 Updated by Jeff Layton almost 4 years ago

Ok, I reproduced this today and it seems that when an inode is created, we generally give a full set of caps to the creating client. When you go to remove the files from another client, the MDS has to individually revoke those caps.

AFAICT, the creating client releases the caps in a timely fashion, but the revoke requests come in batches from the MDS in ~5s intervals. To me, that suggests that these are being batched up and sent out with the MDS tick (as the tick interval defaults to 5s).

I haven't tested the earlier clients yet, but I suspect they were just proactively releasing caps they didn't need; that still needs to be confirmed.
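
One way to check that from the client side, assuming debugfs is mounted on the clients, is to watch the caps each kernel client still holds while the rm is running (depending on the kernel version this file lists per-inode caps or just the totals):

$ cat /sys/kernel/debug/ceph/*/caps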

#4 Updated by Andrej Filipcic almost 4 years ago

To me it seems to be related only to directories. This file test is fast enough and does not show any difference between old and recent kernels (nowsync is disabled):

for i in `seq 1 10`; do
   ssh f9pc45 "echo foo >  /ceph/tmp/k/afile" 
   time rm /ceph/tmp/k/afile
done
f9pc08 /ceph/tmp/k # bash testfilermfile.sh 2>&1|grep real
real    0m0.055s
real    0m0.043s
real    0m0.282s
real    0m0.251s
real    0m0.246s
real    0m0.262s
real    0m0.317s
real    0m0.070s
real    0m0.252s
real    0m0.031s

The main issue is that this 5s delay persists indefinitely, e.g. for hours or more, even if the client that created the dirs unmounts cephfs.

The comparison with creating files:

# time ( for i in `seq 1 1000` ; do echo $i > $i ; done )

real    0m2.271s
user    0m0.015s
sys     0m0.019s

remove on the same client:

# time rm -f *

real    0m1.803s
user    0m0.000s
sys     0m0.013s

remove on a different client:
# time rm -f *

real    0m33.248s
user    0m0.010s
sys     0m0.132s

It could be faster (it probably is in async mode), but there is no real difference between the kernels here.

So there must be some caps on the dentries that are never released on new kernels. I enabled libceph/ceph module debugging, but I got lost in the many calls and messages. If you can give me a hint on which caps I should check, I might be able to pin down the difference between the old and new client behavior while handling the dirs.
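
(For what it's worth, a sketch of one way to narrow the module debug output down to just the cap-handling code, assuming debugfs is mounted and the kernel was built with CONFIG_DYNAMIC_DEBUG; run as root:)

# enable debug messages only for fs/ceph/caps.c ...
$ echo 'file fs/ceph/caps.c +p' > /sys/kernel/debug/dynamic_debug/control
# ... and turn them off again afterwards
$ echo 'file fs/ceph/caps.c -p' > /sys/kernel/debug/dynamic_debug/control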

#5 Updated by Jeff Layton almost 4 years ago

Andrej Filipcic wrote:

The main issue is that this 5s delay remains forever, eg hours or more, even if the client that created the dirs umounts cephfs.

That sounds really odd and I didn't see that in my testing. When the creating client unmounted the cephfs, its state was revoked and the rm proceeded much more quickly.

In any case, now that I'm looking again, I suspect the client is delaying the cap releases until its workqueue job runs. This patch went in between v5.0 and v5.1:

commit e3ec8d6898f71636a067dae683174ef9bf81bc96
Author: Yan, Zheng <zyan@redhat.com>
Date:   Mon Jan 14 17:21:19 2019 +0800

    ceph: send cap releases more aggressively

    When pending cap releases fill up one message, start a work to send
    cap release message. (old way is sending cap releases every 5 seconds)

    Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
    Reviewed-by: Jeff Layton <jlayton@redhat.com>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

Though it seems a lot like we're still getting the "old way".

#6 Updated by Andrej Filipcic almost 4 years ago

I have tried commit e3ec8d6898f71636a067dae683174ef9bf81bc96 on the 5.0.21 kernel, where it applies cleanly, and it works fine there: no slowdown observed, so the regression must be caused by something else.

I was wrong about umount: it does release the caps, and rm is then fast.

There is something even stranger going on with dirs:

f9pc45 /ceph/tmp/k # mkdir a
f9pc45 /ceph/tmp/k # cd a
f9pc45 /ceph/tmp/k/a # for i in `seq 1 10` ;do mkdir $i ; done

f9pc08 /ceph/tmp/k # time rm -rf a/*

real    0m0.517s
user    0m0.001s
sys     0m0.000s

f9pc45 /ceph/tmp/k/a # for i in `seq 1 10` ;do mkdir $i ; done

f9pc08 /ceph/tmp/k # time rm -rf a/*

real    0m51.210s
user    0m0.000s
sys     0m0.002s

When the parent dir is created first and then the subdirs, the subdirs are removed fast. When the subdirs are created once more under the same, already existing parent, they are removed very slowly. This is consistent across retries.

#7 Updated by Jeff Layton almost 4 years ago

Ok, if you feel ambitious, there is a function in fs/ceph/mds_client.c called schedule_delayed(). It requeues the delayed workqueue job every 5s. Try reducing the delay to 1s instead; does that change the behavior significantly?

Another idea (a little more time consuming) would be to run a git bisect between the v5.0 and v5.1 kernels. That should help nail down the change that regressed things. You should be able to limit it to commits that touch the fs/ceph directory in the kernel sources.
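
A sketch of what that could look like in a kernel git tree, using the good/bad versions mentioned above and limiting the bisect to fs/ceph:

$ git bisect start v5.1 v5.0 -- fs/ceph
# build and boot each kernel the bisect checks out, rerun the rmdir test, then mark it:
$ git bisect bad       # slow rmdir
$ git bisect good      # fast rmdir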

#8 Updated by Andrej Filipcic almost 4 years ago

This one is problematic:

commit 1e9c2eb6811e8e017b589b483f6ff2b7c065eef5
Author: Yan, Zheng <zyan@redhat.com>
Date:   Mon Jan 28 20:43:55 2019 +0800

    ceph: delete stale dentry when last reference is dropped

    introduce ceph_d_delete(), which checks if dentry has valid lease.

    Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
    Reviewed-by: Jeff Layton <jlayton@redhat.com>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

 fs/ceph/dir.c        | 130 +++++++++++++++++++++++++++++++++++++++------------
 fs/ceph/inode.c      |   2 +-
 fs/ceph/mds_client.c |   2 +-
 fs/ceph/super.h      |   2 +-
 4 files changed, 104 insertions(+), 32 deletions(-)

#9 Updated by Jeff Layton almost 4 years ago

Thanks for running the bisect. I can confirm that if I set the d_delete op to NULL, this problem goes away. That said, it's not immediately obvious to me why it occurs, so I'll need to do a bit more investigation.

#10 Updated by Jeff Layton almost 4 years ago

Ok, I threw in a little printk debugging and it looks like the lease generations are not matching up the way I'd expect. A small debug patch:

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 39f5311404b0..1ef77f68fbf5 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1616,10 +1616,18 @@ static int __dir_lease_try_check(const struct dentry *dentry)
        ci = ceph_inode(dir);

        if (spin_trylock(&ci->i_ceph_lock)) {
-               if (atomic_read(&ci->i_shared_gen) == di->lease_shared_gen &&
+               int shared_gen = atomic_read(&ci->i_shared_gen);
+               int issued = __ceph_caps_issued(ci, NULL);
+
+               if (shared_gen == di->lease_shared_gen &&
                    __ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 0))
                        valid = 1;
                spin_unlock(&ci->i_ceph_lock);
+               if (!valid) {
+                       printk("%s: invalid shared_gen=%d lease_gen=%d mask=%s mode=0%o\n",
+                               __func__, shared_gen, di->lease_shared_gen, ceph_cap_string(issued), d_inode(dentry)->i_mode);
+                       dump_stack();
+               }
        } else {
                valid = -EBUSY;
        }

...and that generates some messages in the logs like this:

[ 2888.470033] __dir_lease_try_check: invalid shared_gen=6 lease_gen=2 mask=pAsLsXs mode=040775
[ 2888.471799] CPU: 5 PID: 20419 Comm: kworker/5:1 Kdump: loaded Tainted: G           OE     5.7.0-rc5+ #44
[ 2888.473647] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
[ 2888.475319] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[ 2888.476392] Call Trace:
[ 2888.477110]  dump_stack+0x64/0x88
[ 2888.477952]  __dir_lease_try_check+0xbf/0x100 [ceph]
[ 2888.478997]  ceph_d_delete+0x49/0x60 [ceph]
[ 2888.479937]  dput+0x156/0x350
[ 2888.480721]  d_prune_aliases+0x57/0x120
[ 2888.481641]  handle_cap_grant+0x392/0xbf0 [ceph]
[ 2888.482647]  ceph_handle_caps+0xab0/0x1840 [ceph]
[ 2888.483693]  dispatch+0x285/0x1430 [ceph]
[ 2888.484630]  ? inet_recvmsg+0x4d/0xd0
[ 2888.485532]  ? ceph_tcp_recvmsg+0x62/0x80 [libceph]
[ 2888.486577]  ? calc_signature+0xe4/0x110 [libceph]
[ 2888.487614]  ceph_con_workfn+0x19f7/0x2800 [libceph]
[ 2888.488660]  ? __switch_to_asm+0x40/0x70
[ 2888.489566]  ? __switch_to_asm+0x34/0x70
[ 2888.490450]  ? __switch_to_asm+0x40/0x70
[ 2888.491314]  ? __switch_to_asm+0x34/0x70
[ 2888.492182]  ? __switch_to_asm+0x40/0x70
[ 2888.493047]  ? __switch_to_asm+0x40/0x70
[ 2888.493909]  ? __switch_to+0x152/0x420
[ 2888.494733]  ? __switch_to_asm+0x34/0x70
[ 2888.495584]  process_one_work+0x1b4/0x380
[ 2888.496433]  worker_thread+0x53/0x3e0
[ 2888.497232]  ? process_one_work+0x380/0x380
[ 2888.498132]  kthread+0x115/0x140
[ 2888.498942]  ? __kthread_bind_mask+0x60/0x60
[ 2888.499905]  ret_from_fork+0x22/0x40

#11 Updated by Jeff Layton almost 4 years ago

Ok, this patch (attached above, suggested by Zheng) seems to fix it.

Andrej, can you test it out and confirm whether it fixes it for you as well? Also, let me know if I got your email address wrong. If this looks good, I'll post it later today and we'll get it into the testing branch (and hopefully to Linus soon).

#12 Updated by Andrej Filipcic almost 4 years ago

Yes, it fixes the issue for me; it's fast now. I have tested it on the 5.6.13 kernel. You got the email right.

Many thanks for the help and the quick fix.

#13 Updated by Andrej Filipcic almost 4 years ago

Hi,

I have checked 5.6.15 kernel patches, but this fix does not seem to be there yet?

Cheers,
Andrej

#14 Updated by Jeff Layton almost 4 years ago

Andrej Filipcic wrote:

I have checked 5.6.15 kernel patches, but this fix does not seem to be there yet?

Correct. There'll be one more ceph pull request for v5.7 and the patch for this should be in it. Once that goes in, the stable series kernels should pick it up soon afterward.

#15 Updated by Jeff Layton almost 4 years ago

  • Status changed from Triaged to Resolved

The patch for this went into v5.7 and should trickle out to v5.4 and v5.6 stable series kernels soon.
