Bug #44100
cephfs rsync kworker high load
Status: Closed
Description
I have an rsync backup job that now takes 10 hours. When I test with a single rsync instance, it processes quite a few files quickly, then nothing happens for 20 minutes, and then it processes 4 more files and finishes correctly (without rsync errors).
I already posted some related questions on the mailing list:
https://www.mail-archive.com/ceph-users@ceph.io/msg01718.html
echo l > /proc/sysrq-trigger
[6310351.557432] CPU: 5 PID: 3453020 Comm: kworker/5:1 Not tainted 3.10.0-1062.4.3.el7.x86_64 #1
[6310351.557433] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0b 05/27/2014
[6310351.557435] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[6310351.557436] task: ffffa0d29fbe20e0 ti: ffffa0c697b48000 task.ti: ffffa0c697b48000
[6310351.557437] RIP: 0010:[<ffffffffc08d7f70>] [<ffffffffc08d7f70>] build_snap_context+0x20/0x290 [ceph]
[6310351.557438] RSP: 0018:ffffa0c697b4baa8 EFLAGS: 00000292
[6310351.557439] RAX: ffffa0c699949188 RBX: ffffa0c699949100 RCX: 00000000fffffff8
[6310351.557440] RDX: ffffa0c697b4bb68 RSI: ffffa0c697b4bb68 RDI: ffffa0c699949100
[6310351.557441] RBP: ffffa0c697b4bad8 R08: ffffa0c699948e00 R09: 0000000000027247
[6310351.557442] R10: 0000000000000036 R11: 0000000000000000 R12: ffffa0c697b4bb68
[6310351.557443] R13: ffffa0c697b4bb68 R14: ffffa0cda3a37b00 R15: ffffa0cda3a37b00
[6310351.557444] FS: 0000000000000000(0000) GS:ffffa0d2cfb40000(0000) knlGS:0000000000000000
[6310351.557445] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[6310351.557446] CR2: 00007fae085ac000 CR3: 0000001813010000 CR4: 00000000000627e0
[6310351.557447] Call Trace:
[6310351.557448] [<ffffffffc08d820c>] rebuild_snap_realms+0x2c/0x90 [ceph]
[6310351.557449] [<ffffffffc08d822b>] rebuild_snap_realms+0x4b/0x90 [ceph]
[6310351.557450] [<ffffffffc08d91fc>] ceph_update_snap_trace+0x3ec/0x530 [ceph]
[6310351.557452] [<ffffffffc08e2239>] handle_reply+0x359/0xc60 [ceph]
[6310351.557453] [<ffffffffc08e48ba>] dispatch+0x11a/0xb00 [ceph]
[6310351.557454] [<ffffffffb042e56a>] ? kernel_recvmsg+0x3a/0x50
[6310351.557455] [<ffffffffc05fcff4>] try_read+0x544/0x1300 [libceph]
[6310351.557456] [<ffffffffafee13ce>] ? account_entity_dequeue+0xae/0xd0
[6310351.557457] [<ffffffffafee4d5c>] ? dequeue_entity+0x11c/0x5e0
[6310351.557458] [<ffffffffb042e417>] ? kernel_sendmsg+0x37/0x50
[6310351.557459] [<ffffffffc05fdfb4>] ceph_con_workfn+0xe4/0x1530 [libceph]
[6310351.557460] [<ffffffffb057f568>] ? __schedule+0x448/0x9c0
[6310351.557461] [<ffffffffafebe21f>] process_one_work+0x17f/0x440
[6310351.557462] [<ffffffffafebf336>] worker_thread+0x126/0x3c0
[6310351.557463] [<ffffffffafebf210>] ? manage_workers.isra.26+0x2a0/0x2a0
[6310351.557464] [<ffffffffafec61f1>] kthread+0xd1/0xe0
[6310351.557465] [<ffffffffafec6120>] ? insert_kthread_work+0x40/0x40
[6310351.557466] [<ffffffffb058cd37>] ret_from_fork_nospec_begin+0x21/0x21
[6310351.557467] [<ffffffffafec6120>] ? insert_kthread_work+0x40/0x40
[6310351.557469] Code: 41 5e 5d c3 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 f4 53 48 89 fb 48 83 ec 08 4c 8b 7f 70 <44> 8b 6f 68 44 03 6f 58 4d 85 ff 74 14 49 8b 87 b8 00 00 00 48
Updated by none none about 4 years ago
Did another test rsyncing 2 files, 16 minutes???
- concurr. link backup test2 ###
- start:15:06:13 ###
rsync -avH --del --one-file-system --exclude-from=/home/backup//excluded-dirs --include-from=/home/backup//included-dirs --timeout=0 --ignore-errors --numeric-ids --inplace -e "ssh -o Compression=no -x -o StrictHostKeyChecking=no -b 192.168.10.57" --link-dest=../4/ test2:/ test2:/home test2/5/
receiving incremental file list
./
var/lib/chrony/
var/lib/chrony/drift
var/lib/rsyslog/
var/lib/rsyslog/imjournal.state
var/log/sa/sa12
total size is 29,074,357,908 speedup is 1,063.05
- end:15:22:40 ###
Updated by none none about 4 years ago
PS: this kworker only appears with the 2nd rsync process. I already have an rsync session running copying from a different source; with that process I do not see the kworker spawning.
Updated by none none about 4 years ago
Maybe it is related to renewing capabilities? Or is it normal that the mds is asked to renew 132k caps so often?
{
    "id": 3736134,
    "num_leases": 0,
    "num_caps": 132103,
    "state": "open",
    "request_load_avg": 31901,
    "uptime": 875590.07943545096,
    "replay_requests": 0,
    "completed_requests": 0,
    "reconnecting": false,
    "inst": "client.3736134 v1:192.168.10.114:0/696898030",
    "client_metadata": {
        "features": "00000000000000ff",
        "entity_id": "cephfs.backup",
        "hostname": "c04",
        "kernel_version": "3.10.0-1062.4.3.el7.x86_64",
        "root": "/backup"
    }
}
2020-02-12 16:48:40.564 7fae0ade8700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 325244) from client.3736134
2020-02-12 16:49:00.570 7fae0ade8700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 325245) from client.3736134
2020-02-12 16:49:20.566 7fae0ade8700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 325246) from client.3736134
2020-02-12 16:49:40.569 7fae0ade8700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 325247) from client.3736134
2020-02-12 16:50:00.570 7fae0ade8700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 325248) from client.3736134
2020-02-12 16:50:20.570 7fae0ade8700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 325249) from client.3736134
2020-02-12 16:50:40.569 7fae0ade8700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 325250) from client.3736134
Updated by none none about 4 years ago
none none wrote:
PS. this kworker is only appearing with the 2nd rsync process. I already have a rsync session running copying from different source. With that process I do not see kworker spawning.
With one rsync process it is the same: also kworker high load.
Updated by none none about 4 years ago
If I unmount the cephfs and mount it again, the problem seems to be gone.
Updated by none none about 4 years ago
Tried to work around this by not using multiple cephfs mounts. First process finished quickly, 2nd had kworker high load again.
From this:
c01,c02,c03:/backup /home/backup ceph name=cephfs.backup,secretfile=key,noatime,_netdev 0 0
To this:
c01,c02,c03:/backup /home/backup ceph name=cephfs.backup,secretfile=key,noatime,_netdev 0 0
c01,c02,c03:/backup/h1 /home/backup/h1 ceph name=cephfs.backup,secretfile=key,noatime,_netdev 0 0
c01,c02,c03:/backup/h2 /home/backup/h2 ceph name=cephfs.backup,secretfile=key,noatime,_netdev 0 0
c01,c02,c03:/backup/h3 /home/backup/h3 ceph name=cephfs.backup,secretfile=key,noatime,_netdev 0 0
so concurrent rsyncs have separate clients at the MDS.
Updated by Zheng Yan about 4 years ago
Could you check if this still happens with an upstream 5.5 kernel?
Updated by none none about 4 years ago
Zheng Yan wrote:
Could you check if this still happens with an upstream 5.5 kernel?
From where should I get it? ELRepo? That is not really a Red Hat/CentOS kernel, is it?
Updated by Jeff Layton about 4 years ago
none none wrote:
Zheng Yan wrote:
Could you check if this still happens with an upstream 5.5 kernel?
From where should I get it? ELRepo? That is not really a Red Hat/CentOS kernel, is it?
No, it's not. You would probably need to build a kernel yourself (or run this load on a newer distro like Fedora).
The RHEL7 kernel lacks a lot of patches that have gone in upstream in the last few years. Running your test on something closer to mainline would tell us whether this is something already fixed, or whether we need to fix something upstream too.
We would also need some details about your snapshots. How many do you have?
Updated by none none about 4 years ago
I did not have time to install the 5.5 kernel on the rsync server yet. Today I noticed similar behaviour on a CentOS 7 VM with just 1 vCPU assigned. When another VM was using the smb share, the kworker load went up, stalling even the top session.
  PID USER      PR  NI   VIRT   RES  SHR S %CPU %MEM   TIME+ COMMAND
11001 root      20   0      0     0    0 R 89.1  0.0 0:22.16 kworker/0:1
11005 backupm+  20   0 469172 15304 6720 D  5.9  1.5 0:03.05 smbd
    6 root      20   0      0     0    0 R  1.0  0.0 1:51.03 ksoftirqd/0
    9 root      20   0      0     0    0 S  1.0  0.0 0:15.92 rcu_sched
  364 root      20   0      0     0    0 S  1.0  0.0 0:38.20 xfsaild/dm-0
10944 root      20   0 164120  2256 1560 R  1.0  0.2 0:02.71 top
    1 root      20   0  43436  3248 1964 S  0.0  0.3 0:06.63 systemd
After installing the kernel-ml 5.5 there, the problem seemed much reduced; however, I am getting these messages:
[Mon Mar 9 23:19:57 2020] ceph: Can't lookup inode 1 (err: -13)
[Mon Mar 9 23:21:54 2020] ceph: Can't lookup inode 1 (err: -13)
[Mon Mar 9 23:28:59 2020] ceph: Can't lookup inode 1 (err: -13)
I assume these are just not printed when using the 3.10 kernel. Is this a caps issue / shortcoming of my cephfs user/id?
Updated by none none about 4 years ago
The 5.5 kernel-ml did not really solve it on the bare-metal server either:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1566 root 20 0 0 0 0 R 99.0 0.0 5:29.24 kworker/12:2+ce
Updated by Jonas Jelten over 3 years ago
I also encountered this on Ubuntu 20.04 Linux 5.4.0-60-generic #67-Ubuntu SMP Tue Jan 5 18:31:36 UTC 2021
The cluster is Nautilus, MDS are on 14.2.16.
[  +0.000000] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[  +0.000001] RIP: 0010:cmpu64_rev+0x1c/0x20 [ceph]
[  +0.000001] Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 55 48 8b 16 b8 01 00 00 00 48 89 e5 48 39 17 72 08 0f 97 c0 0f b6 c0 <f7> d8 5d c3 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 49 89 fd 41 >
[  +0.000001] RSP: 0018:ffffb25b4867bb30 EFLAGS: 00000202
[  +0.000000] RAX: 0000000000000001 RBX: 0000000000000008 RCX: 0000000000000063
[  +0.000001] RDX: 0000000000000071 RSI: ffff9db0561a0a98 RDI: ffff9db0561a0a90
[  +0.000000] RBP: ffffb25b4867bb30 R08: 0000000000000138 R09: ffffffffc0456080
[  +0.000001] R10: 0000000000000d40 R11: 0000000000000001 R12: 0000000000000000
[  +0.000000] R13: 0000000000000278 R14: ffff9db0561a0818 R15: 0000000000000280
[  +0.000001] FS:  0000000000000000(0000) GS:ffff9db13ed40000(0000) knlGS:0000000000000000
[  +0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 00007f293d11d004 CR3: 00000019c8e0a000 CR4: 0000000000340ee0
[  +0.000000] Call Trace:
[  +0.000000]  sort_r+0xbd/0x240
[  +0.000001]  ? ceph_ino_compare+0x30/0x30 [ceph]
[  +0.000000]  sort+0xe/0x10
[  +0.000000]  build_snap_context+0x126/0x260 [ceph]
[  +0.000001]  rebuild_snap_realms+0x28/0x80 [ceph]
[  +0.000000]  rebuild_snap_realms+0x42/0x80 [ceph]
[  +0.000001]  ceph_update_snap_trace+0x207/0x4e0 [ceph]
[  +0.000000]  handle_reply+0x23d/0x6a0 [ceph]
[  +0.000000]  dispatch+0xe9/0x610 [ceph]
[  +0.000001]  ? __switch_to_asm+0x34/0x70
[  +0.000000]  ? __switch_to_asm+0x40/0x70
[  +0.000000]  ? __switch_to_asm+0x34/0x70
[  +0.000000]  ? __switch_to_asm+0x40/0x70
[  +0.000001]  ? __switch_to_asm+0x34/0x70
[  +0.000000]  try_read+0x566/0x8c0 [libceph]
[  +0.000000]  ? __switch_to_asm+0x40/0x70
[  +0.000001]  ? __switch_to_asm+0x34/0x70
[  +0.000000]  ? __switch_to_asm+0x40/0x70
[  +0.000000]  ? __switch_to+0x7f/0x470
[  +0.000000]  ? __switch_to_asm+0x40/0x70
[  +0.000001]  ? __switch_to_asm+0x34/0x70
[  +0.000000]  ceph_con_workfn+0x130/0x620 [libceph]
[  +0.000000]  process_one_work+0x1eb/0x3b0
[  +0.000001]  worker_thread+0x4d/0x400
[  +0.000000]  kthread+0x104/0x140
[  +0.000000]  ? process_one_work+0x3b0/0x3b0
[  +0.000000]  ? kthread_park+0x90/0x90
[  +0.000001]  ret_from_fork+0x22/0x40
In "ceph daemon mds.server dump snaps", 161 snapshots are listed.
Please advise what to do :)
This client has really slow performance now (5% iops). After rebooting, the problem is gone, but it reappears shortly once the machine is "used" again.
Updated by Andras Sali about 3 years ago
We have also experienced a similar issue, where kernel mount performance degraded severely while doing rsync (running Ceph 15.2.9 with a few hundred snapshots), and we had to revert to the FUSE mount for slower but more stable performance.
Others have reported similar issues in more detail: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/EI2FDN7ELTAJ4SJUNRULYTA3ZEGOYSWU/
Updated by Andras Pataki over 2 years ago
We've recently started using cephfs snapshots and are running into a similar issue with the kernel client. It seems like traversing the file system and stat'ing files have become extremely slow. Some (not all) stat operations went from microseconds to hundreds of milliseconds in duration. The details are from a node that runs a file system scan:
The lstat calls of the file system scan take 200ms+, here is the strace -T -tt output:
07:05:28.909309 lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/tablespoon.hpp", {st_mode=S_IFREG|0444, st_size=1145, ...}) = 0 <0.272695>
07:05:29.182086 lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/teaspoon.hpp", {st_mode=S_IFREG|0444, st_size=1133, ...}) = 0 <0.268555>
07:05:29.450680 lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/ton.hpp", {st_mode=S_IFREG|0444, st_size=1107, ...}) = 0 <0.270040>
07:05:29.720758 lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/yard.hpp", {st_mode=S_IFREG|0444, st_size=1051, ...}) = 0 <0.268032>
The file system scanning process is constantly stuck in 'D' state with /proc/<pid>/stack:
[<0>] ceph_mdsc_wait_request+0x88/0x150 [ceph]
[<0>] ceph_mdsc_do_request+0x82/0x90 [ceph]
[<0>] ceph_d_revalidate+0x207/0x300 [ceph]
[<0>] lookup_fast+0x179/0x210
[<0>] walk_component+0x44/0x320
[<0>] path_lookupat+0x7b/0x220
[<0>] filename_lookup+0xa5/0x170
[<0>] vfs_statx+0x6e/0xd0
[<0>] __do_sys_newlstat+0x39/0x70
[<0>] do_syscall_64+0x4a/0xe0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
One of the kernel threads is using close to 100% of a single core constantly:
    PID USER  PR  NI VIRT RES SHR S  %CPU %MEM   TIME+ COMMAND
1350556 root  20   0    0   0   0 R 100.0  0.0 4:52.85 [kworker/18:0+ceph-msgr]
Dumping the kernel stacks (echo l > /proc/sysrq-trigger) shows typically waiting in ceph_queue_cap_snap or queue_realm_cap_snaps. A couple of examples:
Dec 10 07:36:56 popeye-mgr-0-10 kernel: CPU: 18 PID: 1350556 Comm: kworker/18:0 Not tainted 5.4.114.1.fi #1
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 2.10.2 02/24/2021
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RIP: 0010:_raw_spin_lock+0xb/0x20
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Code: 31 c0 ba ff 00 00 00 f0 0f b1 17 75 01 c3 e9 ec 8f 98 ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 89 c6 e9 0b 7e 98 ff 90 66 2e 0f 1f 84 00 00 00 00 00 fa
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RSP: 0018:ffffc9001b607c50 EFLAGS: 00000246
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RAX: 0000000000000000 RBX: ffff88de44ede7e8 RCX: 0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88de44ede7f8
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RBP: ffff88de0a43ca00 R08: ffff88dec0c63120 R09: ffff88de0a43ca00
Dec 10 07:36:56 popeye-mgr-0-10 kernel: R10: ffffffffa07e50e9 R11: ffff88de36aab878 R12: ffff88de1f3cab00
Dec 10 07:36:56 popeye-mgr-0-10 kernel: R13: ffff88de44edeb18 R14: ffff88de1f3cabc0 R15: ffff88de44edeb18
Dec 10 07:36:56 popeye-mgr-0-10 kernel: FS:  0000000000000000(0000) GS:ffff88dec0c40000(0000) knlGS:0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 10 07:36:56 popeye-mgr-0-10 kernel: CR2: 000055555584c300 CR3: 0000005e92331003 CR4: 00000000007606e0
Dec 10 07:36:56 popeye-mgr-0-10 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 10 07:36:56 popeye-mgr-0-10 kernel: PKRU: 55555554
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Call Trace:
Dec 10 07:36:56 popeye-mgr-0-10 kernel:  ceph_queue_cap_snap+0x3e/0x3e0 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel:  queue_realm_cap_snaps+0x71/0xf0 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel:  ceph_update_snap_trace+0x382/0x400 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel:  handle_reply+0x209/0x770 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel:  dispatch+0xb2/0x1e0 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel:  process_message+0x7b/0x130 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel:  try_read+0x340/0x5e0 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel:  ceph_con_workfn+0x79/0x210 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel:  process_one_work+0x1a6/0x300
Dec 10 07:36:56 popeye-mgr-0-10 kernel:  worker_thread+0x41/0x310
Dec 10 07:36:56 popeye-mgr-0-10 kernel:  ? rescuer_thread+0x370/0x370
Dec 10 07:36:56 popeye-mgr-0-10 kernel:  kthread+0x10b/0x130
Dec 10 07:36:56 popeye-mgr-0-10 kernel:  ? kthread_create_worker_on_cpu+0x60/0x60
Dec 10 07:36:56 popeye-mgr-0-10 kernel:  ret_from_fork+0x1f/0x30

Dec 10 07:40:17 popeye-mgr-0-10 kernel: CPU: 18 PID: 1350556 Comm: kworker/18:0 Not tainted 5.4.114.1.fi #1
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 2.10.2 02/24/2021
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RIP: 0010:_raw_spin_lock+0xb/0x20
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Code: 31 c0 ba ff 00 00 00 f0 0f b1 17 75 01 c3 e9 ec 8f 98 ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 89 c6 e9 0b 7e 98 ff 90 66 2e 0f 1f 84 00 00 00 00 00 fa
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RSP: 0018:ffffc9001b607ca0 EFLAGS: 00000246
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RAX: 0000000000000000 RBX: ffff88de62112ef0 RCX: 0000027f51c57994
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88de455b4cd0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RBP: ffff88de62112bc0 R08: 0000000000000000 R09: ffff88dbf2742600
Dec 10 07:40:17 popeye-mgr-0-10 kernel: R10: ffffffffa07e50e9 R11: ffff88de62112d88 R12: ffff88de455b4c00
Dec 10 07:40:17 popeye-mgr-0-10 kernel: R13: ffff88de62112ef0 R14: ffff88de455b4cc0 R15: ffff88de455b4cd0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: FS:  0000000000000000(0000) GS:ffff88dec0c40000(0000) knlGS:0000000000000000
Dec 10 07:40:17 popeye-mgr-0-10 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 10 07:40:17 popeye-mgr-0-10 kernel: CR2: 000055555584c300 CR3: 0000005e92331003 CR4: 00000000007606e0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 10 07:40:17 popeye-mgr-0-10 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 10 07:40:17 popeye-mgr-0-10 kernel: PKRU: 55555554
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Call Trace:
Dec 10 07:40:17 popeye-mgr-0-10 kernel:  queue_realm_cap_snaps+0x79/0xf0 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel:  ceph_update_snap_trace+0x382/0x400 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel:  handle_reply+0x209/0x770 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel:  dispatch+0xb2/0x1e0 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel:  process_message+0x7b/0x130 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel:  try_read+0x340/0x5e0 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel:  ceph_con_workfn+0x79/0x210 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel:  process_one_work+0x1a6/0x300
Dec 10 07:40:17 popeye-mgr-0-10 kernel:  worker_thread+0x41/0x310
Dec 10 07:40:17 popeye-mgr-0-10 kernel:  ? rescuer_thread+0x370/0x370
Dec 10 07:40:17 popeye-mgr-0-10 kernel:  kthread+0x10b/0x130
Dec 10 07:40:17 popeye-mgr-0-10 kernel:  ? kthread_create_worker_on_cpu+0x60/0x60
Dec 10 07:40:17 popeye-mgr-0-10 kernel:  ret_from_fork+0x1f/0x30
Turning on kernel debugging for fs/ceph/snap.c, there seems this loop in queue_realm_cap_snaps is really long for some inodes:
list_for_each_entry(ci, &realm->inodes_with_caps, i_snap_realm_item) {
...
}
[3199649.882973] ceph: queue_realm_cap_snaps 00000000d313f1e4 1001af8c8e7 inodes
[3199649.882974] ceph: queue_cap_snap 00000000171f11a7 nothing dirty|writing
[3199649.882975] ceph: queue_cap_snap 00000000938b9cd2 nothing dirty|writing
[3199649.882976] ceph: queue_cap_snap 00000000615cf4dd nothing dirty|writing
[3199649.882977] ceph: queue_cap_snap 000000000027e295 nothing dirty|writing
[3199649.882979] ceph: queue_cap_snap 00000000ba18b2f8 nothing dirty|writing
[3199649.882980] ceph: queue_cap_snap 000000007c9c80de nothing dirty|writing
[3199649.882981] ceph: queue_cap_snap 00000000629b4b0e nothing dirty|writing
[3199649.882982] ceph: queue_cap_snap 00000000ab330b37 nothing dirty|writing
[3199649.882983] ceph: queue_cap_snap 00000000c7dbc320 nothing dirty|writing
[3199649.882985] ceph: queue_cap_snap 0000000070a0598f nothing dirty|writing
[3199649.882986] ceph: queue_cap_snap 00000000915b9e2e nothing dirty|writing
... (and a lot lot more of these) ...
At this point the client has about a million caps (running up against the default cap limit), so potentially this loop runs over all the caps, which could mean tens to hundreds of milliseconds per snap update. Indeed, reducing mds_max_caps_per_client by an order of magnitude improves the lstat times by about an order of magnitude (which is still pretty slow, but supports this hypothesis).
The ceph cluster is Nautilus 14.2.20. There are a total of 7 snapshots in cephfs, all taken at the root of the cephfs tree (a rolling set of 7 previous daily snapshots).
I've tested this with a few kernels (two LTS ones and one more recent stable one): 5.4.114, 5.10.73 and 5.14.16, all with the same result. Any ideas/suggestions?
Updated by Jeff Layton about 2 years ago
Some thoughts...
This is a pretty nasty problem, in that the MDS is essentially handing the client an unbounded amount of work to do. Worse, the client currently does this in the receive thread, so it effectively blocks any receive off the socket until it completes.
One thing we could do here is simply offload this work to a workqueue job. I don't think we necessarily need to do all of this synchronously as part of handling a reply.
That said, the comment at the top of ceph_update_snap_trace indicates that you need to hold the snap_rwsem for write, which is a large coarse-grained mutex. Offloading this to a workqueue may not help much if everything else ends up having to wait for that mutex. Maybe it can drop and retake it while it's doing its work?
Finally, rebuild_snap_realms uses recursion in the kernel, and I don't see what bounds that. Could a sufficiently deep snaprealm hierarchy cause it to blow out the stack? If someone plans on reworking this code, it might be good to come up with some way to do that iteratively instead of relying on recursion.
Updated by Patrick Donnelly about 2 years ago
- Related to Bug #54106: kclient: hang during workunit cleanup added
Updated by Jeff Layton about 2 years ago
I really don't see a great fix for this anywhere. So much of this work requires holding coarse-grained mutexes that I don't see how we can offload the work in any useful fashion. Aside from the unbounded recursion in this code, there is also this locking trainwreck (along with my newly added comments):
static void queue_realm_cap_snaps(struct ceph_snap_realm *realm)
{
	struct ceph_inode_info *ci;
	struct inode *lastinode = NULL;

	dout("queue_realm_cap_snaps %p %llx inodes\n", realm, realm->ino);

	/*
	 * Barf. We're walking the loop with list_for_each_entry and dropping
	 * the spinlock on each entry. What protects the current entry from
	 * being deleted?
	 *
	 * Supposedly, the inode reference (which I guess protects us from
	 * removal during eviction) and the snap_rwsem (which supposedly
	 * prevents changes to the realm hierarchy).
	 *
	 * You can also end up removing an inode from the list in response
	 * to a cap revoke or a forced umount too, and I'm not sure those
	 * scenarios are adequately excluded here.
	 */
	spin_lock(&realm->inodes_with_caps_lock);
	list_for_each_entry(ci, &realm->inodes_with_caps, i_snap_realm_item) {
		struct inode *inode = igrab(&ci->vfs_inode);

		if (!inode)
			continue;
		spin_unlock(&realm->inodes_with_caps_lock);
		iput(lastinode);
		lastinode = inode;
		ceph_queue_cap_snap(ci);
		spin_lock(&realm->inodes_with_caps_lock);
	}
	spin_unlock(&realm->inodes_with_caps_lock);
	iput(lastinode);

	dout("queue_realm_cap_snaps %p %llx done\n", realm, realm->ino);
}
We may need to redesign some of this from scratch.
Updated by Greg Farnum about 2 years ago
Hmm. IIRC the equivalent code on the user space client has much less trouble because the SnapRealm has a list of inodes which are contained in that SnapRealm, whereas the kernel client doesn't and so has to iterate through every inode.
I don't know why the kclient doesn't maintain those links; perhaps we can just add them and resolve it that way?
Updated by Greg Farnum about 2 years ago
- Assignee changed from Jeff Layton to Xiubo Li
Updated by Xiubo Li about 2 years ago
- Status changed from In Progress to Fix Under Review
Updated by Stefan Kooman almost 2 years ago
The Patchwork link from @xiubo Li doesn't work for me. Has this been merged in upstream kernel already?
Updated by Xiubo Li almost 2 years ago
Stefan Kooman wrote:
The Patchwork link from @xiubo Li doesn't work for me. Has this been merged in upstream kernel already?
Yeah, the commit is:
commit 2e586641c950e7f3e7e008404bd783a466b9b590
Author: Xiubo Li <xiubli@redhat.com>
Date:   Sat Feb 19 14:28:33 2022 +0800

    ceph: do not update snapshot context when there is no new snapshot
Updated by Xiubo Li almost 2 years ago
- Status changed from Fix Under Review to Resolved
Updated by Ilya Kogan about 1 year ago
Has this been released? I believe I'm hitting it with Ceph 17.2.5 and kernel 5.4.0-136 on Ubuntu 20.04.