Project

General

Profile

Bug #44100

cephfs rsync kworker high load.

Added by none none almost 3 years ago. Updated 6 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have an rsync backup running which became 10 hours. When I test with rsync one instance, it looks like it processes quite few files fast, then 20 minutes nothing happens and then processes 4 files and finishes correctly (without rsync errors)

I already posted some related questions on the mailing list
https://www.mail-archive.com/ceph-users@ceph.io/msg01718.html

echo l > /proc/sysrq-trigger

[6310351.557432] CPU: 5 PID: 3453020 Comm: kworker/5:1 Not tainted 3.10.0-1062.4.3.el7.x86_64 #1
[6310351.557433] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0b 05/27/2014
[6310351.557435] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[6310351.557436] task: ffffa0d29fbe20e0 ti: ffffa0c697b48000 task.ti: ffffa0c697b48000
[6310351.557437] RIP: 0010:[<ffffffffc08d7f70>] [<ffffffffc08d7f70>] build_snap_context+0x20/0x290 [ceph]
[6310351.557438] RSP: 0018:ffffa0c697b4baa8 EFLAGS: 00000292
[6310351.557439] RAX: ffffa0c699949188 RBX: ffffa0c699949100 RCX: 00000000fffffff8
[6310351.557440] RDX: ffffa0c697b4bb68 RSI: ffffa0c697b4bb68 RDI: ffffa0c699949100
[6310351.557441] RBP: ffffa0c697b4bad8 R08: ffffa0c699948e00 R09: 0000000000027247
[6310351.557442] R10: 0000000000000036 R11: 0000000000000000 R12: ffffa0c697b4bb68
[6310351.557443] R13: ffffa0c697b4bb68 R14: ffffa0cda3a37b00 R15: ffffa0cda3a37b00
[6310351.557444] FS: 0000000000000000(0000) GS:ffffa0d2cfb40000(0000) knlGS:0000000000000000
[6310351.557445] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[6310351.557446] CR2: 00007fae085ac000 CR3: 0000001813010000 CR4: 00000000000627e0
[6310351.557447] Call Trace:
[6310351.557448] [<ffffffffc08d820c>] rebuild_snap_realms+0x2c/0x90 [ceph]
[6310351.557449] [<ffffffffc08d822b>] rebuild_snap_realms+0x4b/0x90 [ceph]
[6310351.557450] [<ffffffffc08d91fc>] ceph_update_snap_trace+0x3ec/0x530 [ceph]
[6310351.557452] [<ffffffffc08e2239>] handle_reply+0x359/0xc60 [ceph]
[6310351.557453] [<ffffffffc08e48ba>] dispatch+0x11a/0xb00 [ceph]
[6310351.557454] [<ffffffffb042e56a>] ? kernel_recvmsg+0x3a/0x50
[6310351.557455] [<ffffffffc05fcff4>] try_read+0x544/0x1300 [libceph]
[6310351.557456] [<ffffffffafee13ce>] ? account_entity_dequeue+0xae/0xd0
[6310351.557457] [<ffffffffafee4d5c>] ? dequeue_entity+0x11c/0x5e0
[6310351.557458] [<ffffffffb042e417>] ? kernel_sendmsg+0x37/0x50
[6310351.557459] [<ffffffffc05fdfb4>] ceph_con_workfn+0xe4/0x1530 [libceph]
[6310351.557460] [<ffffffffb057f568>] ? __schedule+0x448/0x9c0
[6310351.557461] [<ffffffffafebe21f>] process_one_work+0x17f/0x440
[6310351.557462] [<ffffffffafebf336>] worker_thread+0x126/0x3c0
[6310351.557463] [<ffffffffafebf210>] ? manage_workers.isra.26+0x2a0/0x2a0
[6310351.557464] [<ffffffffafec61f1>] kthread+0xd1/0xe0
[6310351.557465] [<ffffffffafec6120>] ? insert_kthread_work+0x40/0x40
[6310351.557466] [<ffffffffb058cd37>] ret_from_fork_nospec_begin+0x21/0x21
[6310351.557467] [<ffffffffafec6120>] ? insert_kthread_work+0x40/0x40
[6310351.557469] Code: 41 5e 5d c3 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 f4 53 48 89 fb 48 83 ec 08 4c 8b 7f 70 <44> 8b 6f 68 44 03 6f 58 4d 85 ff 74 14 49 8b 87 b8 00 00 00 48


Related issues

Related to CephFS - Bug #54106: kclient: hang during workunit cleanup Duplicate

History

#1 Updated by none none almost 3 years ago

Did another test rsyncing 2 files, 16 minutes???

  1. concurr. link backup test2 ###
  2. start:15:06:13 ###
    rsync -avH --del --one-file-system --exclude-from=/home/backup//excluded-dirs --include-from=/home/backup//included-dirs --timeout=0 --ignore-errors --numeric-ids --inplace -e "ssh -o Compression=no -x -o StrictHostKeyChecking=no -b 192.168.10.57" --link-dest=../4/ test2:/ test2:/home test2/5/
    receiving incremental file list
    ./
    var/lib/chrony/
    var/lib/chrony/drift
    var/lib/rsyslog/
    var/lib/rsyslog/imjournal.state
    var/log/sa/sa12
sent 134,377 bytes received 27,215,506 bytes 27,696.08 bytes/sec
total size is 29,074,357,908 speedup is 1,063.05
  1. end:15:22:40 ###

#2 Updated by none none almost 3 years ago

PS. this kworker is only appearing with the 2nd rsync process. I already have a rsync session running copying from different source. With that process I do not see kworker spawning.

#3 Updated by none none almost 3 years ago

Maybe it is related to renewing capabilities? Or is it normal that the mds is asked to renew 132k caps so often?

{
"id": 3736134,
"num_leases": 0,
"num_caps": 132103,
"state": "open",
"request_load_avg": 31901,
"uptime": 875590.07943545096,
"replay_requests": 0,
"completed_requests": 0,
"reconnecting": false,
"inst": "client.3736134 v1:192.168.10.114:0/696898030",
"client_metadata": {
"features": "00000000000000ff",
"entity_id": "cephfs.backup",
"hostname": "c04",
"kernel_version": "3.10.0-1062.4.3.el7.x86_64",
"root": "/backup"
}
}

2020-02-12 16:48:40.564 7fae0ade8700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 325244) from client.3736134
2020-02-12 16:49:00.570 7fae0ade8700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 325245) from client.3736134
2020-02-12 16:49:20.566 7fae0ade8700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 325246) from client.3736134
2020-02-12 16:49:40.569 7fae0ade8700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 325247) from client.3736134
2020-02-12 16:50:00.570 7fae0ade8700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 325248) from client.3736134
2020-02-12 16:50:20.570 7fae0ade8700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 325249) from client.3736134
2020-02-12 16:50:40.569 7fae0ade8700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 325250) from client.3736134

#4 Updated by none none almost 3 years ago

none none wrote:

PS. this kworker is only appearing with the 2nd rsync process. I already have a rsync session running copying from different source. With that process I do not see kworker spawning.

With one rsync process it is the same also kworker high load

#5 Updated by none none almost 3 years ago

If I unmount the cephfs and mount it again, problems seems to be gone.

#6 Updated by none none almost 3 years ago

Tried to work around this by not using multiple cephfs mounts. First process finished quickly, 2nd had kworker high load again.

From this:
c01,c02,c03:/backup /home/backup ceph name=cephfs.backup,secretfile=key,noatime,_netdev 0 0

To this:
c01,c02,c03:/backup /home/backup ceph name=cephfs.backup,secretfile=key,noatime,_netdev 0 0
c01,c02,c03:/backup/h1 /home/backup/h1 ceph name=cephfs.backup,secretfile=key,noatime,_netdev 0 0
c01,c02,c03:/backup/h2 /home/backup/h2 ceph name=cephfs.backup,secretfile=key,noatime,_netdev 0 0
c01,c02,c03:/backup/h3 /home/backup/h3 ceph name=cephfs.backup,secretfile=key,noatime,_netdev 0 0

so concurrent rsync's have seperate clients at the mds.

#7 Updated by Zheng Yan almost 3 years ago

could you check if this still happen with upstream 5.5 kernel

#8 Updated by none none almost 3 years ago

Zheng Yan wrote:

could you check if this still happen with upstream 5.5 kernel

From where should I get it? elrepo? That is not really Redhat/Centos kernel not?

#9 Updated by Greg Farnum almost 3 years ago

  • Project changed from Ceph to CephFS

#10 Updated by Jeff Layton almost 3 years ago

none none wrote:

Zheng Yan wrote:

could you check if this still happen with upstream 5.5 kernel

From where should I get it? elrepo? That is not really Redhat/Centos kernel not?

No, it's not. You would probably need to build a kernel (or run this load on a newer distro like Fedora). Sometimes

The RHEL7 kernel lacks a lot of patches that have gone in upstream in the last few years. Running your test on something closer to mainline would tell us whether this is something already fixed, or whether we need to fix something upstream too.

We could also need some details about your snapshots. How many do you have?

#11 Updated by none none over 2 years ago

I did not have time to install the 5.5 kernel yet on the rsync server. Today I noticed on a CentOS7 vm, with just 1 vcpu assigned, a similar behaviour. When another vm, was using the smb share, the kworker went up, stalling even the top session.

11001 root      20   0       0      0      0 R 89.1  0.0   0:22.16 kworker/0:1
11005 backupm+  20   0  469172  15304   6720 D  5.9  1.5   0:03.05 smbd
    6 root      20   0       0      0      0 R  1.0  0.0   1:51.03 ksoftirqd/0
    9 root      20   0       0      0      0 S  1.0  0.0   0:15.92 rcu_sched
  364 root      20   0       0      0      0 S  1.0  0.0   0:38.20 xfsaild/dm-0
10944 root      20   0  164120   2256   1560 R  1.0  0.2   0:02.71 top
    1 root      20   0   43436   3248   1964 S  0.0  0.3   0:06.63 systemd

After installing the kernel-ml 5.5 there, the problem seemed to be much less, however I am getting these messages

[Mon Mar  9 23:19:57 2020] ceph: Can't lookup inode 1 (err: -13)
[Mon Mar  9 23:21:54 2020] ceph: Can't lookup inode 1 (err: -13)
[Mon Mar  9 23:28:59 2020] ceph: Can't lookup inode 1 (err: -13)

I assume these are just not printed when using the 3.10 kernel. Is this a caps issue/short coming of my cephfs user/id?

#12 Updated by none none over 2 years ago

Not really solved it with the 5.5 kernel-ml on bare metal server

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
1566 root 20 0 0 0 0 R 99.0 0.0 5:29.24 kworker/12:2+ce

#13 Updated by Jonas Jelten almost 2 years ago

I also encountered this on Ubuntu 20.04 Linux 5.4.0-60-generic #67-Ubuntu SMP Tue Jan 5 18:31:36 UTC 2021
The cluster is Nautilus, MDS are on 14.2.16.

[  +0.000000] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[  +0.000001] RIP: 0010:cmpu64_rev+0x1c/0x20 [ceph]
[  +0.000001] Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 55 48 8b 16 b8 01 00 00 00 48 89 e5 48 39 17 72 08 0f 97 c0 0f b6 c0 <f7> d8 5d c3 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 49 89 fd 41 >
[  +0.000001] RSP: 0018:ffffb25b4867bb30 EFLAGS: 00000202
[  +0.000000] RAX: 0000000000000001 RBX: 0000000000000008 RCX: 0000000000000063
[  +0.000001] RDX: 0000000000000071 RSI: ffff9db0561a0a98 RDI: ffff9db0561a0a90
[  +0.000000] RBP: ffffb25b4867bb30 R08: 0000000000000138 R09: ffffffffc0456080
[  +0.000001] R10: 0000000000000d40 R11: 0000000000000001 R12: 0000000000000000
[  +0.000000] R13: 0000000000000278 R14: ffff9db0561a0818 R15: 0000000000000280
[  +0.000001] FS:  0000000000000000(0000) GS:ffff9db13ed40000(0000) knlGS:0000000000000000
[  +0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 00007f293d11d004 CR3: 00000019c8e0a000 CR4: 0000000000340ee0
[  +0.000000] Call Trace:
[  +0.000000]  sort_r+0xbd/0x240
[  +0.000001]  ? ceph_ino_compare+0x30/0x30 [ceph]
[  +0.000000]  sort+0xe/0x10
[  +0.000000]  build_snap_context+0x126/0x260 [ceph]
[  +0.000001]  rebuild_snap_realms+0x28/0x80 [ceph]
[  +0.000000]  rebuild_snap_realms+0x42/0x80 [ceph]
[  +0.000001]  ceph_update_snap_trace+0x207/0x4e0 [ceph]
[  +0.000000]  handle_reply+0x23d/0x6a0 [ceph]
[  +0.000000]  dispatch+0xe9/0x610 [ceph]
[  +0.000001]  ? __switch_to_asm+0x34/0x70
[  +0.000000]  ? __switch_to_asm+0x40/0x70
[  +0.000000]  ? __switch_to_asm+0x34/0x70
[  +0.000000]  ? __switch_to_asm+0x40/0x70
[  +0.000001]  ? __switch_to_asm+0x34/0x70
[  +0.000000]  try_read+0x566/0x8c0 [libceph]
[  +0.000000]  ? __switch_to_asm+0x40/0x70
[  +0.000001]  ? __switch_to_asm+0x34/0x70
[  +0.000000]  ? __switch_to_asm+0x40/0x70
[  +0.000000]  ? __switch_to+0x7f/0x470
[  +0.000000]  ? __switch_to_asm+0x40/0x70
[  +0.000001]  ? __switch_to_asm+0x34/0x70
[  +0.000000]  ceph_con_workfn+0x130/0x620 [libceph]
[  +0.000000]  process_one_work+0x1eb/0x3b0
[  +0.000001]  worker_thread+0x4d/0x400
[  +0.000000]  kthread+0x104/0x140
[  +0.000000]  ? process_one_work+0x3b0/0x3b0
[  +0.000000]  ? kthread_park+0x90/0x90
[  +0.000001]  ret_from_fork+0x22/0x40

in ceph daemon mds.server dump snaps 161 are listed.

Please advise what to do :)

This client has really slow performance now (5% iops). After rebooting it, it's gone, but reappears shortly when the machine is "used" again.

#14 Updated by Andras Sali over 1 year ago

We have also experienced a similar issue, where kernel mount performance degraded severely while doing rsync (running on CEPH 15.2.9 and a few hundred snapshots) and had to revert to using the FUSE mount for slower but more stable performance.

Others have reported in more detail similar issues: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/EI2FDN7ELTAJ4SJUNRULYTA3ZEGOYSWU/

#15 Updated by Andras Pataki 12 months ago

We've recently started using cephfs snapshots and are running into a similar issue with the kernel client. It seems like traversing the file system and stat'ing files have become extremely slow. Some (not all) stat operations went from microseconds to hundreds of milliseconds in duration. The details are from a node that runs a file system scan:

The lstat calls of the file system scan take 200ms+, here is the strace -T -tt output:

07:05:28.909309 lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/tablespoon.hpp", {st_mode=S_IFREG|0444, st_size=1145, ...}) = 0 <0.272695>
07:05:29.182086 lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/teaspoon.hpp", {st_mode=S_IFREG|0444, st_size=1133, ...}) = 0 <0.268555>
07:05:29.450680 lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/ton.hpp", {st_mode=S_IFREG|0444, st_size=1107, ...}) = 0 <0.270040>
07:05:29.720758 lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/yard.hpp", {st_mode=S_IFREG|0444, st_size=1051, ...}) = 0 <0.268032>

The file system scanning process is constantly stuck in 'D' state with /proc/<pid>/stack:

[<0>] ceph_mdsc_wait_request+0x88/0x150 [ceph]
[<0>] ceph_mdsc_do_request+0x82/0x90 [ceph]
[<0>] ceph_d_revalidate+0x207/0x300 [ceph]
[<0>] lookup_fast+0x179/0x210
[<0>] walk_component+0x44/0x320
[<0>] path_lookupat+0x7b/0x220
[<0>] filename_lookup+0xa5/0x170
[<0>] vfs_statx+0x6e/0xd0
[<0>] __do_sys_newlstat+0x39/0x70
[<0>] do_syscall_64+0x4a/0xe0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

One of the kernel threads is using close to 100% of a single core constantly:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1350556 root 20 0 0 0 0 R 100.0 0.0 4:52.85 [kworker/18:0+ceph-msgr]

Dumping the kernel stacks (echo l > /proc/sysrq-trigger) shows typically waiting in ceph_queue_cap_snap or queue_realm_cap_snaps. A couple of examples:

Dec 10 07:36:56 popeye-mgr-0-10 kernel: CPU: 18 PID: 1350556 Comm: kworker/18:0 Not tainted 5.4.114.1.fi #1
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 2.10.2 02/24/2021
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RIP: 0010:_raw_spin_lock+0xb/0x20
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Code: 31 c0 ba ff 00 00 00 f0 0f b1 17 75 01 c3 e9 ec 8f 98 ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 89 c6 e9 0b 7e 98 ff 90 66 2e 0f 1f 84 00 00 00 00 00 fa
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RSP: 0018:ffffc9001b607c50 EFLAGS: 00000246
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RAX: 0000000000000000 RBX: ffff88de44ede7e8 RCX: 0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88de44ede7f8
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RBP: ffff88de0a43ca00 R08: ffff88dec0c63120 R09: ffff88de0a43ca00
Dec 10 07:36:56 popeye-mgr-0-10 kernel: R10: ffffffffa07e50e9 R11: ffff88de36aab878 R12: ffff88de1f3cab00
Dec 10 07:36:56 popeye-mgr-0-10 kernel: R13: ffff88de44edeb18 R14: ffff88de1f3cabc0 R15: ffff88de44edeb18
Dec 10 07:36:56 popeye-mgr-0-10 kernel: FS:  0000000000000000(0000) GS:ffff88dec0c40000(0000) knlGS:0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 10 07:36:56 popeye-mgr-0-10 kernel: CR2: 000055555584c300 CR3: 0000005e92331003 CR4: 00000000007606e0
Dec 10 07:36:56 popeye-mgr-0-10 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 10 07:36:56 popeye-mgr-0-10 kernel: PKRU: 55555554
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Call Trace:
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ceph_queue_cap_snap+0x3e/0x3e0 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: queue_realm_cap_snaps+0x71/0xf0 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ceph_update_snap_trace+0x382/0x400 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: handle_reply+0x209/0x770 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: dispatch+0xb2/0x1e0 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: process_message+0x7b/0x130 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: try_read+0x340/0x5e0 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ceph_con_workfn+0x79/0x210 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: process_one_work+0x1a6/0x300
Dec 10 07:36:56 popeye-mgr-0-10 kernel: worker_thread+0x41/0x310
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ? rescuer_thread+0x370/0x370
Dec 10 07:36:56 popeye-mgr-0-10 kernel: kthread+0x10b/0x130
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ? kthread_create_worker_on_cpu+0x60/0x60
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ret_from_fork+0x1f/0x30

Dec 10 07:40:17 popeye-mgr-0-10 kernel: CPU: 18 PID: 1350556 Comm: kworker/18:0 Not tainted 5.4.114.1.fi #1
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 2.10.2 02/24/2021
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RIP: 0010:_raw_spin_lock+0xb/0x20
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Code: 31 c0 ba ff 00 00 00 f0 0f b1 17 75 01 c3 e9 ec 8f 98 ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 89 c6 e9 0b 7e 98 ff 90 66 2e 0f 1f 84 00 00 00 00 00 fa
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RSP: 0018:ffffc9001b607ca0 EFLAGS: 00000246
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RAX: 0000000000000000 RBX: ffff88de62112ef0 RCX: 0000027f51c57994
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88de455b4cd0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RBP: ffff88de62112bc0 R08: 0000000000000000 R09: ffff88dbf2742600
Dec 10 07:40:17 popeye-mgr-0-10 kernel: R10: ffffffffa07e50e9 R11: ffff88de62112d88 R12: ffff88de455b4c00
Dec 10 07:40:17 popeye-mgr-0-10 kernel: R13: ffff88de62112ef0 R14: ffff88de455b4cc0 R15: ffff88de455b4cd0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: FS:  0000000000000000(0000) GS:ffff88dec0c40000(0000) knlGS:0000000000000000
Dec 10 07:40:17 popeye-mgr-0-10 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 10 07:40:17 popeye-mgr-0-10 kernel: CR2: 000055555584c300 CR3: 0000005e92331003 CR4: 00000000007606e0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 10 07:40:17 popeye-mgr-0-10 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 10 07:40:17 popeye-mgr-0-10 kernel: PKRU: 55555554
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Call Trace:
Dec 10 07:40:17 popeye-mgr-0-10 kernel: queue_realm_cap_snaps+0x79/0xf0 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ceph_update_snap_trace+0x382/0x400 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: handle_reply+0x209/0x770 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: dispatch+0xb2/0x1e0 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: process_message+0x7b/0x130 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: try_read+0x340/0x5e0 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ceph_con_workfn+0x79/0x210 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: process_one_work+0x1a6/0x300
Dec 10 07:40:17 popeye-mgr-0-10 kernel: worker_thread+0x41/0x310
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ? rescuer_thread+0x370/0x370
Dec 10 07:40:17 popeye-mgr-0-10 kernel: kthread+0x10b/0x130
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ? kthread_create_worker_on_cpu+0x60/0x60
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ret_from_fork+0x1f/0x30

Turning on kernel debugging for fs/ceph/snap.c, there seems this loop in queue_realm_cap_snaps is really long for some inodes:

list_for_each_entry(ci, &realm->inodes_with_caps, i_snap_realm_item) {
...
}
[3199649.882973] ceph:  queue_realm_cap_snaps 00000000d313f1e4 1001af8c8e7 inodes
[3199649.882974] ceph:  queue_cap_snap 00000000171f11a7 nothing dirty|writing
[3199649.882975] ceph:  queue_cap_snap 00000000938b9cd2 nothing dirty|writing
[3199649.882976] ceph:  queue_cap_snap 00000000615cf4dd nothing dirty|writing
[3199649.882977] ceph:  queue_cap_snap 000000000027e295 nothing dirty|writing
[3199649.882979] ceph:  queue_cap_snap 00000000ba18b2f8 nothing dirty|writing
[3199649.882980] ceph:  queue_cap_snap 000000007c9c80de nothing dirty|writing
[3199649.882981] ceph:  queue_cap_snap 00000000629b4b0e nothing dirty|writing
[3199649.882982] ceph:  queue_cap_snap 00000000ab330b37 nothing dirty|writing
[3199649.882983] ceph:  queue_cap_snap 00000000c7dbc320 nothing dirty|writing
[3199649.882985] ceph:  queue_cap_snap 0000000070a0598f nothing dirty|writing
[3199649.882986] ceph:  queue_cap_snap 00000000915b9e2e nothing dirty|writing
... (and a lot lot more of these) ...

At this point the client has about a million caps (running up against the default cap limit) - so potentially this loop is over all the caps (?), which could mean tens/hundreds of milliseconds? Indeed, reducing mds_max_caps_per_client by an order of magnitude does improve the lstat times by about an order of magnitude (which is still pretty slow - but supports this hypothesis).

The ceph cluster is Nautilus 14.2.20. There are a total of 7 snapshots in cephfs, all taken at the root of the cephfs tree (a rolling set of 7 previous daily snapshots).

I've tested this with a few kernels: two LTS ones, and one more recent stable one: 5.4.114, 5.10.73 and 5.14.16 with the same result. Any ideas/suggestions?

#16 Updated by Jeff Layton 10 months ago

Some thoughts...

This is a pretty nasty problem, in that the MDS is essentially handing the client an unbounded amount of work to so. Worse, it currently does this in the receive thread, so this effectively blocks any receive off the socket until it completes.

One thing we could do here is simply offload this work to a workqueue job. I don't think we necessarily need to do all of this synchronously as part of handling a reply.

That said, the comment at the top of ceph_update_snap_trace indicates that you need to hold the snap_rwsem for write, which is a large coarse-grained mutex. Offloading this to a workqueue may not help much if everything else ends up having to wait for that mutex. Maybe it can drop and retake it while it's doing its work?

Finally, rebuild_snap_realms uses recursion in the kernel, and I don't see what bounds that. Could a sufficiently deep snaprealm hierarchy cause it to blow out the stack? If someone plans on reworking this code, it might be good to come up with some way to do that iteratively instead of relying on recursion.

#17 Updated by Jeff Layton 10 months ago

  • Assignee set to Jeff Layton

#18 Updated by Patrick Donnelly 10 months ago

  • Related to Bug #54106: kclient: hang during workunit cleanup added

#19 Updated by Jeff Layton 10 months ago

I really don't see a great fix for this anywhere. So much of this work requires holding coarse-grained mutexes that I don't see how we can offload the work in any useful fashion. Aside from the unbounded recursion in this code, there is also this locking trainwreck (along with my newly added comments):

static void queue_realm_cap_snaps(struct ceph_snap_realm *realm)
{
        struct ceph_inode_info *ci;
        struct inode *lastinode = NULL;

        dout("queue_realm_cap_snaps %p %llx inodes\n", realm, realm->ino);

        /*
         * Barf. We're walking the loop with list_for_each_entry and dropping
         * the spinlock on each entry. What protects the current entry from
         * being deleted?
         *
         * Supposedly, the inode reference (which I guess protects us from
         * removal during eviction) and the snap_rwsem (which supposedly
         * prevents changes to the realm hierarchy).
         *
         * You can also end up removing an inode from the list in response
         * to a cap revoke or a forced umount too, and I'm not sure those
         * scenarios are adequately excluded here.
         */
        spin_lock(&realm->inodes_with_caps_lock);
        list_for_each_entry(ci, &realm->inodes_with_caps, i_snap_realm_item) {
                struct inode *inode = igrab(&ci->vfs_inode);
                if (!inode)
                        continue;
                spin_unlock(&realm->inodes_with_caps_lock);
                iput(lastinode);
                lastinode = inode;
                ceph_queue_cap_snap(ci);
                spin_lock(&realm->inodes_with_caps_lock);
        }
        spin_unlock(&realm->inodes_with_caps_lock);
        iput(lastinode);

        dout("queue_realm_cap_snaps %p %llx done\n", realm, realm->ino);
}

We may need to redesign some of this from scratch.

#20 Updated by Greg Farnum 10 months ago

Hmm. IIRC the equivalent code on the user space client has much less trouble because the SnapRealm has a list of inodes which are contained in that SnapRealm, whereas the kernel client doesn't and so has to iterate through every inode.
I don't know why the kclient doesn't maintain those links; perhaps we can just add them and resolve it that way?

#21 Updated by Greg Farnum 10 months ago

  • Assignee changed from Jeff Layton to Xiubo Li

#22 Updated by Xiubo Li 10 months ago

  • Status changed from New to In Progress

Will work on it.

#23 Updated by Xiubo Li 10 months ago

  • Status changed from In Progress to Fix Under Review

#24 Updated by Stefan Kooman 6 months ago

The Patchwork link from @xiubo Li doesn't work for me. Has this been merged in upstream kernel already?

#25 Updated by Xiubo Li 6 months ago

Stefan Kooman wrote:

The Patchwork link from @xiubo Li doesn't work for me. Has this been merged in upstream kernel already?

Yeah, the commit is:

commit 2e586641c950e7f3e7e008404bd783a466b9b590
Author: Xiubo Li <xiubli@redhat.com>
Date:   Sat Feb 19 14:28:33 2022 +0800

    ceph: do not update snapshot context when there is no new snapshot

#26 Updated by Xiubo Li 6 months ago

  • Status changed from Fix Under Review to Resolved

Also available in: Atom PDF