Bug #57882
Closed · Linux kernel client - Bug #56531: CephFS Mounts via Linux kernel not releasing locks
Kernel Oops, kernel NULL pointer dereference
Description
(Reposted from the Ceph project (#57613); I couldn't find a way to move a bug entry from one project to another.)
Hello everyone,
First of all, I hope I won't get any backlash for submitting a bug report from a Proxmox-centered setup, but I've tried their forums with not a single answer, and you might actually be the best suited to give us a hand.
We (my client) have a 4-node setup, 3 of which hold NVMe disks as OSDs, hosting multiple containers and VMs on RBD and, most importantly, using mount points within some containers from a CephFS pool, as it serves mostly live web content. Performance is good, and PGs resync at an incredible pace whenever there's a hiccup.
Servers are all hosted by OVH, AMD EPYC-based for most of them, with 12 Gb/s internal networking (besides one at 6 Gb/s, but it's not an OSD holder and only has the Ceph client tools installed, and it has never given us any issues as far as I recall).
Problem being:
One of the nodes will randomly have a kernel oops (detailed oopses in the gists below). They can be a month apart, or recently closer to once a day, often at times we'd rather be sleeping (1 AM, 7 AM...), and the lack of rest is starting to be a pain.
I don't know if we can call this a regression, but I don't remember having this issue under Nautilus.
Currently running Ceph 16.2.9 + kernel 5.15.53:
pve-kernel-5.15.53-1-pve: 5.15.53-1
ceph: 16.2.9-pve1
The last crash, this morning at 7:08, was on a fully updated (BIOS + microcode) node too; there are no visible signs of network hiccups when this happens.
What it does trigger is that one, sometimes two, of the hosted containers will lock up and start spawning processes (being web backends, these are php-fpm processes) up to the max_children mark (node load shoots up to 400 or 800). Other containers seem functional; SQL containers, which don't use CephFS, never have any issue. At that point, I kill the affected container(s)' launcher process and reboot the node.
Also, ceph -s shows a few little things, but I haven't caught them in time recently, and there is no ceph crash entry to be found.
mds.twix(mds.0): Client kira failing to respond to capability release client_id: 42244108
Found in the logs on oky:
...
2022-09-20T07:08:32.425026+0200 mds.twix (mds.0) 27 : cluster [WRN] client.42244108 isn't responding to mclientcaps(revoke), ino 0x1001ad1c5c2 pending pAsLsXsFr issued pAsLsXsFscr, sent 30.374676 seconds ago
2022-09-20T07:08:33.335865+0200 mgr.kira (mgr.42244276) 54315 : cluster [DBG] pgmap v54592: 193 pgs: 193 active+clean; 1.2 TiB data, 3.6 TiB used, 17 TiB / 21 TiB avail; 44 MiB/s rd, 1.9 MiB/s wr, 4.11k op/s
2022-09-20T07:08:33.604824+0200 mon.kira (mon.0) 161875 : cluster [WRN] Health check failed: 1 clients failing to respond to capability release (MDS_CLIENT_LATE_RELEASE)
2022-09-20T07:08:33.612719+0200 mon.kira (mon.0) 161876 : cluster [DBG] mds.0 [v2:10.137.99.3:6800/202654321,v1:10.137.99.3:6801/202654321] up:active
2022-09-20T07:08:33.612750+0200 mon.kira (mon.0) 161877 : cluster [DBG] fsmap cephfs:1 {0=twix=up:active} 1 up:standby-replay 1 up:standby
...
2022-09-20T07:09:02.460259+0200 mds.twix (mds.0) 28 : cluster [WRN] client.42244108 isn't responding to mclientcaps(revoke), ino 0x1001ad1c5c2 pending pAsLsXsFr issued pAsLsXsFscr, sent 60.409911 seconds ago
...
2022-09-20T07:10:00.000085+0200 mon.kira (mon.0) 162004 : cluster [WRN] Health detail: HEALTH_WARN 1 clients failing to respond to capability release
2022-09-20T07:10:00.000101+0200 mon.kira (mon.0) 162005 : cluster [WRN] [WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
2022-09-20T07:10:00.000106+0200 mon.kira (mon.0) 162006 : cluster [WRN] mds.twix(mds.0): Client kira failing to respond to capability release client_id: 42244108
2022-09-20T07:10:01.361599+0200 mgr.kira (mgr.42244276) 54359 : cluster [DBG] pgmap v54636: 193 pgs: 193 active+clean; 1.2 TiB data, 3.6 TiB used, 17 TiB / 21 TiB avail; 74 MiB/s rd, 3.2 MiB/s wr, 6.55k op/s
2022-09-20T07:10:02.693658+0200 mds.twix (mds.0) 29 : cluster [WRN] client.42244108 isn't responding to mclientcaps(revoke), ino 0x1001ad1c5c2 pending pAsLsXsFr issued pAsLsXsFscr, sent 120.643314 seconds ago
...
2022-09-20T07:12:02.791364+0200 mds.twix (mds.0) 30 : cluster [WRN] client.42244108 isn't responding to mclientcaps(revoke), ino 0x1001ad1c5c2 pending pAsLsXsFr issued pAsLsXsFscr, sent 240.741010 seconds ago
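For what it's worth, the repeated revoke warnings above all point at the same client and inode. A throwaway parser (a sketch assuming only the log line format quoted above, nothing Ceph-specific beyond it) can confirm that across a longer log and show how long each revoke has been pending:

```python
import re

# Field layout taken from the MDS warning lines quoted above,
# not from any formal log specification.
REVOKE_RE = re.compile(
    r"client\.(?P<client>\d+) isn't responding to mclientcaps\(revoke\), "
    r"ino (?P<ino>0x[0-9a-f]+) pending (?P<pending>\S+) issued (?P<issued>\S+), "
    r"sent (?P<sent>[\d.]+) seconds ago"
)

def stuck_caps(log_lines):
    """Map (client, inode) pairs to the longest observed revoke delay."""
    worst = {}  # (client, ino) -> max seconds since the revoke was sent
    for line in log_lines:
        m = REVOKE_RE.search(line)
        if m:
            key = (m["client"], m["ino"])
            worst[key] = max(worst.get(key, 0.0), float(m["sent"]))
    return worst

sample = [
    "2022-09-20T07:08:32.425026+0200 mds.twix (mds.0) 27 : cluster [WRN] "
    "client.42244108 isn't responding to mclientcaps(revoke), ino 0x1001ad1c5c2 "
    "pending pAsLsXsFr issued pAsLsXsFscr, sent 30.374676 seconds ago",
    "2022-09-20T07:12:02.791364+0200 mds.twix (mds.0) 30 : cluster [WRN] "
    "client.42244108 isn't responding to mclientcaps(revoke), ino 0x1001ad1c5c2 "
    "pending pAsLsXsFr issued pAsLsXsFscr, sent 240.741010 seconds ago",
]
print(stuck_caps(sample))  # {('42244108', '0x1001ad1c5c2'): 240.74101}
```

In this excerpt the delay only grows for a single (client, inode) pair, which matches the "1 clients failing to respond to capability release" health warning.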
This environment is used in production, but we might manage to allow ourselves some debugging on one of the nodes if needed.
Any pointers on how to help you help me would be a blessing
MDS are set up as one active + one standby-replay, no multi-active.
A private network is used for Ceph and Proxmox clustering; the hosting company doesn't give us many options anyway, but non-Ceph traffic is pretty light in the end.
Ask me whatever you want to know :)
[global]
    auth_client_required = cephx
    auth_cluster_required = cephx
    auth_service_required = cephx
    cluster_network = 10.137.99.1/24
    fsid = e7682293-7300-4929-bd36-3ff0e1907c90
    mon_allow_pool_delete = true
    mon_host = 10.137.99.2 10.137.99.10 10.137.99.3 10.137.99.4
    ms_bind_ipv4 = true
    ms_bind_ipv6 = false
    osd_pool_default_min_size = 2
    osd_pool_default_size = 2
    osd_scrub_auto_repair = true
    public_network = 10.137.99.1/24

[client]
    keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
    keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.kira]
    host = kira
    mds_standby_for_name = pve

[mds.oky]
    host = oky
    mds_standby_for_name = pve

[mds.twix]
    host = twix
    mds_standby_for_name = pve

[mon.kira]
    public_addr = 10.137.99.2

[mon.oky]
    public_addr = 10.137.99.4

[mon.stonks]
    public_addr = 10.137.99.10

[mon.twix]
    public_addr = 10.137.99.3
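As a side note, when chasing which mount actually holds the stuck caps, the JSON from `ceph tell mds.<name> session ls` can be filtered for sessions with an unusually high cap count. A minimal sketch, assuming the Pacific-era output fields `id`, `num_caps` and `client_metadata.hostname` (verify against your own output before relying on these names; the sample data below is made up for illustration):

```python
import json

def late_release_suspects(session_ls_json, cap_threshold=10000):
    """Return (client id, hostname, num_caps) for sessions holding many caps.

    Assumes the JSON shape produced by `ceph tell mds.<name> session ls`
    on Pacific; the field names are an assumption, not a guarantee.
    """
    sessions = json.loads(session_ls_json)
    suspects = [
        (s["id"], s.get("client_metadata", {}).get("hostname", "?"), s["num_caps"])
        for s in sessions
        if s.get("num_caps", 0) >= cap_threshold
    ]
    # Worst offenders first.
    return sorted(suspects, key=lambda t: -t[2])

# Hypothetical, abbreviated `session ls` output for illustration only:
sample = json.dumps([
    {"id": 42244108, "num_caps": 123456, "client_metadata": {"hostname": "kira"}},
    {"id": 42244109, "num_caps": 42, "client_metadata": {"hostname": "oky"}},
])
print(late_release_suspects(sample))  # [(42244108, 'kira', 123456)]
```

Once a session is confirmed stuck, `ceph tell mds.<name> client evict id=<id>` is the usual last resort, though with a kernel-client mount inside a container a node reboot (as described above) may still end up being necessary.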
Crashes with stack traces from dmesg:
https://gist.github.com/happyjaxx/3e5c13582e3609018ef24523e054791d
https://gist.github.com/happyjaxx/ccf2e58dbd0fb5e8f8c873110878a150
I'm going to monitor network stability, but it might be a shot in the dark.
Thanks in advance,
JaXX./.
Updated by Xiubo Li over 1 year ago
- Status changed from New to Duplicate
- Assignee set to Xiubo Li
- Parent task set to #56531
It's a known bug and I will check this today or this week.
Updated by Julien Banchet over 1 year ago
Xiubo Li wrote:
It's a known bug and I will check this today or this week.
Oh my! I did search for anything pre-existing, but I might not have used the best search terms...
Relieved to see it's being taken care of. I'm going to tell my client; I'm pretty sure they're going to be happy too :)
Thank you !
JaXX./.