Bug #57882
Closed · Linux kernel client - Bug #56531: CephFS Mounts via Linux kernel not releasing locks
Kernel Oops, kernel NULL pointer dereference
Description
(Reposted from the Ceph project (#57613); I couldn't find a way to move a bug entry from one project to another.)
Hello everyone,
First of all, I hope I won't get any backlash for submitting a bug report from a Proxmox-centered setup, but I've tried their forums with not a single answer, and you might actually be the best suited to give us a hand.
We (my client) have a 4-node setup, 3 of which hold NVMe disks as OSDs, hosting multiple containers and VMs on RBD and, most importantly, using mount points within some containers from a CephFS pool, as it serves mostly live web content. Performance is good, and PGs resync at an incredible pace whenever there's a hiccup.
Servers are all hosted by OVH, AMD EPYC-based for most of them, with 12 Gb/s internal networking (besides one at 6 Gb/s, but it's not an OSD holder and only has the Ceph client tools installed, and it has never given us any issues as far as I recall).
Problem being:
One of the nodes will randomly have a kernel oops (detailed oopses in the gists below). They can be a month apart, or recently closer to once a day, often at times we'd rather be sleeping (1 AM, 7 AM...), and the lack of rest is starting to be a pain.
I don't know if we can call this a regression, but I don't remember having this issue under Nautilus.
Currently running Ceph 16.2.9 + kernel 5.15.53:
pve-kernel-5.15.53-1-pve: 5.15.53-1
ceph: 16.2.9-pve1
The last crash, this morning at 7:08, was on a fully updated (BIOS + microcode) node too; there are no visible signs of network hiccups when this happens.
What it does trigger is that one, sometimes two, of the hosted containers will lock up and start spawning processes (being web backends, these are php-fpm processes) up to the max_children mark (node load shoots up to 400 or 800). Other containers seem functional; SQL containers, which don't use CephFS, never have any issue. At that point, I kill the affected container(s)' launcher process and reboot the node.
Also, ceph -s shows a few little things, but I haven't caught them in time recently, and there is no ceph crash entry to be found.
mds.twix(mds.0): Client kira failing to respond to capability release client_id: 42244108
Found in the logs on oky:
...
2022-09-20T07:08:32.425026+0200 mds.twix (mds.0) 27 : cluster [WRN] client.42244108 isn't responding to mclientcaps(revoke), ino 0x1001ad1c5c2 pending pAsLsXsFr issued pAsLsXsFscr, sent 30.374676 seconds ago
2022-09-20T07:08:33.335865+0200 mgr.kira (mgr.42244276) 54315 : cluster [DBG] pgmap v54592: 193 pgs: 193 active+clean; 1.2 TiB data, 3.6 TiB used, 17 TiB / 21 TiB avail; 44 MiB/s rd, 1.9 MiB/s wr, 4.11k op/s
2022-09-20T07:08:33.604824+0200 mon.kira (mon.0) 161875 : cluster [WRN] Health check failed: 1 clients failing to respond to capability release (MDS_CLIENT_LATE_RELEASE)
2022-09-20T07:08:33.612719+0200 mon.kira (mon.0) 161876 : cluster [DBG] mds.0 [v2:10.137.99.3:6800/202654321,v1:10.137.99.3:6801/202654321] up:active
2022-09-20T07:08:33.612750+0200 mon.kira (mon.0) 161877 : cluster [DBG] fsmap cephfs:1 {0=twix=up:active} 1 up:standby-replay 1 up:standby
...
2022-09-20T07:09:02.460259+0200 mds.twix (mds.0) 28 : cluster [WRN] client.42244108 isn't responding to mclientcaps(revoke), ino 0x1001ad1c5c2 pending pAsLsXsFr issued pAsLsXsFscr, sent 60.409911 seconds ago
...
2022-09-20T07:10:00.000085+0200 mon.kira (mon.0) 162004 : cluster [WRN] Health detail: HEALTH_WARN 1 clients failing to respond to capability release
2022-09-20T07:10:00.000101+0200 mon.kira (mon.0) 162005 : cluster [WRN] [WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
2022-09-20T07:10:00.000106+0200 mon.kira (mon.0) 162006 : cluster [WRN] mds.twix(mds.0): Client kira failing to respond to capability release client_id: 42244108
2022-09-20T07:10:01.361599+0200 mgr.kira (mgr.42244276) 54359 : cluster [DBG] pgmap v54636: 193 pgs: 193 active+clean; 1.2 TiB data, 3.6 TiB used, 17 TiB / 21 TiB avail; 74 MiB/s rd, 3.2 MiB/s wr, 6.55k op/s
2022-09-20T07:10:02.693658+0200 mds.twix (mds.0) 29 : cluster [WRN] client.42244108 isn't responding to mclientcaps(revoke), ino 0x1001ad1c5c2 pending pAsLsXsFr issued pAsLsXsFscr, sent 120.643314 seconds ago
...
2022-09-20T07:12:02.791364+0200 mds.twix (mds.0) 30 : cluster [WRN] client.42244108 isn't responding to mclientcaps(revoke), ino 0x1001ad1c5c2 pending pAsLsXsFr issued pAsLsXsFscr, sent 240.741010 seconds ago
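For what it's worth, the repeated revoke warnings above all point at the same client and inode. A throwaway parser (a sketch assuming only the log line format quoted above, nothing Ceph-specific beyond it) can confirm that across a longer log and show how long each revoke has been pending:

```python
import re

# Field layout taken from the MDS warning lines quoted above,
# not from any formal log specification.
REVOKE_RE = re.compile(
    r"client\.(?P<client>\d+) isn't responding to mclientcaps\(revoke\), "
    r"ino (?P<ino>0x[0-9a-f]+) pending (?P<pending>\S+) issued (?P<issued>\S+), "
    r"sent (?P<sent>[\d.]+) seconds ago"
)

def stuck_caps(log_lines):
    """Map (client, inode) pairs to the longest observed revoke delay."""
    worst = {}  # (client, ino) -> max seconds since the revoke was sent
    for line in log_lines:
        m = REVOKE_RE.search(line)
        if m:
            key = (m["client"], m["ino"])
            worst[key] = max(worst.get(key, 0.0), float(m["sent"]))
    return worst

sample = [
    "2022-09-20T07:08:32.425026+0200 mds.twix (mds.0) 27 : cluster [WRN] "
    "client.42244108 isn't responding to mclientcaps(revoke), ino 0x1001ad1c5c2 "
    "pending pAsLsXsFr issued pAsLsXsFscr, sent 30.374676 seconds ago",
    "2022-09-20T07:12:02.791364+0200 mds.twix (mds.0) 30 : cluster [WRN] "
    "client.42244108 isn't responding to mclientcaps(revoke), ino 0x1001ad1c5c2 "
    "pending pAsLsXsFr issued pAsLsXsFscr, sent 240.741010 seconds ago",
]
print(stuck_caps(sample))  # {('42244108', '0x1001ad1c5c2'): 240.74101}
```

In this excerpt the delay only grows for a single (client, inode) pair, which matches the "1 clients failing to respond to capability release" health warning.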
This environment is used in production, but we might manage to allow ourselves some debugging on one of the nodes if needed.
Any pointers on how to help you help me would be a blessing
MDS are set up as one active + one standby-replay, no multi-active.
A private network is used for Ceph and Proxmox clustering; the hosting company doesn't give us many options anyway, but non-Ceph traffic is pretty light in the end.
Ask me whatever you want to know :)
[global]
    auth_client_required = cephx
    auth_cluster_required = cephx
    auth_service_required = cephx
    cluster_network = 10.137.99.1/24
    fsid = e7682293-7300-4929-bd36-3ff0e1907c90
    mon_allow_pool_delete = true
    mon_host = 10.137.99.2 10.137.99.10 10.137.99.3 10.137.99.4
    ms_bind_ipv4 = true
    ms_bind_ipv6 = false
    osd_pool_default_min_size = 2
    osd_pool_default_size = 2
    osd_scrub_auto_repair = true
    public_network = 10.137.99.1/24

[client]
    keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
    keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.kira]
    host = kira
    mds_standby_for_name = pve

[mds.oky]
    host = oky
    mds_standby_for_name = pve

[mds.twix]
    host = twix
    mds_standby_for_name = pve

[mon.kira]
    public_addr = 10.137.99.2

[mon.oky]
    public_addr = 10.137.99.4

[mon.stonks]
    public_addr = 10.137.99.10

[mon.twix]
    public_addr = 10.137.99.3
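As a side note, when chasing which mount actually holds the stuck caps, the JSON from `ceph tell mds.<name> session ls` can be filtered for sessions with an unusually high cap count. A minimal sketch, assuming the Pacific-era output fields `id`, `num_caps` and `client_metadata.hostname` (verify against your own output before relying on these names; the sample data below is made up for illustration):

```python
import json

def late_release_suspects(session_ls_json, cap_threshold=10000):
    """Return (client id, hostname, num_caps) for sessions holding many caps.

    Assumes the JSON shape produced by `ceph tell mds.<name> session ls`
    on Pacific; the field names are an assumption, not a guarantee.
    """
    sessions = json.loads(session_ls_json)
    suspects = [
        (s["id"], s.get("client_metadata", {}).get("hostname", "?"), s["num_caps"])
        for s in sessions
        if s.get("num_caps", 0) >= cap_threshold
    ]
    # Worst offenders first.
    return sorted(suspects, key=lambda t: -t[2])

# Hypothetical, abbreviated `session ls` output for illustration only:
sample = json.dumps([
    {"id": 42244108, "num_caps": 123456, "client_metadata": {"hostname": "kira"}},
    {"id": 42244109, "num_caps": 42, "client_metadata": {"hostname": "oky"}},
])
print(late_release_suspects(sample))  # [(42244108, 'kira', 123456)]
```

Once a session is confirmed stuck, `ceph tell mds.<name> client evict id=<id>` is the usual last resort, though with a kernel-client mount inside a container a node reboot (as described above) may still end up being necessary.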
Crashes with stack traces from dmesg:
https://gist.github.com/happyjaxx/3e5c13582e3609018ef24523e054791d
https://gist.github.com/happyjaxx/ccf2e58dbd0fb5e8f8c873110878a150
I'm going to monitor network stability, but it might be a shot in the dark.
Thanks in advance,
JaXX./.
Updated by Xiubo Li over 1 year ago
- Status changed from New to Duplicate
- Assignee set to Xiubo Li
- Parent task set to #56531
It's a known bug and I will check this today or this week.
Updated by Julien Banchet over 1 year ago
Xiubo Li wrote:
It's a known bug and I will check this today or this week.
Oh my! I did search for anything pre-existing, but I might not have used the best search terms...
Relieved to see it's being taken care of. I'm going to tell my client; I'm pretty sure they're going to be happy too :)
Thank you !
JaXX./.