Bug #56531

open

CephFS Mounts via Linux kernel not releasing locks

Added by Chris Pickett almost 2 years ago. Updated over 1 year ago.

Status: Need More Info
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 100%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

We are using Ubuntu 20.04.4 LTS and have several client-mounted CephFS filesystems. The Ceph cluster is configured with 3 active OSD/MON/MDS nodes and 1 MON/MGR node.

Linux lms-prod-01 5.13.0-1031-azure #37~20.04.1-Ubuntu x86_64 GNU/Linux

There is a bug somewhere in the kernel CephFS driver that, under some circumstances (possibly network-related), fails to release a lock when the MDS requests it. This sets off a chain reaction that leaves the FS unusable by any other system that has it mounted. The issue stays isolated to that particular FS.

Here is our SOP for getting things going again:

Ceph unhealthy - 1 clients failing to respond to capability release

To find the client mount responsible, run the following on the current FS controller:

> $ ceph health detail
> HEALTH_WARN 1 clients failing to respond to capability release
> [WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
> mds.fs_lms_prod.cephfs-cluster-01.yxptxa(mds.0): Client lms-prod-01:lms_prod failing to respond to capability release client_id: 404125
> 

The rogue client machine is listed along with the FS that is being held up (the session details can be cross-checked as shown below). At this point, all that really needs to happen is to unmount the FS on that client, which releases the lock.
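
To confirm which host and mount point the reported client_id belongs to, the MDS session list can be queried. A minimal sketch, reusing the mds name and client id from the health output above (they will differ per incident):

> $ ceph tell mds.fs_lms_prod.cephfs-cluster-01.yxptxa session ls
> # the JSON output lists each session's "id", hostname and mount point,
> # so the entry with "id": 404125 identifies the offending mount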

In theory, we should stop php7.X-fpm and apache2, unmount and remount the FS, and then start the same services again, roughly as sketched below.
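
A minimal sketch of that procedure, assuming the FS is mounted at /mnt/lms_prod and PHP-FPM 7.4 (both are examples; adjust to the actual mount point and service names):

> $ sudo systemctl stop php7.4-fpm apache2
> $ sudo umount /mnt/lms_prod
> $ sudo mount /mnt/lms_prod        # remounts using the existing /etc/fstab entry
> $ sudo systemctl start php7.4-fpm apache2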

However, in most cases the umount fails and reports that the target is busy. Under the pressure of getting it working again, I have resorted to restarting the target machine (first from the CLI and then, if needed, from Azure) and putting it in MAINT mode on the HAProxy instances. Once the system is offline, the FS and Ceph report healthy again within minutes.
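
For reference, a sketch of flipping the server to MAINT via the HAProxy runtime socket; the backend/server names and socket path are assumptions from our setup, not anything Ceph-specific:

> $ echo "set server lms_backend/lms-prod-01 state maint" | sudo socat stdio /run/haproxy/admin.sock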

I have attached the kernel logs which occur at the same time.

We can go weeks without this happening, and then it does; the last two days have both had incidents.
We are going to try switching to the userspace FUSE mounts, which might bring the version of libceph in use up a couple of notches compared to the in-kernel driver and possibly add some stability, but I don't know if that's the real answer (a rough example of such a mount is sketched below).
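
If we go that route, the mount would look roughly like the fstab entry below; the mount point and client name are examples carried over from above, not our confirmed configuration:

> none  /mnt/lms_prod  fuse.ceph  ceph.id=lms_prod,_netdev,defaults  0 0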

Is there anyone who can shed some light on what the problem is?


Files

CephFS Bug Kernel Stacktrace.txt (12.8 KB) - Chris Pickett, 07/12/2022 01:29 PM

Subtasks 1 (0 open, 1 closed)

CephFS - Bug #57882: Kernel Oops, kernel NULL pointer dereference (Duplicate) - Xiubo Li
