Bug #61717 (closed): CephFS flock blocked on itself

Added by Niklas Hambuechen 11 months ago. Updated 9 months ago.

Status: Can't reproduce
Priority: Normal
Category: Correctness/Safety
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS, kceph
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I had a situation where, on the same machine, 30 processes were started almost simultaneously, each taking the same exclusive `flock()` on a CephFS file.

This resulted in all of the processes being stuck in the flock syscall, as shown by strace:

flock(4</path/to/cephfs/mount/.my-lock>, LOCK_EX

None of the processes advanced.
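
For illustration, a minimal sketch of what each of the 30 processes was effectively doing (this is not the actual application code; the lock path and the work done under the lock are placeholders):

    import fcntl
    import time

    # Hypothetical lock file on the CephFS mount, matching the strace above.
    LOCK_PATH = "/path/to/cephfs/mount/.my-lock"

    def do_work_under_lock():
        with open(LOCK_PATH, "w") as f:
            # Blocking exclusive lock; this is the flock(fd, LOCK_EX) call
            # that all of the processes were stuck in.
            fcntl.flock(f, fcntl.LOCK_EX)
            try:
                time.sleep(1)  # placeholder for the real critical section
            finally:
                fcntl.flock(f, fcntl.LOCK_UN)  # also released when f is closed

    if __name__ == "__main__":
        do_work_under_lock()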

After killing all but one of them, the remaining process was still stuck in the syscall.
Running

ceph daemon "mds.$(hostname)" dump_blocked_ops

showed a corresponding entry


        {
            "description": "client_request(client.2464309:84251 setfilelock rule 2, type 2, owner 16217184910939483155, pid 1509340, start 0, length 0, wait 1 #0x1001d5669f5 2023-06-16T17:04:49.997379+0000 caller_uid=1000, caller_gid=65534{})",
            "initiated_at": "2023-06-16T17:04:49.999396+0000",
            "age": 768.636335207,
            "duration": 768.63642045799998,
            "type_data": {
                "flag_point": "failed to add lock, waiting",
                "reqid": "client.2464309:84251",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.2464309",
                    "tid": 84251
                },
                "events": [
                    {
                        "time": "2023-06-16T17:04:49.999396+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2023-06-16T17:04:49.999397+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2023-06-16T17:04:49.999396+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2023-06-16T17:04:49.999400+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2023-06-16T17:04:49.999409+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2023-06-16T17:04:49.999417+0000",
                        "event": "acquired locks" 
                    },
                    {
                        "time": "2023-06-16T17:04:49.999423+0000",
                        "event": "failed to add lock, waiting" 
                    }
                ]
            }
        }
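
As an aside, the blocked entry can also be picked out programmatically. A minimal sketch, assuming the same admin-socket command as above, that dump_blocked_ops wraps its entries in an "ops" array, and that the MDS name follows the hostname as in the command above:

    import json
    import socket
    import subprocess

    # MDS admin socket target, as in the "ceph daemon" command above.
    MDS_NAME = "mds." + socket.gethostname()

    def blocked_setfilelock_ops():
        # Run dump_blocked_ops and parse the JSON output.
        out = subprocess.check_output(["ceph", "daemon", MDS_NAME, "dump_blocked_ops"])
        ops = json.loads(out).get("ops", [])
        # Keep only file-lock requests that are stuck waiting to add a lock.
        return [
            op for op in ops
            if "setfilelock" in op.get("description", "")
            and op.get("type_data", {}).get("flag_point") == "failed to add lock, waiting"
        ]

    if __name__ == "__main__":
        for op in blocked_setfilelock_ops():
            print("%.0fs blocked: %s" % (op["age"], op["description"]))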

I could work around the situation by also manually killing the last remaining process that was stuck in `flock()`.

So this looks like the `flock()` is somehow blocked on itself.
That seems wrong.

Ceph 16.2.7, Linux 5.15.109 kernel mount

Actions #1

Updated by Milind Changire 11 months ago

  • Assignee set to Milind Changire

Need more Ceph logs and the app workflow on this one to understand the situation better.

Actions #2

Updated by Dhairya Parmar 11 months ago

flock() follows the advisory locking mechanism; could enforcing mandatory locking by mounting CephFS with the "mand" option help avoid this?

Actions #3

Updated by Niklas Hambuechen 11 months ago

Milind Changire wrote:

Need more Ceph logs and the app workflow on this one to understand the situation better.

This will be difficult for me to provide: it happened on a production cluster, and only once so far, so I had to unblock it using the workaround described above.
I can provide logs that are still around though; which ones would help?

Actions #4

Updated by Niklas Hambuechen 11 months ago

Dhairya Parmar wrote:

flock() follows the advisory locking mechanism; could enforcing mandatory locking by mounting CephFS with the "mand" option help avoid this?

If I read https://docs.ceph.com/en/quincy/cephfs/ceph-dokan/#limitations correctly, CephFS generally doesn't support mandatory locks.

So this doesn't seem to be an option.

Actions #5

Updated by Greg Farnum 9 months ago

  • Status changed from New to Can't reproduce

I think there must be more going on here than is understood. The MDS is blocked on getting some other internal locks before processing this request, and while being blocked for 700+ seconds is a looong time, we've no evidence from here of what it is.

I guess it's remotely possible that somehow the kernel client failed to drop some file capabilities required for this operation to succeed? But my guess is that the 30 processes simultaneously locking is just a red herring — particularly since it appears to have turned into a single MDS request (as it should have!).

Actions #6

Updated by Venky Shankar 9 months ago

Greg Farnum wrote:

I think there must be more going on here than is understood. The MDS is blocked on getting some other internal locks before processing this request, and while being blocked for 700+ seconds is a looong time, we've no evidence from here of what it is.

I guess it's remotely possible that somehow the kernel client failed to drop some file capabilities required for this operation to succeed? But my guess is that the 30 processes simultaneously locking is just a red herring — particularly since it appears to have turned into a single MDS request (as it should have!).

So, when this bug was triaged last week, Milind was still investigating the issue. The setfilelock being listed in the blocked ops is fine, since it's blocked because the file is already locked by another client. The perceived deadlock is probably not due to the flock call itself, but could be due to some other op (we recently found a deadlock between create and getattr).
