Bug #61717 (closed)

CephFS flock blocked on itself

Added by Niklas Hambuechen 11 months ago. Updated 9 months ago.

Status:
Can't reproduce
Priority:
Normal
Category:
Correctness/Safety
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS, kceph
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I had a situation where on the same machine, 30 processes were started almost simultaneously that took the same exclusive `flock()` on a CephFS file.

This resulted in all of the processes being stuck in the `flock()` syscall, as shown by strace:

flock(4</path/to/cephfs/mount/.my-lock>, LOCK_EX

None of the processes advanced.
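For context, the contention pattern can be sketched locally (a hypothetical reproduction, not the original workload: the lock path and worker count are stand-ins, and on a local filesystem the lock is granted normally rather than hanging as it did on CephFS):

```python
# Hypothetical reproduction sketch: N processes contend for the same
# exclusive flock() on one file. In the report, LOCK_PATH was a file on
# a kernel-mounted CephFS; /tmp is used here only as a stand-in.
import fcntl
import multiprocessing

LOCK_PATH = "/tmp/.my-lock"  # stand-in for the CephFS path in the report

def take_lock(i):
    with open(LOCK_PATH, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until the lock is granted
        # critical section would go here
        fcntl.flock(f, fcntl.LOCK_UN)
    return i

if __name__ == "__main__":
    # 30 near-simultaneous lockers, as in the report.
    with multiprocessing.Pool(30) as pool:
        results = pool.map(take_lock, range(30))
    # On a local filesystem all 30 serialize and complete; on the
    # affected CephFS setup they all wedged in the syscall instead.
    print(sorted(results))
```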

After killing all but one of them, the remaining process was still stuck in the syscall.
Running

ceph daemon "mds.$(hostname)" dump_blocked_ops

showed a corresponding entry:


        {
            "description": "client_request(client.2464309:84251 setfilelock rule 2, type 2, owner 16217184910939483155, pid 1509340, start 0, length 0, wait 1 #0x1001d5669f5 2023-06-16T17:04:49.997379+0000 caller_uid=1000, caller_gid=65534{})",
            "initiated_at": "2023-06-16T17:04:49.999396+0000",
            "age": 768.636335207,
            "duration": 768.63642045799998,
            "type_data": {
                "flag_point": "failed to add lock, waiting",
                "reqid": "client.2464309:84251",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.2464309",
                    "tid": 84251
                },
                "events": [
                    {
                        "time": "2023-06-16T17:04:49.999396+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2023-06-16T17:04:49.999397+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2023-06-16T17:04:49.999396+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2023-06-16T17:04:49.999400+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2023-06-16T17:04:49.999409+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2023-06-16T17:04:49.999417+0000",
                        "event": "acquired locks" 
                    },
                    {
                        "time": "2023-06-16T17:04:49.999423+0000",
                        "event": "failed to add lock, waiting" 
                    }
                ]
            }
        }
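When checking for this across daemons, entries like the one above can be filtered programmatically. A small sketch (the field names are taken from the entry above; the top-level `"ops"` wrapper is an assumption about the shape of the `dump_blocked_ops` output):

```python
import json

# Trimmed sample modeled on the dump above; in practice this JSON would
# come from `ceph daemon "mds.$(hostname)" dump_blocked_ops`.
sample = """
{"ops": [{"description": "client_request(client.2464309:84251 setfilelock rule 2, type 2)",
          "age": 768.636335207,
          "type_data": {"flag_point": "failed to add lock, waiting",
                        "reqid": "client.2464309:84251"}}]}
"""

def stuck_lock_ops(dump_json, min_age=60.0):
    """Return (reqid, age, flag_point) for long-blocked setfilelock requests."""
    dump = json.loads(dump_json)
    return [(op["type_data"]["reqid"], op["age"], op["type_data"]["flag_point"])
            for op in dump.get("ops", [])
            if op["age"] > min_age and "setfilelock" in op["description"]]

print(stuck_lock_ops(sample))
```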

I could work around the situation by also killing the last remaining process that was stuck in `flock()`.

So it looks like the `flock()` is somehow blocked on itself, which seems wrong.
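One possible mitigation (my sketch, not part of the original report) is to replace the indefinitely blocking `flock()` with a polling loop using `LOCK_NB`, so a lock the MDS never grants surfaces as a timeout error instead of a permanently stuck process:

```python
import errno
import fcntl
import time

def flock_with_timeout(f, timeout=30.0, interval=0.5):
    """Try to take an exclusive flock, polling with LOCK_NB.

    Raises TimeoutError instead of blocking forever if the lock never
    becomes available (e.g. if the MDS wedges as described above).
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return  # lock acquired
        except OSError as e:
            # EAGAIN/EACCES mean "lock held elsewhere"; anything else is real.
            if e.errno not in (errno.EAGAIN, errno.EACCES):
                raise
        if time.monotonic() >= deadline:
            raise TimeoutError("flock not acquired within %.1fs" % timeout)
        time.sleep(interval)
```

This trades the uninterruptible wait inside the syscall for userspace retries, which at least lets the application fail loudly and get restarted.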

Ceph 16.2.7, Linux 5.15.109 (kernel CephFS mount)
