Bug #65607

mds deadlock between 'lookup' and the 'rename/create, etc' requests

Added by Xiubo Li 12 days ago. Updated 12 days ago.

Status:
Need More Info
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
quincy,reef,squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This was reported by Eric; for more detail, please see https://www.mail-archive.com/ceph-users@ceph.io/msg24587.html

The lookup request was stuck waiting for the wrlock:

        {
            "description": "client_request(client.139490:14003440 lookup #0x2000285540f/family-720-cons-5.fa.nsq 2024-04-18T12:44:36.546296+0000 caller_uid=30101, caller_gid=600{600,622,})",
            "initiated_at": "2024-04-18T12:44:36.548789+0000",                                                                                                                                                                                                
            "age": 28621.412714391001,
            "duration": 28621.412753429999,
            "type_data": {
                "flag_point": "failed to wrlock, waiting",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 139490
                    },
                    "tid": 14003440
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-04-18T12:44:36.548789+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.548790+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.548789+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.548794+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.551501+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.555540+0000",
                        "event": "acquired locks" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.555589+0000",
                        "event": "failed to rdlock, waiting" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.558280+0000",
                        "event": "failed to wrlock, waiting" 
                    }
                ],
                "locks": null
            }
        },                            

From the above events we can see that the lookup request had successfully acquired its locks in path_traverse() (the "acquired locks" event). Because handle_client_getattr() does not need to acquire any wrlock, the later "failed to wrlock, waiting" cannot have come from it: the locks acquired in path_traverse() must have been dropped, the lookup request retried, and on the retry it went into path_traverse() again and got stuck.
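To illustrate the drop-and-retry behaviour described above, here is a minimal, self-contained C++ sketch of an "acquire every lock or drop what was taken and wait" helper. It is not Ceph code and all names in it are made up; the real MDS logic (in the Locker) is far more involved.

    // Minimal, self-contained illustration only; NOT Ceph MDS code.
    // A request tries to take every lock it needs. If any lock is already
    // held, it releases whatever it managed to take and reports failure so
    // the caller can queue the request and retry later (the point at which
    // an event like "failed to wrlock, waiting" would be recorded).
    #include <iostream>
    #include <string>
    #include <vector>

    struct ToyLock {
      std::string name;
      bool wrlocked = false;                       // held by some other request?
      bool try_wrlock() { return wrlocked ? false : (wrlocked = true); }
      void unlock() { wrlocked = false; }
    };

    bool acquire_all(std::vector<ToyLock*>& wanted) {
      std::vector<ToyLock*> taken;
      for (auto* l : wanted) {
        if (!l->try_wrlock()) {
          for (auto* t : taken) t->unlock();       // drop partial progress
          std::cout << "failed to wrlock " << l->name << ", waiting\n";
          return false;
        }
        taken.push_back(l);
      }
      return true;
    }

    int main() {
      ToyLock filelock{"filelock"}, nestlock{"nestlock"};
      nestlock.wrlocked = true;                    // pretend another request holds it
      std::vector<ToyLock*> wanted{&filelock, &nestlock};
      if (!acquire_all(wanted))
        std::cout << "request queued for retry\n"; // it will re-enter the traverse later
    }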

The rename request also went into path_traverse() and successfully acquired the filelock and nestlock wrlocks, but then failed to acquire the xlock on dn->lock:

8474             lov.add_wrlock(&cur->filelock);
8475             lov.add_wrlock(&cur->nestlock);
8476             if (rdlock_authlock)
8477               lov.add_rdlock(&cur->authlock);
8478           }
8479           lov.add_xlock(&dn->lock);

        {
            "description": "client_request(mds.1:239710 rename #0x2000285540f/family-720-cons-5.fa.nsq #0x611/20002860f27 caller_uid=0, caller_gid=0{})",
            "initiated_at": "2024-04-18T12:44:36.552941+0000",
            "age": 28621.408561975,
            "duration": 28621.408643424002,
            "type_data": {
                "flag_point": "failed to xlock, waiting",
                "reqid": {
                    "entity": {
                        "type": "mds",
                        "num": 1
                    },
                    "tid": 239710
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-04-18T12:44:36.552941+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.552941+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.552941+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.552941+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.554035+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.556007+0000",
                        "event": "failed to wrlock, waiting" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.557760+0000",                                                                                                                                                                                            
                        "event": "failed to xlock, waiting" 
                    }
                ],
                "locks": null
            }
        },

The "event": "failed to xlock, waiting" event must be from the path_traverse(), not from the handle_client_rename(), because if so there must be a "event": "acquired locks" event, which is done in the path_traverse(), before this.

So the previous lookup request must be waiting on either the filelock or the nestlock, which the rename request now holds wrlocked.

But I still couldn't figure out exactly what is holding dn->lock: the lookup request itself or something else.
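If the lookup (or a peer MDS acting on its behalf) really is the holder of dn->lock, the two requests would form the classic circular wait sketched below. This is only a hypothesis expressed as self-contained C++; the request names mirror the dumps above, nothing else is taken from Ceph.

    // Hypothetical wait-for graph for the suspected deadlock; illustration only.
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    using WaitFor = std::map<std::string, std::string>;   // waiter -> lock holder

    // Walk waiter -> holder edges; revisiting a request means the set of
    // requests waits on itself and none of them can ever make progress.
    bool has_cycle(const WaitFor& g, std::string node) {
      std::set<std::string> seen;
      while (g.count(node)) {
        if (!seen.insert(node).second) return true;
        node = g.at(node);
      }
      return false;
    }

    int main() {
      WaitFor g = {
        // lookup blocked on the filelock/nestlock wrlock held by the rename
        {"lookup client.139490:14003440", "rename mds.1:239710"},
        // rename blocked on the dn->lock xlock, *assumed* here to be held on
        // behalf of the lookup (this is exactly the unknown in the report)
        {"rename mds.1:239710", "lookup client.139490:14003440"},
      };
      std::cout << (has_cycle(g, "lookup client.139490:14003440")
                      ? "circular wait: deadlock\n" : "no cycle\n");
    }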

Actions #1

Updated by Xiubo Li 12 days ago

  • Description updated (diff)

Actions #2

Updated by Xiubo Li 12 days ago

As Erich mentioned, he enabled multiple active MDSs, but he only uploaded the blocked ops from one MDS. I guess maybe another MDS was holding the dn->lock?

Actions #3

Updated by Xiubo Li 12 days ago

  • Status changed from In Progress to Need More Info

I have suggested that Erich set max_mds = 1 and reproduce it again, to get rid of the noise.

Actions #4

Updated by Xiubo Li 12 days ago

This is possibly caused by the same kind of lock order issue as in https://tracker.ceph.com/issues/62123.
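For context, lock order problems of this kind disappear when every request takes its locks in one globally consistent order. The snippet below is a generic C++ illustration of that principle (using std::scoped_lock's deadlock-avoidance), not the actual Ceph fix and not the code from issue 62123; the mutex names are hypothetical stand-ins.

    // Generic illustration of deadlock avoidance via consistent lock ordering;
    // this is not MDS code.
    #include <iostream>
    #include <mutex>
    #include <thread>

    std::mutex dentry_lock;   // stands in for the dn->lock xlock
    std::mutex filelock;      // stands in for the filelock/nestlock wrlocks

    void request_a() {
      // std::scoped_lock locks both mutexes with a deadlock-avoidance
      // algorithm, so two requests can never each hold one lock while
      // waiting for the other.
      std::scoped_lock both(dentry_lock, filelock);
      std::cout << "request A holds both locks\n";
    }

    void request_b() {
      std::scoped_lock both(filelock, dentry_lock);  // different textual order is still safe
      std::cout << "request B holds both locks\n";
    }

    int main() {
      std::thread a(request_a), b(request_b);
      a.join();
      b.join();
    }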
