Bug #65607

mds deadlock between 'lookup' and the 'rename/create, etc' requests

Added by Xiubo Li 12 days ago. Updated 12 days ago.

Status:
Need More Info
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
quincy,reef,squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This was reported by Eric; for more detail, please see https://www.mail-archive.com/ceph-users@ceph.io/msg24587.html

The lookup request was stuck waiting for the wrlock:

        {
            "description": "client_request(client.139490:14003440 lookup #0x2000285540f/family-720-cons-5.fa.nsq 2024-04-18T12:44:36.546296+0000 caller_uid=30101, caller_gid=600{600,622,})",
            "initiated_at": "2024-04-18T12:44:36.548789+0000",                                                                                                                                                                                                
            "age": 28621.412714391001,
            "duration": 28621.412753429999,
            "type_data": {
                "flag_point": "failed to wrlock, waiting",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 139490
                    },
                    "tid": 14003440
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-04-18T12:44:36.548789+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.548790+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.548789+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.548794+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.551501+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.555540+0000",
                        "event": "acquired locks" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.555589+0000",
                        "event": "failed to rdlock, waiting" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.558280+0000",
                        "event": "failed to wrlock, waiting" 
                    }
                ],
                "locks": null
            }
        },                            

From the above events we can see that the lookup request had successfully acquired its locks in path_traverse() (the "acquired locks" event). Because handle_client_getattr() does not need to acquire any wrlock, the later "failed to wrlock, waiting" cannot have come from it: the locks acquired in path_traverse() must have been dropped, the lookup request retried, and on the retry it went into path_traverse() again and got stuck.
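To illustrate the drop-and-retry behaviour described above, here is a minimal, self-contained C++ sketch of an "acquire every lock or drop what was taken and wait" helper. It is not Ceph code and all names in it are made up; the real MDS logic (in the Locker) is far more involved.

    // Minimal, self-contained illustration only; NOT Ceph MDS code.
    // A request tries to take every lock it needs. If any lock is already
    // held, it releases whatever it managed to take and reports failure so
    // the caller can queue the request and retry later (the point at which
    // an event like "failed to wrlock, waiting" would be recorded).
    #include <iostream>
    #include <string>
    #include <vector>

    struct ToyLock {
      std::string name;
      bool wrlocked = false;                       // held by some other request?
      bool try_wrlock() { return wrlocked ? false : (wrlocked = true); }
      void unlock() { wrlocked = false; }
    };

    bool acquire_all(std::vector<ToyLock*>& wanted) {
      std::vector<ToyLock*> taken;
      for (auto* l : wanted) {
        if (!l->try_wrlock()) {
          for (auto* t : taken) t->unlock();       // drop partial progress
          std::cout << "failed to wrlock " << l->name << ", waiting\n";
          return false;
        }
        taken.push_back(l);
      }
      return true;
    }

    int main() {
      ToyLock filelock{"filelock"}, nestlock{"nestlock"};
      nestlock.wrlocked = true;                    // pretend another request holds it
      std::vector<ToyLock*> wanted{&filelock, &nestlock};
      if (!acquire_all(wanted))
        std::cout << "request queued for retry\n"; // it will re-enter the traverse later
    }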

The rename request also went into path_traverse() and successfully acquired the filelock and nestlock wrlocks, but then failed to acquire the xlock on dn->lock:

8474             lov.add_wrlock(&cur->filelock);
8475             lov.add_wrlock(&cur->nestlock);
8476             if (rdlock_authlock)
8477               lov.add_rdlock(&cur->authlock);
8478           }
8479           lov.add_xlock(&dn->lock);

        {
            "description": "client_request(mds.1:239710 rename #0x2000285540f/family-720-cons-5.fa.nsq #0x611/20002860f27 caller_uid=0, caller_gid=0{})",
            "initiated_at": "2024-04-18T12:44:36.552941+0000",
            "age": 28621.408561975,
            "duration": 28621.408643424002,
            "type_data": {
                "flag_point": "failed to xlock, waiting",
                "reqid": {
                    "entity": {
                        "type": "mds",
                        "num": 1
                    },
                    "tid": 239710
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-04-18T12:44:36.552941+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.552941+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.552941+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.552941+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.554035+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.556007+0000",
                        "event": "failed to wrlock, waiting" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.557760+0000",                                                                                                                                                                                            
                        "event": "failed to xlock, waiting" 
                    }
                ],
                "locks": null
            }
        },

The "event": "failed to xlock, waiting" event must be from the path_traverse(), not from the handle_client_rename(), because if so there must be a "event": "acquired locks" event, which is done in the path_traverse(), before this.

So the previous lookup request must be waiting on either the filelock or the nestlock, which the rename request now holds wrlocked.

But I still couldn't figure out exactly what is holding dn->lock: the lookup request itself or something else.
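If the lookup (or a peer MDS acting on its behalf) really is the holder of dn->lock, the two requests would form the classic circular wait sketched below. This is only a hypothesis expressed as self-contained C++; the request names mirror the dumps above, nothing else is taken from Ceph.

    // Hypothetical wait-for graph for the suspected deadlock; illustration only.
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    using WaitFor = std::map<std::string, std::string>;   // waiter -> lock holder

    // Walk waiter -> holder edges; revisiting a request means the set of
    // requests waits on itself and none of them can ever make progress.
    bool has_cycle(const WaitFor& g, std::string node) {
      std::set<std::string> seen;
      while (g.count(node)) {
        if (!seen.insert(node).second) return true;
        node = g.at(node);
      }
      return false;
    }

    int main() {
      WaitFor g = {
        // lookup blocked on the filelock/nestlock wrlock held by the rename
        {"lookup client.139490:14003440", "rename mds.1:239710"},
        // rename blocked on the dn->lock xlock, *assumed* here to be held on
        // behalf of the lookup (this is exactly the unknown in the report)
        {"rename mds.1:239710", "lookup client.139490:14003440"},
      };
      std::cout << (has_cycle(g, "lookup client.139490:14003440")
                      ? "circular wait: deadlock\n" : "no cycle\n");
    }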

Actions #1

Updated by Xiubo Li 12 days ago

  • Description updated (diff)

Actions #2

Updated by Xiubo Li 12 days ago

As Erich mentioned, he enabled multiple active MDSs, but he only uploaded the blocked ops from one MDS. I guess maybe another MDS was holding the dn->lock?

Actions #3

Updated by Xiubo Li 12 days ago

  • Status changed from In Progress to Need More Info

I have suggested that Erich set max_mds = 1 and reproduce it again, to get rid of the noise.

Actions #4

Updated by Xiubo Li 12 days ago

This is possibly caused by the same kind of lock order issue as in https://tracker.ceph.com/issues/62123.
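For context, lock order problems of this kind disappear when every request takes its locks in one globally consistent order. The snippet below is a generic C++ illustration of that principle (using std::scoped_lock's deadlock-avoidance), not the actual Ceph fix and not the code from issue 62123; the mutex names are hypothetical stand-ins.

    // Generic illustration of deadlock avoidance via consistent lock ordering;
    // this is not MDS code.
    #include <iostream>
    #include <mutex>
    #include <thread>

    std::mutex dentry_lock;   // stands in for the dn->lock xlock
    std::mutex filelock;      // stands in for the filelock/nestlock wrlocks

    void request_a() {
      // std::scoped_lock locks both mutexes with a deadlock-avoidance
      // algorithm, so two requests can never each hold one lock while
      // waiting for the other.
      std::scoped_lock both(dentry_lock, filelock);
      std::cout << "request A holds both locks\n";
    }

    void request_b() {
      std::scoped_lock both(filelock, dentry_lock);  // different textual order is still safe
      std::cout << "request B holds both locks\n";
    }

    int main() {
      std::thread a(request_a), b(request_b);
      a.join();
      b.join();
    }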
