Bug #65607
Updated by Xiubo Li 26 days ago
This is reported by Eric, more detail please see https://www.mail-archive.com/ceph-users@ceph.io/msg24587.html
The *lookup* request was stuck by waiting the *wrlock*:
<pre>
{
"description": "client_request(client.139490:14003440 lookup #0x2000285540f/family-720-cons-5.fa.nsq 2024-04-18T12:44:36.546296+0000 caller_uid=30101, caller_gid=600{600,622,})",
"initiated_at": "2024-04-18T12:44:36.548789+0000",
"age": 28621.412714391001,
"duration": 28621.412753429999,
"type_data": {
"flag_point": "failed to wrlock, waiting",
"reqid": {
"entity": {
"type": "client",
"num": 139490
},
"tid": 14003440
},
"op_type": "client_request",
"events": [
{
"time": "2024-04-18T12:44:36.548789+0000",
"event": "initiated"
},
{
"time": "2024-04-18T12:44:36.548790+0000",
"event": "throttled"
},
{
"time": "2024-04-18T12:44:36.548789+0000",
"event": "header_read"
},
{
"time": "2024-04-18T12:44:36.548794+0000",
"event": "all_read"
},
{
"time": "2024-04-18T12:44:36.551501+0000",
"event": "dispatched"
},
{
"time": "2024-04-18T12:44:36.555540+0000",
"event": "acquired locks"
},
{
"time": "2024-04-18T12:44:36.555589+0000",
"event": "failed to rdlock, waiting"
},
{
"time": "2024-04-18T12:44:36.558280+0000",
"event": "failed to wrlock, waiting"
}
],
"locks": null
}
},
</pre>
From the above events we can see that the *lookup* request has successfully acquired the locks in the *path_traverse()*, because in the *handle_client_getattr()* there has not *wrlock* need to be acquired, so after the locks being successfully acquired in *path_traverse()*, the locks must be *dropped* and the *lookup* request was retried and then went into the *path_traverse()` again and stuck.
While for the *rename* request it also went into the *path_traverse()* and have successfully acquired the *filelock* and *nestlock*, but failed to acquire the *xlock of dn->lock*:
<pre>
8474 lov.add_wrlock(&cur->filelock);
8475 lov.add_wrlock(&cur->nestlock);
8476 if (rdlock_authlock)
8477 lov.add_rdlock(&cur->authlock);
8478 }
8479 lov.add_xlock(&dn->lock);
</pre>
<pre>
{
"description": "client_request(mds.1:239710 rename #0x2000285540f/family-720-cons-5.fa.nsq #0x611/20002860f27 caller_uid=0, caller_gid=0{})",
"initiated_at": "2024-04-18T12:44:36.552941+0000",
"age": 28621.408561975,
"duration": 28621.408643424002,
"type_data": {
"flag_point": "failed to xlock, waiting",
"reqid": {
"entity": {
"type": "mds",
"num": 1
},
"tid": 239710
},
"op_type": "client_request",
"events": [
{
"time": "2024-04-18T12:44:36.552941+0000",
"event": "initiated"
},
{
"time": "2024-04-18T12:44:36.552941+0000",
"event": "throttled"
},
{
"time": "2024-04-18T12:44:36.552941+0000",
"event": "header_read"
},
{
"time": "2024-04-18T12:44:36.552941+0000",
"event": "all_read"
},
{
"time": "2024-04-18T12:44:36.554035+0000",
"event": "dispatched"
},
{
"time": "2024-04-18T12:44:36.556007+0000",
"event": "failed to wrlock, waiting"
},
{
"time": "2024-04-18T12:44:36.557760+0000",
"event": "failed to xlock, waiting"
}
],
"locks": null
}
},
</pre>
The *"event": "failed to xlock, waiting"* event must be from the *path_traverse()*, not from the *handle_client_rename()*, because if so there must be a *"event": "acquired locks"* event, which is done in the *path_traverse()*, before this.
So the previous *lookup* request must be waiting for any of *filelock* and *nestlock*.
But I still couldn't figure out exactly where is holding the *dn->lock* in the *lookup* request or somewhere else.