Bug #65607
mds: deadlock between the 'lookup' and 'rename/create, etc.' requests
Description
This was reported by Eric; for more detail please see https://www.mail-archive.com/ceph-users@ceph.io/msg24587.html
The lookup request was stuck waiting for the wrlock:
```
{
    "description": "client_request(client.139490:14003440 lookup #0x2000285540f/family-720-cons-5.fa.nsq 2024-04-18T12:44:36.546296+0000 caller_uid=30101, caller_gid=600{600,622,})",
    "initiated_at": "2024-04-18T12:44:36.548789+0000",
    "age": 28621.412714391001,
    "duration": 28621.412753429999,
    "type_data": {
        "flag_point": "failed to wrlock, waiting",
        "reqid": {
            "entity": { "type": "client", "num": 139490 },
            "tid": 14003440
        },
        "op_type": "client_request",
        "events": [
            { "time": "2024-04-18T12:44:36.548789+0000", "event": "initiated" },
            { "time": "2024-04-18T12:44:36.548790+0000", "event": "throttled" },
            { "time": "2024-04-18T12:44:36.548789+0000", "event": "header_read" },
            { "time": "2024-04-18T12:44:36.548794+0000", "event": "all_read" },
            { "time": "2024-04-18T12:44:36.551501+0000", "event": "dispatched" },
            { "time": "2024-04-18T12:44:36.555540+0000", "event": "acquired locks" },
            { "time": "2024-04-18T12:44:36.555589+0000", "event": "failed to rdlock, waiting" },
            { "time": "2024-04-18T12:44:36.558280+0000", "event": "failed to wrlock, waiting" }
        ],
        "locks": null
    }
},
```
From the above events we can see that the lookup request successfully acquired its locks in path_traverse(): handle_client_getattr() itself acquires no wrlock, so after the locks were successfully acquired in path_traverse() they must have been dropped, the lookup request was retried, and it then re-entered path_traverse() and got stuck.
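The acquire/drop/retry life cycle described above can be sketched as a toy model. All names here are illustrative only (the real logic lives in the MDS dispatcher, MDCache::path_traverse() and the lock acquisition machinery, none of which is reproduced here); the point is just the event sequence: a dispatch attempt first records "acquired locks", then, if a needed lock is held elsewhere, drops everything and records "failed to wrlock, waiting" before the request is queued for retry.

```cpp
#include <string>
#include <vector>

// Toy model (not Ceph code) of one request's dispatch attempts.
enum class LockState { Free, Held };

struct ToyRequest {
    std::vector<std::string> events;  // mirrors the op dump's "events" list
};

// One dispatch attempt: the traversal locks succeed, then the wrlock is
// tried. Returns true when the request completes, false when it must drop
// everything it acquired and wait for a retry.
bool dispatch_once(ToyRequest& req, LockState& traversal_lock,
                   LockState& wrlock) {
    traversal_lock = LockState::Held;           // path_traverse() succeeds
    req.events.push_back("acquired locks");
    if (wrlock == LockState::Held) {            // someone else holds it
        traversal_lock = LockState::Free;       // drop locks before waiting
        req.events.push_back("failed to wrlock, waiting");
        return false;                           // retry re-enters traversal
    }
    wrlock = LockState::Held;
    req.events.push_back("completed");
    return true;
}
```

On retry the model re-acquires the traversal locks from scratch, which matches why the stuck lookup shows "acquired locks" followed by the waiting events.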
The rename request also went into path_traverse() and successfully acquired the filelock and nestlock, but failed to acquire the xlock on dn->lock:
```cpp
    // mds/Server.cc, Server::rdlock_path_xlock_dentry()
    lov.add_wrlock(&cur->filelock);
    lov.add_wrlock(&cur->nestlock);
    if (rdlock_authlock)
      lov.add_rdlock(&cur->authlock);
  }
  lov.add_xlock(&dn->lock);
```
```
{
    "description": "client_request(mds.1:239710 rename #0x2000285540f/family-720-cons-5.fa.nsq #0x611/20002860f27 caller_uid=0, caller_gid=0{})",
    "initiated_at": "2024-04-18T12:44:36.552941+0000",
    "age": 28621.408561975,
    "duration": 28621.408643424002,
    "type_data": {
        "flag_point": "failed to xlock, waiting",
        "reqid": {
            "entity": { "type": "mds", "num": 1 },
            "tid": 239710
        },
        "op_type": "client_request",
        "events": [
            { "time": "2024-04-18T12:44:36.552941+0000", "event": "initiated" },
            { "time": "2024-04-18T12:44:36.552941+0000", "event": "throttled" },
            { "time": "2024-04-18T12:44:36.552941+0000", "event": "header_read" },
            { "time": "2024-04-18T12:44:36.552941+0000", "event": "all_read" },
            { "time": "2024-04-18T12:44:36.554035+0000", "event": "dispatched" },
            { "time": "2024-04-18T12:44:36.556007+0000", "event": "failed to wrlock, waiting" },
            { "time": "2024-04-18T12:44:36.557760+0000", "event": "failed to xlock, waiting" }
        ],
        "locks": null
    }
},
```
The "failed to xlock, waiting" event must come from path_traverse(), not from handle_client_rename(): if it came from the latter, there would have to be an "acquired locks" event (emitted in path_traverse()) before it.
So the previous lookup request must be waiting for either the filelock or the nestlock.
But I still couldn't figure out exactly where dn->lock is being held, whether in the lookup request or somewhere else.
Updated by Xiubo Li 12 days ago
This is possibly caused by a lock order issue like the one in https://tracker.ceph.com/issues/62123.
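For readers unfamiliar with lock-order inversions: the suspected pattern is the classic ABBA cycle, where one request acquires lock A and waits on lock B while another holds B and waits on A. Below is a minimal, hedged C++ analogy of the usual fix, acquiring every lock in one global canonical order (here, simply by address). This is an illustration only; it is not the MDS LockOpVec/SimpleLock machinery, and none of these names come from the Ceph code base.

```cpp
#include <algorithm>
#include <functional>
#include <mutex>
#include <vector>

// Sort the requested locks into one canonical order (by address) so every
// caller acquires them in the same sequence, making an ABBA cycle impossible.
std::vector<std::mutex*> canonical_order(std::vector<std::mutex*> locks) {
  std::sort(locks.begin(), locks.end(), std::less<std::mutex*>());
  return locks;
}

// Take every lock in canonical order, run the critical section, then
// release in reverse. Two requests built on this helper cannot deadlock on
// each other, regardless of the order their callers listed the locks.
template <typename Fn>
void with_locks(std::vector<std::mutex*> locks, Fn&& critical_section) {
  auto seq = canonical_order(std::move(locks));
  for (auto* m : seq) m->lock();
  critical_section();
  for (auto it = seq.rbegin(); it != seq.rend(); ++it) (*it)->unlock();
}
```

In the scenario above, the lookup and rename requests acquire overlapping locks in different orders across retries; a consistent global order (or dropping and reacquiring everything in order, as the retry path does) is the standard way to break such cycles.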