Project

General

Profile

Bug #65607

Updated by Xiubo Li 26 days ago

This is reported by Eric, more detail please see https://www.mail-archive.com/ceph-users@ceph.io/msg24587.html 

 The *lookup* request was stuck by waiting the *wrlock*: 

 <pre> 
         { 
             "description": "client_request(client.139490:14003440 lookup #0x2000285540f/family-720-cons-5.fa.nsq 2024-04-18T12:44:36.546296+0000 caller_uid=30101, caller_gid=600{600,622,})", 
             "initiated_at": "2024-04-18T12:44:36.548789+0000",                                                                                                                                                                                                 
             "age": 28621.412714391001, 
             "duration": 28621.412753429999, 
             "type_data": { 
                 "flag_point": "failed to wrlock, waiting", 
                 "reqid": { 
                     "entity": { 
                         "type": "client", 
                         "num": 139490 
                     }, 
                     "tid": 14003440 
                 }, 
                 "op_type": "client_request", 
                 "events": [ 
                     { 
                         "time": "2024-04-18T12:44:36.548789+0000", 
                         "event": "initiated" 
                     }, 
                     { 
                         "time": "2024-04-18T12:44:36.548790+0000", 
                         "event": "throttled" 
                     }, 
                     { 
                         "time": "2024-04-18T12:44:36.548789+0000", 
                         "event": "header_read" 
                     }, 
                     { 
                         "time": "2024-04-18T12:44:36.548794+0000", 
                         "event": "all_read" 
                     }, 
                     { 
                         "time": "2024-04-18T12:44:36.551501+0000", 
                         "event": "dispatched" 
                     }, 
                     { 
                         "time": "2024-04-18T12:44:36.555540+0000", 
                         "event": "acquired locks" 
                     }, 
                     { 
                         "time": "2024-04-18T12:44:36.555589+0000", 
                         "event": "failed to rdlock, waiting" 
                     }, 
                     { 
                         "time": "2024-04-18T12:44:36.558280+0000", 
                         "event": "failed to wrlock, waiting" 
                     } 
                 ], 
                 "locks": null 
             } 
         },                             
 </pre> 

 From the above events we can see that the *lookup* request has successfully acquired the locks in the *path_traverse()*, because in the *handle_client_getattr()* there has not *wrlock* need to be acquired, so after the locks being successfully acquired in *path_traverse()*, the locks must be *dropped* and the *lookup* request was retried and then went into the *path_traverse()` again and stuck. 


 While for the *rename* request it also went into the *path_traverse()* and have successfully acquired the *filelock* and *nestlock*, but failed to acquire the *xlock of dn->lock*: 

 <pre> 
 8474               lov.add_wrlock(&cur->filelock); 
 8475               lov.add_wrlock(&cur->nestlock); 
 8476               if (rdlock_authlock) 
 8477                 lov.add_rdlock(&cur->authlock); 
 8478             } 
 8479             lov.add_xlock(&dn->lock); 
 </pre> 

 <pre> 
         { 
             "description": "client_request(mds.1:239710 rename #0x2000285540f/family-720-cons-5.fa.nsq #0x611/20002860f27 caller_uid=0, caller_gid=0{})", 
             "initiated_at": "2024-04-18T12:44:36.552941+0000", 
             "age": 28621.408561975, 
             "duration": 28621.408643424002, 
             "type_data": { 
                 "flag_point": "failed to xlock, waiting", 
                 "reqid": { 
                     "entity": { 
                         "type": "mds", 
                         "num": 1 
                     }, 
                     "tid": 239710 
                 }, 
                 "op_type": "client_request", 
                 "events": [ 
                     { 
                         "time": "2024-04-18T12:44:36.552941+0000", 
                         "event": "initiated" 
                     }, 
                     { 
                         "time": "2024-04-18T12:44:36.552941+0000", 
                         "event": "throttled" 
                     }, 
                     { 
                         "time": "2024-04-18T12:44:36.552941+0000", 
                         "event": "header_read" 
                     }, 
                     { 
                         "time": "2024-04-18T12:44:36.552941+0000", 
                         "event": "all_read" 
                     }, 
                     { 
                         "time": "2024-04-18T12:44:36.554035+0000", 
                         "event": "dispatched" 
                     }, 
                     { 
                         "time": "2024-04-18T12:44:36.556007+0000", 
                         "event": "failed to wrlock, waiting" 
                     }, 
                     { 
                         "time": "2024-04-18T12:44:36.557760+0000",                                                                                                                                                                                             
                         "event": "failed to xlock, waiting" 
                     } 
                 ], 
                 "locks": null 
             } 
         }, 
 </pre> 


 The *"event": "failed to xlock, waiting"* event must be from the *path_traverse()*, not from the *handle_client_rename()*, because if so there must be a *"event": "acquired locks"* event, which is done in the *path_traverse()*, before this. 

 So the previous *lookup* request must be waiting for any of *filelock* and *nestlock*. 

 But I still couldn't figure out exactly where is holding the *dn->lock* in the *lookup* request or somewhere else.

Back