Bug #12209
closedCephFS should have a complete timeout mechanism to avoid endless waiting or unpredictable errors
0%
Description
Recently, when we made pressure test on cephfs through ceph-dokan using in windows, there are always some ceph-dokan processes crash more or less. After diving into the problem, we found the cause of the crash, and the scenario is described as below:
1. In some test case, the dokan driver make getattr request to the ceph-dokan process in userspace.
2. Then the Client::make_request is invoked and send the request to MDS.
3. For some reason, the MDS is very busy and can not respond to the client for a long time.
4. There is a timeout mechanism for IO request in dokan, it is 120s, if reached the timeout, the request will be killed, which will cause close operation and release related resources(Fh, Inode...).
5. But at some time after 120s, MDS give the client a response for the getattr request, and the Client got signaled and do following work. But but some related variable had already been released, so it will cause some null pointer error, and the program will crash.
After studying the CephFS code, we found that there are some inappropriate code in Client::make_request method it is:
caller_cond.Wait(client_lock);
This will lead the execution waiting until the overwhelmed MDS respond to the very-old request. And the timeout handling code would never be reached.
if (!request->reply) { assert(request->aborted); assert(!request->got_unsafe); request->item.remove_myself(); mds_requests.erase(tid); put_request(request); // request map's put_request(request); // ours return -ETIMEDOUT; }
So, I think we should use timeout wait instead.
caller_cond.WaitInterval(cct, client_lock, some_configured_time);
I have attached a patch for this issue, please feel free to let me know, if any problem.
Files