Bug #12209

closed

CephFS should have a complete timeout mechanism to avoid endless waiting or unpredictable errors

Added by Wenjun Huang almost 9 years ago. Updated almost 8 years ago.

Status:
Won't Fix
Priority:
Normal
Assignee:
-
Category:
Administration/Usability
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Recently, while running stress tests on CephFS through ceph-dokan on Windows, some ceph-dokan processes would always crash. After diving into the problem, we found the cause of the crash; the scenario is as follows:
1. In some test cases, the Dokan driver sends a getattr request to the ceph-dokan process in userspace.
2. Client::make_request is then invoked and sends the request to the MDS.
3. For some reason, the MDS is very busy and cannot respond to the client for a long time.
4. Dokan has a 120 s timeout for IO requests; when it is reached, the request is killed, which triggers a close operation and releases the related resources (Fh, Inode...).
5. Some time after the 120 s, the MDS finally responds to the getattr request. The client is signaled and continues its work, but some of the related variables have already been released, so a null-pointer error occurs and the program crashes.

After studying the CephFS code, we found some inappropriate code in the Client::make_request method, namely:

caller_cond.Wait(client_lock);

This makes execution wait until the overwhelmed MDS responds to the very old request, so the timeout-handling code below is never reached:
if (!request->reply) {
    assert(request->aborted);
    assert(!request->got_unsafe);
    request->item.remove_myself();
    mds_requests.erase(tid);
    put_request(request);  // request map's
    put_request(request);  // ours
    return -ETIMEDOUT;
}

So I think we should use a timed wait instead:
caller_cond.WaitInterval(cct, client_lock, some_configured_time);

I have attached a patch for this issue. Please feel free to let me know if there are any problems.


Files

client_add_timeout_wait.patch (1.81 KB) - Wenjun Huang, 07/03/2015 05:18 AM