Bug #12209: CephFS should have a complete timeout mechanism to avoid endless waiting or unpredictable errors - CephFS - Ceph

Actions

Copy link

Bug #12209

closed

CephFS should have a complete timeout mechanism to avoid endless waiting or unpredictable errors

Added by Wenjun Huang almost 9 years ago. Updated almost 8 years ago.

Status:

Won't Fix

Priority:

Normal

Assignee:

Category:

Administration/Usability

Target version:

% Done:

Source:

other

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

Labels (FS):

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Recently, when we made pressure test on cephfs through ceph-dokan using in windows, there are always some ceph-dokan processes crash more or less. After diving into the problem, we found the cause of the crash, and the scenario is described as below:
1. In some test case, the dokan driver make getattr request to the ceph-dokan process in userspace.
2. Then the Client::make_request is invoked and send the request to MDS.
3. For some reason, the MDS is very busy and can not respond to the client for a long time.
4. There is a timeout mechanism for IO request in dokan, it is 120s, if reached the timeout, the request will be killed, which will cause close operation and release related resources(Fh, Inode...).
5. But at some time after 120s, MDS give the client a response for the getattr request, and the Client got signaled and do following work. But but some related variable had already been released, so it will cause some null pointer error, and the program will crash.

After studying the CephFS code, we found that there are some inappropriate code in Client::make_request method it is:

caller_cond.Wait(client_lock);

This will lead the execution waiting until the overwhelmed MDS respond to the very-old request. And the timeout handling code would never be reached.

if (!request->reply) {
    assert(request->aborted);
    assert(!request->got_unsafe);
    request->item.remove_myself();
    mds_requests.erase(tid);
    put_request(request);  // request map's
    put_request(request);  // ours
    return -ETIMEDOUT;
}

So, I think we should use timeout wait instead.

caller_cond.WaitInterval(cct, client_lock, some_configured_time);

I have attached a patch for this issue, please feel free to let me know, if any problem.

Files

client_add_timeout_wait.patch (1.81 KB) client_add_timeout_wait.patch

Wenjun Huang, 07/03/2015 05:18 AM

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » CephFS

Custom queries

Bug #12209

CephFS should have a complete timeout mechanism to avoid endless waiting or unpredictable errors

Updated by huang jun almost 9 years ago

Updated by Wenjun Huang over 8 years ago

Updated by Sage Weil over 8 years ago

Updated by John Spray over 8 years ago

Updated by Greg Farnum over 8 years ago

Updated by Greg Farnum almost 8 years ago