Bug #12209

CephFS should have a complete timeout mechanism to avoid endless waiting or unpredictable errors

Added by Wenjun Huang over 8 years ago. Updated over 7 years ago.

Status:
Won't Fix
Priority:
Normal
Assignee:
-
Category:
Administration/Usability
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Recently, while running stress tests on CephFS through ceph-dokan on Windows, some ceph-dokan processes would occasionally crash. After digging into the problem, we found the cause of the crash; the scenario is as follows:
1. In some test cases, the Dokan driver issues a getattr request to the ceph-dokan process in userspace.
2. Client::make_request is then invoked and sends the request to the MDS.
3. For some reason, the MDS is very busy and cannot respond to the client for a long time.
4. Dokan has a 120s timeout for IO requests; when the timeout is reached, the request is killed, which triggers a close operation and releases the related resources (Fh, Inode, ...).
5. At some point after the 120s timeout, the MDS finally responds to the getattr request. The client is signaled and continues its work, but some of the related objects have already been released, so it hits null (or dangling) pointers and the process crashes.

After studying the CephFS code, we found some problematic code in the Client::make_request method:

caller_cond.Wait(client_lock);

This blocks execution until the overwhelmed MDS responds to the very old request, so the timeout handling code below is never reached:
if (!request->reply) {
    assert(request->aborted);
    assert(!request->got_unsafe);
    request->item.remove_myself();
    mds_requests.erase(tid);
    put_request(request);  // request map's
    put_request(request);  // ours
    return -ETIMEDOUT;
}

So I think we should use a timed wait instead:
caller_cond.WaitInterval(cct, client_lock, some_configured_time);

I have attached a patch for this issue; please feel free to let me know if there are any problems.

client_add_timeout_wait.patch View (1.81 KB) Wenjun Huang, 07/03/2015 05:18 AM

History

#1 Updated by huang jun over 8 years ago

I think you can send a pull request on GitHub.

#2 Updated by Wenjun Huang over 8 years ago

This issue is covered in http://tracker.ceph.com/issues/10944, and we have tested the two patches, which work well. The fixes have been applied to the newest master branch.

#3 Updated by Sage Weil over 8 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (11)

#4 Updated by John Spray over 8 years ago

This all sounds a bit strange. The designed behaviour is to block on slow metadata operations, not time out. How many applications really handle -ETIMEDOUT?

Especially, during MDS failover, we want clients to stop and wait for a new MDS to come online, so that the failure is not visible to the application.

#5 Updated by Greg Farnum over 8 years ago

Yes. I'm also not very comfortable with the patches that I've looked at, but it's been a while (they have been languishing in my todo list for a while). Zheng, I see that you had some patches and mentioned putting them in the userspace clients; did that happen at some point? My grepping says no (the only ETIMEDOUT in the FS code looks to be triggered only on mount timeouts) but I might be missing something.

#6 Updated by Greg Farnum over 7 years ago

  • Category set to Administration/Usability
  • Status changed from New to Won't Fix

There's been no movement here and we didn't seem to like the idea.
