Project

General

Profile

Feature #38951

implement buffered unlink in libcephfs

Added by Jeff Layton 3 months ago. Updated about 1 month ago.

Status:
New
Priority:
Normal
Assignee:
Category:
Performance/Resource Usage
Target version:
Start date:
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS):
libcephfs
Labels (FS):
Pull request ID:

Description

We have an open project to teach the client how to buffer creates when it has the right caps (Fx), and a delegated set of inode ranges. That's a fairly large project though, so first we should have the client buffer unlinks. That doesn't require dealing with a delegated set of inode ranges, and is simpler to implement.


Related issues

Related to fs - Feature #24461: cephfs: improve file create performance buffering file create operations New 06/08/2018
Related to fs - Feature #39098: mds: allow handing out Lx caps New
Related to fs - Feature #39354: mds: derive wrlock from excl caps In Progress 04/17/2019
Related to fs - Feature #39129: create mechanism to delegate ranges of inode numbers to client New

History

#1 Updated by Jeff Layton 3 months ago

I'm taking the approach that if we have to contact the server at all, then we probably should just send a synchronous delete. Most of the info we need to do a buffered unlink is in the Dentry object already. I'm thinking that we may want to just convert those to negative dentries that are flagged as "buffered". Then, Client::unlink can just queue them up asynchronously.

We may also need some mechanism to wait for all buffered unlinks in a directory to complete to handle fsync on a dir. We'll also need to think about how to handle recursive unlinks. I may punt on that initially and just plan to flush out any buffered unlinks when rmdir is called as well.

#2 Updated by Patrick Donnelly 3 months ago

  • Target version set to v15.0.0
  • Start date deleted (03/27/2019)
  • Source set to Development

Jeff, please link the various tracker tickets you create to sub-task the project with "related issues" so they don't get lost.

#3 Updated by Jeff Layton 3 months ago

  • Related to Feature #24461: cephfs: improve file create performance buffering file create operations added

#4 Updated by Jeff Layton 3 months ago

Today, we have no support for asynchronous MDS requests. make_request always blocks on the request. So the first step is to refactor out parts of make_request such that we can issue MetaRequests asynchronously. make_request can then call the new helpers and do a simple blocking wait on the result, and async callers can set it up to do something different when the reply comes in.

The typical pattern seems to be to use the Context subsystem for this sort of thing? In C I'd convert the condition variable to a function pointer and opaque argument, but there is probably some better way to do it in C++. The Context class looks like the right thing to use, but I haven't quite determined how best to use it.

#5 Updated by Patrick Donnelly 3 months ago

Jeff Layton wrote:

Today, we have no support for asynchronous MDS requests. make_request always blocks on the request. So the first step is to refactor out parts of make_request such that we can issue MetaRequests asynchronously. make_request can then call the new helpers and do a simple blocking wait on the result, and async callers can set it up to do something different when the reply comes in.

Yes, this is the first technical issue in the client that needs addressed.

The typical pattern seems to be to use the Context subsystem for this sort of thing? In C I'd convert the condition variable to a function pointer and opaque argument, but there is probably some better way to do it in C++. The Context class looks like the right thing to use, but I haven't quite determined how best to use it.

The Context mechanisms would be the right choice. C++17 asynchronous builtins are still kinda terrible so the Ceph callbacks continue to be the default.

There may be a better approach I'm not aware of though. I'd suggest a posting to ceph-devel seeking feedback on the proposed async redesign.

#6 Updated by Jeff Layton 3 months ago

I've started going through the kernel client, as I figured this would be more useful there initially (and because I understand the object lifetimes there better).

Now that I've started to look, I wonder is whether we should really attempt to buffer these things and flush them out later, or would it be better to just return immediately from a directory morphing operation, just after firing off the operation to the server when we have the appropriate caps buffered.

Buffering would allow us to move toward batched create/delete operations, but it's not 100% clear to me that that will give us measurably better performance than just allowing these operations to run in parallel by allowing the syscalls to return before the MDS reply comes in.

That sort of mechanism may be a lot simpler to implement.

#7 Updated by Patrick Donnelly 3 months ago

Jeff Layton wrote:

I've started going through the kernel client, as I figured this would be more useful there initially (and because I understand the object lifetimes there better).

Now that I've started to look, I wonder is whether we should really attempt to buffer these things and flush them out later, or would it be better to just return immediately from a directory morphing operation, just after firing off the operation to the server when we have the appropriate caps buffered.

Buffering would allow us to move toward batched create/delete operations, but it's not 100% clear to me that that will give us measurably better performance than just allowing these operations to run in parallel by allowing the syscalls to return before the MDS reply comes in.

That sort of mechanism may be a lot simpler to implement.

Batching the operations is an orthogonal design goal. It's not really necessary for this work.

#8 Updated by Jeff Layton 3 months ago

#9 Updated by Jeff Layton 3 months ago

Ok, I have a prototype implementation that depends on a couple of small MDS patches that are discussed here https://tracker.ceph.com/issues/39098.

The kernel patches still need a bit of cleanup but I'll probably post a preliminary set to the mailing list in the near future.

#10 Updated by Jeff Layton 2 months ago

I ran fsstress on these patches today and hit a deadlock:

# cat /sys/kernel/debug/ceph/fe6e8942-14b8-4783-aed2-323c1686333e.client4398/mdsc 
112    mds0    unlink     #1000000a275/t_mtab~30814 (test/t_mtab~30814)
113    mds0    create     #1000000a275/t_mtab~30816 (test/t_mtab~30816)

...we issued a an unlink and create in parallel, but the create happened first and the MDS attempted to revoke Fsx on the directory, but we had already taken references to them so we're stuck.

We may need to flush and block async operations when we need to do a synchronous operation on the directory to prevent this.

#11 Updated by Jeff Layton 2 months ago

Actually it may be sufficient to just wait on any existing dirops to complete before we do a synchronous one. I already had a patch to do that before issuing an rmdir on the parent, and we can just use the same routine to block synchronous link, rename and atomic_open ops. Testing a patch with that now.

EDIT: actually not...we could hit a similar deadlock if a second client tried to issue a create in a directory while we're doing an async unlink in that dir. The MDS would attempt to revoke caps in order to acquire locks for the create, but the client holding them wouldn't release them until the async unlink completed.

I think that fixing this means changing how the MDS deals with cap revokes. Rather than just blocking and waiting on the caps it needs for a request, it would need to basically requeue the request and allow competing ones to proceed, and revisit it later once the caps are available.

#12 Updated by Jeff Layton 2 months ago

#13 Updated by Jeff Layton about 2 months ago

  • Related to Feature #39129: create mechanism to delegate ranges of inode numbers to client added

#14 Updated by Jeff Layton about 1 month ago

Doing more testing today with my patchset. I doctored up a version of Zheng's MDS locking rework branch with some patches to hand out Lx caps when creating files.

I have a small shell script that creates a directory and touches a bunch of files in it and then does an rm -r on the dir. That seems to regularly fail with -ENOTEMPTY. Some printk debugging shows that the MDS is returning that from the RMDIR request.

What I'm not sure of is why. I've got code in the client that makes it wait until all outstanding requests involving the directory's children are complete (a'la req->r_complete), and that seems like it should be sufficient to ensure that we've played out all of the async requests.

The odd bit is that if I have the code call /bin/sync after creating the files and before the rm -r, it seems to work consistently.

#15 Updated by Patrick Donnelly about 1 month ago

Jeff Layton wrote:

Doing more testing today with my patchset. I doctored up a version of Zheng's MDS locking rework branch with some patches to hand out Lx caps when creating files.

I have a small shell script that creates a directory and touches a bunch of files in it and then does an rm -r on the dir. That seems to regularly fail with -ENOTEMPTY. Some printk debugging shows that the MDS is returning that from the RMDIR request.

What I'm not sure of is why. I've got code in the client that makes it wait until all outstanding requests involving the directory's children are complete (a'la req->r_complete), and that seems like it should be sufficient to ensure that we've played out all of the async requests.

The odd bit is that if I have the code call /bin/sync after creating the files and before the rm -r, it seems to work consistently.

Is there cap update traffic after calling /bin/sync? Might be that the MDS won't rmdir without the caps getting updated after the unlinks? (Doesn't sound right but...)

#16 Updated by Jeff Layton about 1 month ago

Found it. The problem is actually in ceph_mdsc_build_path. When passed a positive dentry, that function will return a zero-length path with pbase set to the inode number of the dentry inode. This is not what we want for unlink, and I think probably not in other cases as well.

We need that function to always return at least one path component. I have a patch that I'll post after doing some testing.

Also available in: Atom PDF