Project

General

Profile

Feature #24461

cephfs: improve file create performance buffering file create operations

Added by Patrick Donnelly over 1 year ago. Updated 23 days ago.

Status:
In Progress
Priority:
High
Assignee:
Category:
Performance/Resource Usage
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS):
Client, MDS, kceph
Labels (FS):
task(hard), task(intern)
Pull request ID:

Description

Serialized single-client file creation (e.g. untar/rsync) is an area CephFS (and most distributed file systems) continues to be weak on. Improving this is difficult without removing the round-trip with the MDS. One possibility for allowing this is to allocate a block of inodes to the client to create new files with. The client may then asynchronously solidify the creation of those files. To do this, the client should have a new cap for directories (can we reuse CEPH_CAP_GWR?) which guarantees exclusive access to the directory.


Related issues

Related to fs - Feature #18477: O_TMPFILE support in libcephfs New 01/10/2017
Related to fs - Feature #38951: implement buffered unlink in libcephfs New
Related to fs - Feature #39129: create mechanism to delegate ranges of inode numbers to client Resolved

History

#1 Updated by Jeff Layton over 1 year ago

Neat. NFS and SMB have directory delegations/leases, but I haven't studied the topic in detail.

So the idea is to preallocate anonymous inodes and grant them to the client, and then the client can just fill them out and add links for them in a directory where it has the appropriate caps? Done correctly, this might also be helpful for O_TMPFILE style anonymous creates as well, which would be a nice-to-have.

How will you handle the case where we start to fill out an anonymous inode before linking it into the directory, but then lose the GWR caps on the directory before you can link it in?

#2 Updated by Greg Farnum over 1 year ago

We've talked about this quite a lot in the past. I thought we had a tracker ticket for it, but on searching the most relevant thing I see is an old email archived at https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg27317.html

I think you'll find that file creates are just about the least scalable thing you can do on CephFS right now, though, so there is some easier ground. One obvious approach is to extend the current inode preallocation — it already allocates inodes per-client and has a fast path inside of the MDS for handing them back. It'd be great if clients were aware of that preallocation and could create files without waiting for the MDS to talk back to them! The issue with this is two-fold:

1) need to update the cap flushing protocol to deal with files newly created by the client
2) need to handle all the backtrace stuff normally performed by the MDS on file create (which still needs to happen, on either the client or the server)

There's also clean up in case of a client failure, but we've already got a model for that in how we figure out real file sizes and things based on max size.

#3 Updated by Patrick Donnelly over 1 year ago

  • Tracker changed from Bug to Feature

#4 Updated by Patrick Donnelly 12 months ago

  • Assignee set to Jeff Layton

#5 Updated by Jeff Layton 12 months ago

  • Target version changed from v14.0.0 to v15.0.0

#6 Updated by Jeff Layton 12 months ago

I think this is not really a single project, but a set of them. At a high level:

  • ensure that the MDS can hand out exclusive caps on directories, sufficient to allow dentries to be linked into it without involvement from the MDS (Patrick seemed to think it currently doesn't hand those caps out)
  • have MDS hand a range of inode numbers to a client when it has exclusive caps on the directory. Greg mentions above that we already preallocate ranges of inodes per-client, but the clients aren't aware of it. We'd need a way to inform the client of the range it is currently allowed to use (maybe extend MClientCaps?).
  • add a way for the the client to send buffered creates in batches. We'll probably need a CEPH_MDS_OP_CREATE_BATCH call for this (at a minimum)? If the client is sending a batched create, we'll probably want to assume that it starts out with full exclusive caps for any inode that it is creating (to better handle the untar use-case). Still, we'll have to handle the situation where the client crashes before it can send back the caps, so we will need to send full inode info in the batched create call.
  • a batched UNLINK call could be useful too, and may be an easier place to get started here.
  • It might be nice to consider the O_TMPFILE case here as well. We could allow the client to create unlinked, but open inodes, and then link them in after the fact. Possibly consider a batched LINK call as well?

A lot of questions still need to be resolved:

  • what should trigger the client to flush the buffered creates back to the server? An fsync on the directory and syncfs, obviously, and maybe when we exhaust our preallocated inode number range?
  • We'll also need to think about how many creates we can reasonably allow the client to buffer at a time. 10? 100? 1000? Maybe we'll want to use a window that increases exponentially as the client exhausts its range of numbers (suitably capped of course). Ideally, the reply to a batched create call would inform the client of the most current inode number range(s).
  • Do we need separate batched MKDIR and MKNOD calls too, or can we get away with a single, generic, CREATE call that sends the type?

#7 Updated by Patrick Donnelly 12 months ago

Jeff Layton wrote:

I think this is not really a single project, but a set of them. At a high level:

ensure that the MDS can hand out exclusive caps on directories, sufficient to allow dentries to be linked into it without involvement from the MDS (Patrick seemed to think it currently doesn't hand those caps out)

EXCL and WR. I don't think the MDS ever considers handing out WR to clients.

have MDS hand a range of inode numbers to a client when it has exclusive caps on the directory. Greg mentions above that we already preallocate ranges of inodes per-client, but the clients aren't aware of it. We'd need a way to inform the client of the range it is currently allowed to use (maybe extend MClientCaps?).
add a way for the the client to send buffered creates in batches. We'll probably need a CEPH_MDS_OP_CREATE_BATCH call for this (at a minimum)? If the client is sending a batched create, we'll probably want to assume that it starts out with full exclusive caps for any inode that it is creating (to better handle the untar use-case). Still, we'll have to handle the situation where the client crashes before it can send back the caps, so we will need to send full inode info in the batched create call.

Let's also be clear about the advantages of batched create: we obtain the necessary locks once (!) and fewer messages (Anything else?). If the client has exclusive caps for the directory inode, the batched create should also trivially obtain all the locks too.

I think it's reasonable that the batched file create should be per-directory to simplify locking. We should probably also require the client has WR|EXCL in order to use it.

Also, I think it should behave like openat, taking the directory inode, dentry name, and the inode #.

a batched UNLINK call could be useful too, and may be an easier place to get started here.

indeed!

It might be nice to consider the O_TMPFILE case here as well. We could allow the client to create unlinked, but open inodes, and then link them in after the fact. Possibly consider a batched LINK call as well?

Yes, that should be easy to fit in.

A lot of questions still need to be resolved:

what should trigger the client to flush the buffered creates back to the server? An fsync on the directory and syncfs, obviously, and maybe when we exhaust our preallocated inode number range?

Some tick in the client every ~5 seconds or preallocated inode pressure.

We'll also need to think about how many creates we can reasonably allow the client to buffer at a time. 10? 100? 1000? Maybe we'll want to use a window that increases exponentially as the client exhausts its range of numbers (suitably capped of course). Ideally, the reply to a batched create call would inform the client of the most current inode number range(s).

Whatever the limit is, we'll need to figure it out through experimentation. Obviously, there should be one but I think it should just be dictated by the MDS and not the client.

Do we need separate batched MKDIR and MKNOD calls too, or can we get away with a single, generic, CREATE call that sends the type?

Single generic CREATE is attractive.

#8 Updated by Jeff Layton 12 months ago

Patrick Donnelly wrote:

EXCL and WR. I don't think the MDS ever considers handing out WR to clients.

This is something that would be very nice to have clearly documented somewhere. The caps system is great, but it's somewhat difficult to tell what caps you actually need for particular operations.

Let's also be clear about the advantages of batched create: we obtain the necessary locks once (!) and fewer messages (Anything else?). If the client has exclusive caps for the directory inode, the batched create should also trivially obtain all the locks too.

I think it's reasonable that the batched file create should be per-directory to simplify locking. We should probably also require the client has WR|EXCL in order to use it.

Yes to both. I see no benefit to batching up creates across different parent directories.

Also, I think it should behave like openat, taking the directory inode, dentry name, and the inode #.

Not just the inode number. The MDS has to fully instantiate these inodes, so we'll need to send full inode info -- current size, c/mtime, layout info, etc.

  • It might be nice to consider the O_TMPFILE case here as well. We could allow the client to create unlinked, but open inodes, and then link them in after the fact. Possibly consider a batched LINK call as well?

Yes, that should be easy to fit in.

Note that we'd also need to handle O_TMPFILE in the case where the client can't get exclusive caps on the parent dir. That may complicate things a bit, but I think we can still do it.

A lot of questions still need to be resolved:

  • what should trigger the client to flush the buffered creates back to the server? An fsync on the directory and syncfs, obviously, and maybe when we exhaust our preallocated inode number range?

Some tick in the client every ~5 seconds or preallocated inode pressure.

Makes sense. We'll probably need some tunables until we get a better feel for this.

  • We'll also need to think about how many creates we can reasonably allow the client to buffer at a time. 10? 100? 1000? Maybe we'll want to use a window that increases exponentially as the client exhausts its range of numbers (suitably capped of course). Ideally, the reply to a batched create call would inform the client of the most current inode number range(s).

Whatever the limit is, we'll need to figure it out through experimentation. Obviously, there should be one but I think it should just be dictated by the MDS and not the client.

Absolutely.

  • Do we need separate batched MKDIR and MKNOD calls too, or can we get away with a single, generic, CREATE call that sends the type?

Single generic CREATE is attractive.

Agreed.

Another thing to consider:

What about nesting? If I mkdir -p foo/bar/baz, do we try to buffer them all or do we require flushing a parent dir back to the MDS before you can create new dentries in it? I'm inclined to allow buffering everything, but that means that we will always need to flush the parents back to the MDS before their children (which I think is OK).

#9 Updated by Jeff Layton 12 months ago

Some notes about the preallocation piece:

The prealloc_inos interval set is tracked per-session in session_info_t. It gets encoded when that object is encoded, but that seems to only occur between MDSs. That structure doesn't ever seem to be transmitted to the client.

I think we're going to have to assign the clients a different interval_set, in any case. The prealloc_inos interval set is for use by the MDS and we don't have a way to coordinate access to it with the client. I think we will need to allocate and track a separate range on a per-Session basis for this.

The MDS can't shrink this set unilaterally, so I think we'll want to dribble them out to the client in small chunks (max a hundred or so at a time at most?).

To communicate this range to the client, we could use a MClientSession message. Maybe version (somehow) and extend struct ceph_mds_session_head with the interval_set currently granted to the client? The MDS could then push updates to the client in any CEPH_SESSION_* message, and maybe we could add new calls. One the client can use to request a new set, and maybe a MDS->Client revoke function that tells it to flush everything and stop doing buffered creates.

#10 Updated by Patrick Donnelly 12 months ago

Jeff Layton wrote:

To communicate this range to the client, we could use a MClientSession message. Maybe version (somehow) and extend struct ceph_mds_session_head with the interval_set currently granted to the client? The MDS could then push updates to the client in any CEPH_SESSION_* message, and maybe we could add new calls. One the client can use to request a new set, and maybe a MDS->Client revoke function that tells it to flush everything and stop doing buffered creates.

It may also be useful to also communicate the current allocated set in MClientReply. (It should not shrink or invalidate previous allocations!)

#11 Updated by Patrick Donnelly 11 months ago

#12 Updated by Jeff Layton 11 months ago

EXCL and WR. I don't think the MDS ever considers handing out WR to clients.

Just so I'm clear...why do we care about WR caps here? If I have Fx caps on a directory, then does Fw carry any extra significance?

#13 Updated by Patrick Donnelly 11 months ago

Jeff Layton wrote:

EXCL and WR. I don't think the MDS ever considers handing out WR to clients.

Just so I'm clear...why do we care about WR caps here? If I have Fx caps on a directory, then does Fw carry any extra significance?

Fx would indicate the directory contents can be cached by the client, yes? Perhaps the client doesn't necessarily get permission to write in some circumstances.

#14 Updated by Jeff Layton 11 months ago

Patrick Donnelly wrote:

Fx would indicate the directory contents can be cached by the client, yes? Perhaps the client doesn't necessarily get permission to write in some circumstances.

I think so? Today I think Fx on a directory is functionally equivalent to Fs. It seems like it ought to be ok to allow the client to buffer up creates if it has Fx caps as no one should be able to get Fs until Fx is returned.

That said, I don't quite understand the distinction between Fx and Fw on normal files, so I could be wrong here.

#15 Updated by Patrick Donnelly 11 months ago

Jeff Layton wrote:

Patrick Donnelly wrote:

Fx would indicate the directory contents can be cached by the client, yes? Perhaps the client doesn't necessarily get permission to write in some circumstances.

I think so? Today I think Fx on a directory is functionally equivalent to Fs. It seems like it ought to be ok to allow the client to buffer up creates if it has Fx caps as no one should be able to get Fs until Fx is returned.

Okay, here's how I think it should work but it may not be how it actually works:

Fsrc: can readdir and cache results (dentries); multiple clients may have these caps
Fsr: can readdir; cannot cache results
Fsw: doesn't make sense
Fxrc: same as Fsrc; no reason to have Fx caps without w
(new) Fxw(b) == can create new dentries buffered

Now, I haven't learned exactly how a client/MDS knows that a client's view of a directory is "complete". For Fxwb, that's important since the client doesn't want to clobber files. However, that doesn't stop the client from getting these caps before the directory is complete. It just needs to go through the motions of loading all the dentries OR getattr()ing the dentries it wants to create ahead of time (important when the directory is large!).

That said, I don't quite understand the distinction between Fx and Fw on normal files, so I could be wrong here.

I think we always give Fxwb and not just Fxw. So Fxwb would indicate the client can buffer writes. Fswb is impossible (unless we're doing LAZYIO but that's indicated with Fl instead).

#16 Updated by Zheng Yan 11 months ago

Jeff Layton wrote:

Patrick Donnelly wrote:

Fx would indicate the directory contents can be cached by the client, yes? Perhaps the client doesn't necessarily get permission to write in some circumstances.

I think so? Today I think Fx on a directory is functionally equivalent to Fs. It seems like it ought to be ok to allow the client to buffer up creates if it has Fx caps as no one should be able to get Fs until Fx is returned.

That said, I don't quite understand the distinction between Fx and Fw on normal files, so I could be wrong here.

For directory inode, current mds only issues Fsx (at most) caps to client. It never issues Frwcb caps to client.

Fsx (x implies s) caps on a directory is different from Fs caps when client creates/unlinks file. If a client only has Fs caps, it has to release the Fs caps to MDS when creating/unlinking file. Losing Fs caps means dir lease becomes invalid. If a client has Fsx caps, it does not needs to release Fs/Fx caps when creating/unlinking file. Dir lease is still valid when create/unlink finishes.

#17 Updated by Jeff Layton 11 months ago

Zheng Yan wrote:

For directory inode, current mds only issues Fsx (at most) caps to client. It never issues Frwcb caps to client.

Fsx (x implies s) caps on a directory is different from Fs caps when client creates/unlinks file. If a client only has Fs caps, it has to release the Fs caps to MDS when creating/unlinking file. Losing Fs caps means dir lease becomes invalid. If a client has Fsx caps, it does not needs to release Fs/Fx caps when creating/unlinking file. Dir lease is still valid when create/unlink finishes.

Thanks Zheng. Yes, this interaction between dentry leases and caps is what I'm trying to sort out here, but this seems to be contrary to what Sage and Greg were saying on ceph-devel.

Maybe we need a concrete example:

Suppose we have a directory (/foo) with a file in it (/foo/bar). We have a Fs caps on /foo and a dentry lease on /foo/bar. Another client then creates a new file in that directory (/foo/baz). At that point the MDS revokes Fs caps from the first client. Does the dentry lease on /foo/bar become invalid at that point?

If so why? I thought the whole point of dentry leases was so that you could make changes to a parent dir without invalidating every dentry under it.

#18 Updated by Jeff Layton 11 months ago

Patrick Donnelly wrote:

Okay, here's how I think it should work but it may not be how it actually works:

Fsrc: can readdir and cache results (dentries); multiple clients may have these caps
Fsr: can readdir; cannot cache results
Fsw: doesn't make sense
Fxrc: same as Fsrc; no reason to have Fx caps without w
(new) Fxw(b) == can create new dentries buffered

At least for directories, there seems to be no functional difference between Fs and Fr caps -- ditto Fx and Fw. IOW, the Fr/w flags on a dir seem to be entirely superfluous. Maybe we should eschew the r/w flags here since they just add confusion?

Now, I haven't learned exactly how a client/MDS knows that a client's view of a directory is "complete". For Fxwb, that's important since the client doesn't want to clobber files. However, that doesn't stop the client from getting these caps before the directory is complete. It just needs to go through the motions of loading all the dentries OR getattr()ing the dentries it wants to create ahead of time (important when the directory is large!).

The client is what determines "completeness" on a directory, and it sets that flag after doing a (complete) readdir or when an inodestat indicates that a directory is empty.

In order to allow a buffered create, the client will need to ensure completeness on the directory, or it will need to have a valid negative dentry lease for the file being created (i.e. failed lookup where the MDS has granted a dentry lease on the result).

#19 Updated by Jeff Layton 11 months ago

Patrick Donnelly wrote:

Okay, here's how I think it should work but it may not be how it actually works:

Fsrc: can readdir and cache results (dentries); multiple clients may have these caps
Fsr: can readdir; cannot cache results
Fsw: doesn't make sense
Fxrc: same as Fsrc; no reason to have Fx caps without w
(new) Fxw(b) == can create new dentries buffered

Today, I'm fairly certain that the MDS only hands out Fsx caps (at most) on directories and there isn't a lot of difference between Fs and Fx on a directory in the current code (given that we can't buffer up any sort of directory level change currently).

Given that r/w caps don't really have much meaning on a directory, I move that we don't hand them out, period.

That just leaves Fbc -- given that there isn't a lot of difference between Fx and Fs, what benefit will we derive from adding Fbc them into the mix here? In what situations would we give a client Fs, but not Fsc? Ditto for Fx and Fxb?

#20 Updated by Greg Farnum 11 months ago

Given that we already use non-cap flags, and directories are special anyway, I'm not sure extending the cap language to cover this is the way to go.
I mean, it might be! But I think if a directory is complete, and the client has Fx on it, you already know what you need in order to buffer creates on it. And adding in other caps in that case is just a recipe for confusion unless we figure out a need for more granularity.

#21 Updated by Jeff Layton 11 months ago

That's sort of my point here (though I didn't put it quite as succinctly). I don't think adding more cap flags really helps us here. If we've granted Fx on a directory, then that seems like it ought to be sufficient to allow the client to buffer creates.

That said, now that I've read over Sage's comment, we maybe should just add in Fb, pro forma, and always grant and revoke that on directories together. Then if we ever did want to grant exclusive caps on the F metadata while denying buffered changes to a dir, we could start issuing them separately.

#22 Updated by Greg Farnum 11 months ago

Jeff Layton wrote:

That's sort of my point here (though I didn't put it quite as succinctly). I don't think adding more cap flags really helps us here. If we've granted Fx on a directory, then that seems like it ought to be sufficient to allow the client to buffer creates.

Well, as someone noted, you also need to make sure you aren't creating a dentry that already exists but that the client doesn't have cached. I believe that is separate from granting Fx caps.

That said, now that I've read over Sage's comment, we maybe should just add in Fb, pro forma, and always grant and revoke that on directories together. Then if we ever did want to grant exclusive caps on the F metadata while denying buffered changes to a dir, we could start issuing them separately.

Should probably draw out the cases here. My concern with adding Fb and a meaning to it is, how does that interact with the COMPLETE+ORDERED flags that the client already maintains now? Can you have Fb without those flags being set? Does one necessarily imply the other once they're separate? What happens if they somehow disagree, like because the client elects to trim some dentries from cache but still has the Fb cap from the MDS?

#23 Updated by Jeff Layton 11 months ago

Greg Farnum wrote:

Well, as someone noted, you also need to make sure you aren't creating a dentry that already exists but that the client doesn't have cached. I believe that is separate from granting Fx caps.

Yep, separate thing, but necessary.

Should probably draw out the cases here. My concern with adding Fb and a meaning to it is, how does that interact with the COMPLETE+ORDERED flags that the client already maintains now? Can you have Fb without those flags being set? Does one necessarily imply the other once they're separate? What happens if they somehow disagree, like because the client elects to trim some dentries from cache but still has the Fb cap from the MDS?

In order to buffer creates the client will need:

  1. an unused ino_t from a range delegated by the MDS
  2. Fb caps on the parent directory
  3. either I_COMPLETE on the parent directory or a lease on a negative dentry with the same name

#24 Updated by Zheng Yan 11 months ago

Jeff Layton wrote:

Zheng Yan wrote:

For directory inode, current mds only issues Fsx (at most) caps to client. It never issues Frwcb caps to client.

Fsx (x implies s) caps on a directory is different from Fs caps when client creates/unlinks file. If a client only has Fs caps, it has to release the Fs caps to MDS when creating/unlinking file. Losing Fs caps means dir lease becomes invalid. If a client has Fsx caps, it does not needs to release Fs/Fx caps when creating/unlinking file. Dir lease is still valid when create/unlink finishes.

Thanks Zheng. Yes, this interaction between dentry leases and caps is what I'm trying to sort out here, but this seems to be contrary to what Sage and Greg were saying on ceph-devel.

Maybe we need a concrete example:

Suppose we have a directory (/foo) with a file in it (/foo/bar). We have a Fs caps on /foo and a dentry lease on /foo/bar. Another client then creates a new file in that directory (/foo/baz). At that point the MDS revokes Fs caps from the first client. Does the dentry lease on /foo/bar become invalid at that point?

lease on /foo/bar is still valid after dir loses Fs caps

If so why? I thought the whole point of dentry leases was so that you could make changes to a parent dir without invalidating every dentry under it.

#25 Updated by Zheng Yan 11 months ago

Jeff Layton wrote:

Greg Farnum wrote:

Well, as someone noted, you also need to make sure you aren't creating a dentry that already exists but that the client doesn't have cached. I believe that is separate from granting Fx caps.

Yep, separate thing, but necessary.

Should probably draw out the cases here. My concern with adding Fb and a meaning to it is, how does that interact with the COMPLETE+ORDERED flags that the client already maintains now? Can you have Fb without those flags being set? Does one necessarily imply the other once they're separate? What happens if they somehow disagree, like because the client elects to trim some dentries from cache but still has the Fb cap from the MDS?

In order to buffer creates the client will need:

  1. an unused ino_t from a range delegated by the MDS
  2. Fb caps on the parent directory
  3. either I_COMPLETE on the parent directory or a lease on a negative dentry with the same name

good point.

#26 Updated by Jeff Layton 11 months ago

Some notes and status:

I've been going over the code and playing with cephfs-shell to create different cap handling scenarios. I have patches that teach the MDS to hand out Fb caps on directories (particularly, newly-created ones), and a client patch that helps prevent it from giving up Fb too quickly.

Simple test: with cephfs-shell create a directory, and then from a different cephfs-shell (different client) make a directory inside that directory. The first client will get full caps for the first directory created (caps=pAsxLsXsxFsxb). The second mkdir involves 3 separate (and synchronous) cap revoke messages to the first client. First for Ax caps, then for Fx, and finally for Fs.

My guess is that the first is for authentication on the parent, the second to allow a change to the directory and the last one to ensure that the client sees the changes. I wonder if we could improve performance here by consolidating some of those cap revokes (particularly Fsx)?

On another note, I do have at least one minor concern about using Fb to indicate that buffered creates are allowed. We currently use Fs to indicate that the client is allowed to cache directory entries. Shouldn't that be based on Fc instead? Otherwise we are sort of breaking the semantic parity between Fc and Fb on directories.

#27 Updated by Zheng Yan 11 months ago

Jeff Layton wrote:

Some notes and status:

I've been going over the code and playing with cephfs-shell to create different cap handling scenarios. I have patches that teach the MDS to hand out Fb caps on directories (particularly, newly-created ones), and a client patch that helps prevent it from giving up Fb too quickly.

Simple test: with cephfs-shell create a directory, and then from a different cephfs-shell (different client) make a directory inside that directory. The first client will get full caps for the first directory created (caps=pAsxLsXsxFsxb). The second mkdir involves 3 separate (and synchronous) cap revoke messages to the first client. First for Ax caps, then for Fx, and finally for Fs.

My guess is that the first is for authentication on the parent, the second to allow a change to the directory and the last one to ensure that the client sees the changes. I wonder if we could improve performance here by consolidating some of those cap revokes (particularly Fsx)?

Yes. But These caps are controlled by different lock.

On another note, I do have at least one minor concern about using Fb to indicate that buffered creates are allowed. We currently use Fs to indicate that the client is allowed to cache directory entries. Shouldn't that be based on Fc instead? Otherwise we are sort of breaking the semantic parity between Fc and Fb on directories.

I think Fx is enough for buffered created. Fx implies Fscbrw, Fs implies Fcr. when filelock is xlocked (for truncate/setlayout), only Fcb is allowed. For directory inode, there is no operation that requires xlocking filelock.

#28 Updated by Jeff Layton 11 months ago

Zheng Yan wrote:

On another note, I do have at least one minor concern about using Fb to indicate that buffered creates are allowed. We currently use Fs to indicate that the client is allowed to cache directory entries. Shouldn't that be based on Fc instead? Otherwise we are sort of breaking the semantic parity between Fc and Fb on directories.

I think Fx is enough for buffered created. Fx implies Fscbrw, Fs implies Fcr. when filelock is xlocked (for truncate/setlayout), only Fcb is allowed. For directory inode, there is no operation that requires xlocking filelock.

That's what I would think too.

Sage made the point on the mailing list that Fsx pertains to the metadata and Fcbrw is all about the data. That may be correct, but as a practical matter it makes no difference. You can't change the data without altering the metadata in some fashion (mtime/iversion at the very least), so just working with Fsx on directories is simpler.

#29 Updated by Patrick Donnelly 11 months ago

Jeff Layton wrote:

Zheng Yan wrote:

On another note, I do have at least one minor concern about using Fb to indicate that buffered creates are allowed. We currently use Fs to indicate that the client is allowed to cache directory entries. Shouldn't that be based on Fc instead? Otherwise we are sort of breaking the semantic parity between Fc and Fb on directories.

I think Fx is enough for buffered created. Fx implies Fscbrw, Fs implies Fcr. when filelock is xlocked (for truncate/setlayout), only Fcb is allowed. For directory inode, there is no operation that requires xlocking filelock.

That's what I would think too.

Sage made the point on the mailing list that Fsx pertains to the metadata and Fcbrw is all about the data. That may be correct, but as a practical matter it makes no difference. You can't change the data without altering the metadata in some fashion (mtime/iversion at the very least), so just working with Fsx on directories is simpler.

Agreed.

#30 Updated by Jeff Layton 11 months ago

Great, so let's do a minor revision on the rules above. In order to buffer creates the client will need:

  1. an unused ino_t from a range delegated by the MDS
  2. Fx caps on the parent directory
  3. either I_COMPLETE on the parent directory or a valid negative dentry with the same name

I'll drop the cap handling patches on directories that I've been playing with, as I think we probably already have the correct behavior there with the Fsx today.

#31 Updated by Jeff Layton 11 months ago

I have a couple of starter patches that add a new delegated_inos interval_set to session_info_t.

Questions at this point:

  1. How many to hand the client at a time? The default for mds_client_prealloc_inos is 1000. What I think we'll probably want to do is to move sets of 10-100 at a time from prealloc_inos into delegated_inos, and hand those off to the client. More that that would probably allow the client to buffer up too much at a time. We may need a tunable for this quantity until we get a feel for how best to do this.
  2. what messages should contain updated delegated_inos interval sets? I had originally thought MClientSesssion, but these things won't be useful until you have Fx caps on a directory. Maybe we should do this via MClientCaps and MClientReply? The MDS could allocate a set to the client the first time it hands out Fx caps on a directory. The MDS would then update that set periodically as the client issues creates that shrink the set.
  3. Should we allow to push out buffered inodes using the traditional CREATE/MKDIR MDS ops? I think we probably should as that's a nice interim step before we have to implement batched creation. It should also speed up file creation on clients as long as it isn't doing it too quickly.
  4. Do we also need some mechanism to return unused inos? My thinking is not at first, and that we'd just release them when a session is torn down.

...and now that I've gone over this list, I think I probably ought to start by teaching the client how to buffer unlinks when it has Fx caps on the containing directory. That doesn't require a delegated inode range, and should be simpler to implement, particularly if we start by just having it dribble out the unlinks with CEPH_MDS_OP_UNLINK/CEPH_MDS_OP_RMDIR.

#32 Updated by Zheng Yan 11 months ago

Jeff Layton wrote:

I have a couple of starter patches that add a new delegated_inos interval_set to session_info_t.

Questions at this point:

  1. How many to hand the client at a time? The default for mds_client_prealloc_inos is 1000. What I think we'll probably want to do is to move sets of 10-100 at a time from prealloc_inos into delegated_inos, and hand those off to the client. More that that would probably allow the client to buffer up too much at a time. We may need a tunable for this quantity until we get a feel for how best to do this.

agree

  1. what messages should contain updated delegated_inos interval sets? I had originally thought MClientSesssion, but these things won't be useful until you have Fx caps on a directory. Maybe we should do this via MClientCaps and MClientReply? The MDS could allocate a set to the client the first time it hands out Fx caps on a directory. The MDS would then update that set periodically as the client issues creates that shrink the set.

Maybe MClientReply of requests that allocated new inodes. delegated_inos should be per-session, not per-directory.

  1. Should we allow to push out buffered inodes using the traditional CREATE/MKDIR MDS ops? I think we probably should as that's a nice interim step before we have to implement batched creation. It should also speed up file creation on clients as long as it isn't doing it too quickly.

agree

  1. Do we also need some mechanism to return unused inos? My thinking is not at first, and that we'd just release them when a session is torn down.

agree

...and now that I've gone over this list, I think I probably ought to start by teaching the client how to buffer unlinks when it has Fx caps on the containing directory. That doesn't require a delegated inode range, and should be simpler to implement, particularly if we start by just having it dribble out the unlinks with CEPH_MDS_OP_UNLINK/CEPH_MDS_OP_RMDIR.

#33 Updated by Jeff Layton 11 months ago

  • Priority changed from Urgent to High

#34 Updated by Jeff Layton 11 months ago

  • Related to Feature #38951: implement buffered unlink in libcephfs added

#35 Updated by Jeff Layton 11 months ago

  • Subject changed from cephfs: improve file create performance by allocating inodes to clients to cephfs: improve file create performance buffering file create operations

#36 Updated by Jeff Layton 11 months ago

  • Related to Feature #39129: create mechanism to delegate ranges of inode numbers to client added

#37 Updated by Jeff Layton 2 months ago

I finally have a working patchset for this, and so far, the results are somewhat lackluster. I'm seeing about the same performance on an untar with this enabled vs. not. That said, my test rig is pretty crappy and not really suitable for perf testing, so it could just be bottlenecked elsewhere. I still need to roll a simple single threaded create() test too, which I'll do soon. so we can just test that synthetically.

#38 Updated by Jeff Layton 2 months ago

The patchset currently has a module option to enable this that defaults of "off". So I can do some apples to apples testing here. Simple create and remove script:

#!/bin/sh

TESTDIR=/mnt/cephfs/test-unlink.$$

mkdir $TESTDIR
stat $TESTDIR
echo "Creating files in $TESTDIR" 
time for i in `seq 1 10000`; do
    touch $TESTDIR/$i
done
echo "Starting rm" 
time rm -r $TESTDIR

With async dirops disabled:

$ ./test-async-dirops.sh 
  File: /mnt/cephfs/test-unlink.14324
  Size: 0             Blocks: 0          IO Block: 65536  directory
Device: 31h/49d    Inode: 1099512211341  Links: 2
Access: (0775/drwxrwxr-x)  Uid: ( 4447/ jlayton)   Gid: ( 4447/ jlayton)
Context: system_u:object_r:cephfs_t:s0
Access: 2019-12-11 11:40:11.829576684 -0500
Modify: 2019-12-11 11:40:11.829576684 -0500
Change: 2019-12-11 11:40:11.829576684 -0500
 Birth: 2019-12-11 11:40:11.829576684 -0500
Creating files in /mnt/cephfs/test-unlink.14324

real    0m55.559s
user    0m2.774s
sys    0m35.850s
Starting rm

real    0m17.007s
user    0m0.056s
sys    0m1.930s

With them enabled:

$ ./test-async-dirops.sh 
  File: /mnt/cephfs/test-unlink.4237
  Size: 0             Blocks: 0          IO Block: 65536  directory
Device: 31h/49d    Inode: 1099511960533  Links: 2
Access: (0775/drwxrwxr-x)  Uid: ( 4447/ jlayton)   Gid: ( 4447/ jlayton)
Context: system_u:object_r:cephfs_t:s0
Access: 2019-12-11 11:37:56.520652287 -0500
Modify: 2019-12-11 11:37:56.520652287 -0500
Change: 2019-12-11 11:37:56.520652287 -0500
 Birth: 2019-12-11 11:37:56.520652287 -0500
Creating files in /mnt/cephfs/test-unlink.4237

real    0m50.538s
user    0m2.550s
sys    0m35.043s
Starting rm

real    0m10.101s
user    0m0.007s
sys    0m3.689s

I wonder if we might get better create performance by ramping up the number of inodes we're delegating to the client. I suspect we may occasionally run out of delegated inodes and end up doing sync creates. I'll do some experimentation with that next.

#39 Updated by Jeff Layton 2 months ago

I've pushed the current patch stack to https://github.com/ceph/ceph-client/tree/wip-async-dirops .

It's still very much a wip at this point, but it does seem to be functional, assuming the directory you're mounting reports a ceph.dir.layout, and you're working against a MDS with my delegated inodes patchset and zheng's lock caching.

#40 Updated by Jeff Layton 2 months ago

Changed my test rig around a bit so I could give bluestore a LV backed by an SSD, and rebuilt the kernel w/o KASAN.

Baseline (with async dirops disabled):

Creating files in /mnt/cephfs/test-unlink.3131

real    0m16.988s
user    0m4.108s
sys    0m4.675s
Starting rm

real    0m17.882s
user    0m0.052s
sys    0m0.583s

With patchset:

Creating files in /mnt/cephfs/test-unlink.15433

real    0m15.702s
user    0m4.044s
sys    0m4.536s
Starting rm

real    0m10.669s
user    0m0.006s
sys    0m1.060s

I then changed the MDS to preallocate 10000 inodes at a time, which makes it delegate 1000 at a time to the client:

Creating files in /mnt/cephfs/test-unlink.25471

real    0m7.573s
user    0m3.581s
sys    0m3.725s
Starting rm

real    0m10.163s
user    0m0.003s
sys    0m1.355s

So maintaining a pipeline of available inode numbers looks crucial here. I may need to rework the ino delegation set to hand out more at a time and/or refill the client's pool more aggressively.

#41 Updated by Patrick Donnelly 2 months ago

The baseline performance is surprising I think. That's with the same MDS patches? The usual full tilt create/second rate for the MDS is 700 c/s. The baseline would be 14 seconds in that case if the client were perfectly parallelizing the creates. IOW, you almost reached that without any asynchronous ops! Perhaps the lock caching is helping here?

I'm really excited to see what numbers we'll see when we start batching creates/unlink!!

#42 Updated by Jeff Layton 2 months ago

Doing some testing today with xfstests, during generic/531 test, I saw some of these pop up in the kernel ring buffer:

[ 6133.546915] ceph_async_create_cb: err=0 deleg_ino=0x10000096d63 target=0x10000097525
[ 6133.548797] ceph_async_create_cb: err=0 deleg_ino=0x10000096d64 target=0x10000097526
[ 6133.550534] ceph_async_create_cb: err=0 deleg_ino=0x10000096d65 target=0x10000097527
[ 6133.554150] ceph_async_create_cb: err=0 deleg_ino=0x10000096d66 target=0x10000097528

Basically, we sent a delegated inode that got ignored. The MDS log showed this:

2019-12-13T14:28:22.789-0500 7f0add0de700  1 mds.a Updating MDS map to version 20 from mon.0
2019-12-13T14:28:26.791-0500 7f0add0de700  1 mds.a Updating MDS map to version 21 from mon.0
2019-12-13T14:28:30.795-0500 7f0add0de700  1 mds.a Updating MDS map to version 22 from mon.0
2019-12-13T14:28:34.838-0500 7f0add0de700  1 mds.a Updating MDS map to version 23 from mon.0
2019-12-13T14:28:35.183-0500 7f0add0de700  0 mds.0.server WARNING: client specified 0x10000096d5c and i allocated 0x100000972ef
2019-12-13T14:28:35.183-0500 7f0add0de700 -1 log_channel(cluster) log [ERR] : client.4536 specified ino 0x10000096d5c but mds.0 allocated 0x100000972ef
2019-12-13T14:28:35.184-0500 7f0add0de700  0 mds.0.server WARNING: client specified 0x10000096d5d and i allocated 0x1000009751f
2019-12-13T14:28:35.184-0500 7f0add0de700 -1 log_channel(cluster) log [ERR] : client.4536 specified ino 0x10000096d5d but mds.0 allocated 0x1000009751f
2019-12-13T14:28:35.184-0500 7f0add0de700  0 mds.0.server WARNING: client specified 0x10000096d5e and i allocated 0x10000097520
2019-12-13T14:28:35.184-0500 7f0add0de700 -1 log_channel(cluster) log [ERR] : client.4536 specified ino 0x10000096d5e but mds.0 allocated 0x10000097520
2019-12-13T14:28:35.186-0500 7f0add0de700  0 mds.0.server WARNING: client specified 0x10000096d5f and i allocated 0x10000097521
2019-12-13T14:28:35.186-0500 7f0add0de700 -1 log_channel(cluster) log [ERR] : client.4536 specified ino 0x10000096d5f but mds.0 allocated 0x10000097521
2019-12-13T14:28:35.186-0500 7f0add0de700  0 mds.0.server WARNING: client specified 0x10000096d60 and i allocated 0x10000097522
2019-12-13T14:28:35.186-0500 7f0add0de700 -1 log_channel(cluster) log [ERR] : client.4536 specified ino 0x10000096d60 but mds.0 allocated 0x10000097522
2019-12-13T14:28:35.187-0500 7f0add0de700  0 mds.0.server WARNING: client specified 0x10000096d61 and i allocated 0x10000097523
2019-12-13T14:28:35.187-0500 7f0add0de700 -1 log_channel(cluster) log [ERR] : client.4536 specified ino 0x10000096d61 but mds.0 allocated 0x10000097523
2019-12-13T14:28:35.188-0500 7f0add0de700  0 mds.0.server WARNING: client specified 0x10000096d62 and i allocated 0x10000097524
2019-12-13T14:28:35.188-0500 7f0add0de700 -1 log_channel(cluster) log [ERR] : client.4536 specified ino 0x10000096d62 but mds.0 allocated 0x10000097524
2019-12-13T14:28:35.189-0500 7f0add0de700  0 mds.0.server WARNING: client specified 0x10000096d63 and i allocated 0x10000097525
2019-12-13T14:28:35.189-0500 7f0add0de700 -1 log_channel(cluster) log [ERR] : client.4536 specified ino 0x10000096d63 but mds.0 allocated 0x10000097525
2019-12-13T14:28:35.189-0500 7f0add0de700  0 mds.0.server WARNING: client specified 0x10000096d64 and i allocated 0x10000097526
2019-12-13T14:28:35.189-0500 7f0add0de700 -1 log_channel(cluster) log [ERR] : client.4536 specified ino 0x10000096d64 but mds.0 allocated 0x10000097526
2019-12-13T14:28:35.190-0500 7f0add0de700  0 mds.0.server WARNING: client specified 0x10000096d65 and i allocated 0x10000097527
2019-12-13T14:28:35.190-0500 7f0add0de700 -1 log_channel(cluster) log [ERR] : client.4536 specified ino 0x10000096d65 but mds.0 allocated 0x10000097527
2019-12-13T14:28:35.191-0500 7f0add0de700  0 mds.0.server WARNING: client specified 0x10000096d66 and i allocated 0x10000097528
2019-12-13T14:28:35.191-0500 7f0add0de700 -1 log_channel(cluster) log [ERR] : client.4536 specified ino 0x10000096d66 but mds.0 allocated 0x10000097528

I'm not yet sure how this could occur. Next step is to change the MDS to dump out the prealloc and delegated interval sets when this occurs.

#43 Updated by Jeff Layton 2 months ago

Ok, I think I've figured out what's going on with the inode number reuse.

I changed the code to not remove ino_t entries from the xarray in the reply callback, and the problem basically went away. I think what's happening is that we're ending up with some replies being processed out of order, and after we remove an entry, it gets put back into the xarray and reused.

I need to think about how we can ensure that a set of inode numbers represents the state after a given inode was created.

#44 Updated by Zheng Yan 2 months ago

Patrick Donnelly wrote:

The baseline performance is surprising I think. That's with the same MDS patches? The usual full tilt create/second rate for the MDS is 700 c/s. T

In my local test cluster (two machine directly connected with 10G network, ssd metadata pool). Single mds can handle 5k create per second, 10k unlink per second.

he baseline would be 14 seconds in that case if the client were perfectly parallelizing the creates. IOW, you almost reached that without any asynchronous ops! Perhaps the lock caching is helping here?

I checked performance impact of lock cache. There is no obvious difference.

I'm really excited to see what numbers we'll see when we start batching creates/unlink!!

#45 Updated by Patrick Donnelly about 2 months ago

Zheng Yan wrote:

Patrick Donnelly wrote:

The baseline performance is surprising I think. That's with the same MDS patches? The usual full tilt create/second rate for the MDS is 700 c/s. T

In my local test cluster (two machine directly connected with 10G network, ssd metadata pool). Single mds can handle 5k create per second, 10k unlink per second.

I've never seen the MDS operate so quickly. What do your tests look like?

he baseline would be 14 seconds in that case if the client were perfectly parallelizing the creates. IOW, you almost reached that without any asynchronous ops! Perhaps the lock caching is helping here?

I checked performance impact of lock cache. There is no obvious difference.

Uh, I'm confused then.

#46 Updated by Jeff Layton about 2 months ago

Looking at my home-grown testcase, the results look pretty good, but an untarring a random kernel tarball is considerably less rosy.

Baseline (async dirops disabled):

[jlayton@cephfs cephfs]$ strace -c tar xf ~/linux-4.18.0-153.el8.jlayton.006.tar && strace -c rm -rf /mnt/cephfs/linux-4.18.0-153.el8.jlayton.006
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 51.62   16.166510         258     62426           openat
 27.55    8.627581          61    139790           write
  9.48    2.969676          44     66538           utimensat
  4.47    1.400045          22     62431           close
  3.85    1.207220          14     83816           read
  2.92    0.915349         221      4125         4 mkdirat
  0.04    0.011957         114       104           newfstatat
  0.04    0.011038         344        32           symlinkat
  0.02    0.006091         234        26           unlinkat
  0.00    0.001007        1007         1           execve
  0.00    0.000998          19        50           mmap
  0.00    0.000625          12        52         1 fstat
  0.00    0.000306          61         5           socket
  0.00    0.000215          43         5         4 connect
  0.00    0.000208          14        14           mprotect
  0.00    0.000156           9        16           lseek
  0.00    0.000140          23         6           sendto
  0.00    0.000125           9        13           poll
  0.00    0.000095          47         2         1 access
  0.00    0.000085          28         3           munmap
  0.00    0.000048          24         2           statfs
  0.00    0.000038           5         7           fcntl
  0.00    0.000035          11         3           brk
  0.00    0.000020           6         3           rt_sigaction
  0.00    0.000018           6         3           getpid
  0.00    0.000015           7         2         1 arch_prctl
  0.00    0.000007           7         1           rt_sigprocmask
  0.00    0.000007           7         1           prlimit64
  0.00    0.000005           5         1           set_tid_address
  0.00    0.000005           5         1           set_robust_list
  0.00    0.000000           0         2           umask
  0.00    0.000000           0         1           geteuid
------ ----------- ----------- --------- --------- ----------------
100.00   31.319625                419482        11 total
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 78.16    9.801302         147     66538           unlinkat
  8.70    1.090761          88     12376           getdents64
  4.51    0.565975          45     12416           close
  4.14    0.519559          62      8288           openat
  1.87    0.234869          11     20625           fcntl
  1.60    0.200786          48      4125           newfstatat
  0.99    0.124480          15      8288           fstat
  0.01    0.001327          25        52           brk
  0.01    0.000842         842         1           execve
  0.00    0.000257         257         1           fstatfs
  0.00    0.000235          29         8           mmap
  0.00    0.000065          16         4         1 lseek
  0.00    0.000064          64         1         1 access
  0.00    0.000059          14         4           read
  0.00    0.000034           8         4           mprotect
  0.00    0.000023          11         2         1 arch_prctl
  0.00    0.000019          19         1           lstat
  0.00    0.000012          12         1           ioctl
  0.00    0.000000           0         1           munmap
------ ----------- ----------- --------- --------- ----------------
100.00   12.540669                132736         3 total

With async dirops enabled:

[jlayton@cephfs cephfs]$ strace -c tar xf ~/linux-4.18.0-153.el8.jlayton.006.tar && strace -c rm -rf /mnt/cephfs/linux-4.18.0-153.el8.jlayton.006
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 42.38   16.982829         121    139790           write
 41.50   16.627749         266     62426           openat
  7.35    2.945424          44     66538           utimensat
  3.30    1.321361          15     83816           read
  3.25    1.303068          20     62431           close
  2.15    0.861871         208      4125           mkdirat
  0.03    0.010191         318        32           symlinkat
  0.02    0.009137          91       100           newfstatat
  0.01    0.003501         134        26           unlinkat
  0.00    0.001393        1393         1           execve
  0.00    0.001212          24        50           mmap
  0.00    0.000594          11        52         1 fstat
  0.00    0.000539         107         5           socket
  0.00    0.000337          67         5         4 connect
  0.00    0.000328          23        14           mprotect
  0.00    0.000141          10        13           poll
  0.00    0.000135           8        16           lseek
  0.00    0.000134          22         6           sendto
  0.00    0.000102          34         3           munmap
  0.00    0.000068          34         2         1 access
  0.00    0.000062          31         2           statfs
  0.00    0.000060           8         7           fcntl
  0.00    0.000034          11         3           getpid
  0.00    0.000033          11         3           brk
  0.00    0.000025           8         3           rt_sigaction
  0.00    0.000018           9         2         1 arch_prctl
  0.00    0.000016           8         2           umask
  0.00    0.000012          12         1           rt_sigprocmask
  0.00    0.000010          10         1           geteuid
  0.00    0.000008           8         1           set_tid_address
  0.00    0.000008           8         1           set_robust_list
  0.00    0.000008           8         1           prlimit64
------ ----------- ----------- --------- --------- ----------------
100.00   40.070408                419478         7 total
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 78.03    9.595375         144     66538           unlinkat
  9.31    1.145363          92     12376           getdents64
  4.38    0.538771          65      8288           openat
  3.36    0.413157          33     12416           close
  2.02    0.247924          12     20625           fcntl
  1.71    0.210171          50      4125           newfstatat
  1.15    0.141220          17      8288           fstat
  0.02    0.002140        2140         1           execve
  0.01    0.001279          24        52           brk
  0.00    0.000310          38         8           mmap
  0.00    0.000164          41         4           mprotect
  0.00    0.000105         105         1           fstatfs
  0.00    0.000094          23         4           read
  0.00    0.000066          16         4         1 lseek
  0.00    0.000037          37         1           munmap
  0.00    0.000034          34         1         1 access
  0.00    0.000025          12         2         1 arch_prctl
  0.00    0.000017          17         1           lstat
  0.00    0.000013          13         1           ioctl
------ ----------- ----------- --------- --------- ----------------
100.00   12.296265                132736         3 total

So, the average time in openat() is about the same, but the time in write() goes up. I suspect this represents the overhead to create the object on first write.

I wonder if we ought to fire off an async OSD operation to create the first backing object when we create an inode? Doing that in parallel with the async MDS create would make some sense.

#47 Updated by Zheng Yan about 1 month ago

Patrick Donnelly wrote:

Zheng Yan wrote:

Patrick Donnelly wrote:

The baseline performance is surprising I think. That's with the same MDS patches? The usual full tilt create/second rate for the MDS is 700 c/s. T

In my local test cluster (two machine directly connected with 10G network, ssd metadata pool). Single mds can handle 5k create per second, 10k unlink per second.

I've never seen the MDS operate so quickly. What do your tests look like?

run multiple instances of following program. (for in `seq 1 5`; do test_create testdir$i & true; done)

#define _FILE_OFFSET_BITS 64
#include <features.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <cephfs/libcephfs.h>

int main(int argc, char *argv[]) {

        struct ceph_mount_info *cmount = NULL;

        ceph_create(&cmount, "admin");
        ceph_conf_read_file(cmount, "ceph.conf");
        ceph_mount(cmount, NULL);

        ceph_chdir(cmount, argv[1]);

        int n = 64;
        if (argc > 2)
                n = atoi(argv[2]);

        for (int j = 0; j < n; j++) {
                char buf[128];
                sprintf(buf, "dir%d", j);
                int ret = ceph_mkdir(cmount, buf, 0755);
                if (ret < 0 && ret != -EEXIST) {
                        printf("ceph_mkdir fail %d\n", ret);
                }
        }

        int count = 0;
        time_t start = time(NULL);
        for (int j = 0; j < n; j++) {
                for (int i = 0; i < 10000; ++i) {
                        char buf[128];
                        sprintf(buf, "dir%d/file%d", j, i);
                        int fd = ceph_open(cmount, buf, O_CREAT|O_RDWR, 0755);
                        if (fd < 0) {
                                printf("ceph_open fail %d\n", fd);
                                break;
                        }
                        ceph_close(cmount, fd);
                        count++;
                        if (time(NULL) > start) {
                                printf("%d\n", count);
                                count = 0;
                                start = time(NULL);
                        }
                }
        }
        ceph_unmount(cmount);
        return 0;
}

he baseline would be 14 seconds in that case if the client were perfectly parallelizing the creates. IOW, you almost reached that without any asynchronous ops! Perhaps the lock caching is helping here?

I checked performance impact of lock cache. There is no obvious difference.

Uh, I'm confused then.

#48 Updated by Jeff Layton about 1 month ago

I built a tree based on 1e2fe722c41d4cc34094afb157b3eb06b4a50972, which is the commit just before the merge of Zheng's work. I had to doctor the kernel I've been using to test as the OCTOPUS feature bit is already enabled on the MDS side.

I see pretty similar numbers to what we're seeing with Zheng's set in place and without async dirops. There may be some performance improvement from the lock caching but it's not large:

$ ./test-async-dirops.sh 
  File: /mnt/cephfs/test-dirops.1758
  Size: 0             Blocks: 0          IO Block: 65536  directory
Device: 30h/48d    Inode: 1099511637850  Links: 2
Access: (0775/drwxrwxr-x)  Uid: ( 4447/ jlayton)   Gid: ( 4447/ jlayton)
Context: system_u:object_r:cephfs_t:s0
Access: 2020-01-03 06:08:20.054505500 -0500
Modify: 2020-01-03 06:08:20.054505500 -0500
Change: 2020-01-03 06:08:20.054505500 -0500
 Birth: 2020-01-03 06:08:20.054505500 -0500
Creating files in /mnt/cephfs/test-dirops.1758

real    0m16.288s
user    0m0.199s
sys    0m3.611s
Starting rm

real    0m15.133s
user    0m0.028s
sys    0m1.893s

strace -c tar xf ~/linux-4.18.0-153.el8.jlayton.006.tar && strace -c rm -rf /mnt/cephfs/linux-4.18.0-153.el8.jlayton.006
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 51.00   17.145058         274     62426           openat
 26.18    8.802461          62    139790           write
  8.86    2.978631          44     66538           utimensat
  7.12    2.393749          28     83816           read
  4.19    1.408896          22     62431           close
  2.57    0.862629         209      4125           mkdirat
  0.03    0.009136          91       100           newfstatat
  0.03    0.008987         280        32           symlinkat
  0.01    0.003826         147        26           unlinkat
  0.01    0.002327        2327         1           execve
  0.00    0.000968          19        50           mmap
  0.00    0.000547          10        52         1 fstat
  0.00    0.000433          86         5           socket
  0.00    0.000343          68         5         4 connect
  0.00    0.000205          14        14           mprotect
  0.00    0.000174          29         6           sendto
  0.00    0.000125           7        16           lseek
  0.00    0.000077          38         2         1 access
  0.00    0.000073           5        13           poll
  0.00    0.000054          18         3           brk
  0.00    0.000048          16         3           munmap
  0.00    0.000034          17         2         1 arch_prctl
  0.00    0.000019           2         7           fcntl
  0.00    0.000013           4         3           getpid
  0.00    0.000000           0         3           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         2           umask
  0.00    0.000000           0         1           geteuid
  0.00    0.000000           0         2           statfs
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         1           prlimit64
------ ----------- ----------- --------- --------- ----------------
100.00   33.618813                419478         7 total
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 79.55   10.387026         156     66538           unlinkat
  8.90    1.162258          93     12376           getdents64
  3.94    0.513963          62      8288           openat
  3.09    0.403474          32     12416           close
  1.80    0.235409          11     20625           fcntl
  1.74    0.227779          55      4125           newfstatat
  0.95    0.124278          14      8288           fstat
  0.01    0.001449          27        52           brk
  0.01    0.000976         976         1           execve
  0.00    0.000234         234         1           fstatfs
  0.00    0.000186          23         8           mmap
  0.00    0.000096          24         4           mprotect
  0.00    0.000061          15         4         1 lseek
  0.00    0.000046          11         4           read
  0.00    0.000036          36         1         1 access
  0.00    0.000033          33         1           munmap
  0.00    0.000023          23         1           lstat
  0.00    0.000019           9         2         1 arch_prctl
  0.00    0.000013          13         1           ioctl
------ ----------- ----------- --------- --------- ----------------
100.00   13.057359                132736         3 total

#49 Updated by Patrick Donnelly about 1 month ago

Thanks for checking. I'll have to play around with this more myself.

#50 Updated by Jeff Layton about 1 month ago

Now that I look closer, I don't think strace c is measuring what we need. It's looking at CPU time in each syscall. The fact that the async one takes more CPU time is not surprising - it's doing more of the work in the context of the task that issued the syscall. In the sync case, the task spends more time sleeping to wait on the reply and a lot of the work is done in workqueue context.

I sat down and rolled a bpftrace script to look at several functions:

// open
kprobe:ceph_atomic_open {
    @start[tid] = nsecs;
}

kretprobe:ceph_atomic_open /@start[tid]/ {
    @open[comm] = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}

// write
kprobe:ceph_write_begin {
    @start[tid] = nsecs;
}

kretprobe:ceph_write_end /@start[tid]/ {
    @write[comm] = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}

// mkdir
kprobe:ceph_mkdir {
    @start[tid] = nsecs;
}

kretprobe:ceph_mkdir /@start[tid]/ {
    @mkdir[comm] = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}

// unlink
kprobe:ceph_unlink {
    @start[tid] = nsecs;
}

kretprobe:ceph_unlink /@start[tid]/ {
    @unlink[comm] = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}

With async dirops disabled, and running a script to create and write to 10000 files in a dir and then rm -rf it:

@mkdir[mkdir]: 
[1M, 2M)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@open[test-async-diro]: 
[256K, 512K)           4 |                                                    |
[512K, 1M)          9758 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1M, 2M)             198 |@                                                   |
[2M, 4M)              18 |                                                    |
[4M, 8M)               6 |                                                    |
[8M, 16M)              1 |                                                    |
[16M, 32M)             2 |                                                    |
[32M, 64M)             1 |                                                    |
[64M, 128M)            8 |                                                    |
[128M, 256M)           3 |                                                    |
[256M, 512M)           0 |                                                    |
[512M, 1G)             0 |                                                    |
[1G, 2G)               1 |                                                    |

@unlink[rm]: 
[256K, 512K)           3 |                                                    |
[512K, 1M)          9005 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1M, 2M)             927 |@@@@@                                               |
[2M, 4M)              45 |                                                    |
[4M, 8M)               6 |                                                    |
[8M, 16M)              5 |                                                    |
[16M, 32M)             6 |                                                    |
[32M, 64M)             0 |                                                    |
[64M, 128M)            1 |                                                    |
[128M, 256M)           1 |                                                    |
[256M, 512M)           0 |                                                    |
[512M, 1G)             2 |                                                    |

@write[test-async-diro]: 
[2K, 4K)             756 |@@@@@                                               |
[4K, 8K)            7445 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)           1603 |@@@@@@@@@@@                                         |
[16K, 32K)           182 |@                                                   |
[32K, 64K)            10 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           4 |                                                    |

With async dirops enabled:

@mkdir[mkdir]: 
[8M, 16M)              1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@open[test-async-diro]: 
[8K, 16K)             39 |                                                    |
[16K, 32K)           472 |@@@                                                 |
[32K, 64K)          2732 |@@@@@@@@@@@@@@@@@@@@@                               |
[64K, 128K)         6557 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[128K, 256K)         179 |@                                                   |
[256K, 512K)           7 |                                                    |
[512K, 1M)             1 |                                                    |
[1M, 2M)               3 |                                                    |
[2M, 4M)               4 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              0 |                                                    |
[16M, 32M)             0 |                                                    |
[32M, 64M)             0 |                                                    |
[64M, 128M)            0 |                                                    |
[128M, 256M)           6 |                                                    |

@unlink[rm]: 
[1K, 2K)            4655 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)            4319 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
[4K, 8K)             942 |@@@@@@@@@@                                          |
[8K, 16K)             62 |                                                    |
[16K, 32K)            13 |                                                    |
[32K, 64K)             1 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           2 |                                                    |
[512K, 1M)             2 |                                                    |
[1M, 2M)               3 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               1 |                                                    |
[8M, 16M)              0 |                                                    |
[16M, 32M)             0 |                                                    |
[32M, 64M)             0 |                                                    |
[64M, 128M)            0 |                                                    |
[128M, 256M)           0 |                                                    |
[256M, 512M)           0 |                                                    |
[512M, 1G)             0 |                                                    |
[1G, 2G)               0 |                                                    |
[2G, 4G)               1 |                                                    |

@write[test-async-diro]: 
[2K, 4K)             525 |@@@@@                                               |
[4K, 8K)            5460 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)           3846 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |
[16K, 32K)           148 |@                                                   |
[32K, 64K)             9 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           3 |                                                    |
[256K, 512K)           9 |                                                    |

unlink and open are clearly running faster with the async ops enabled, but the create phase still took longer to run. I'm now trying to determine where the extra time is incurred.

#51 Updated by Jeff Layton about 1 month ago

I can use strace to get timing statistics on individual calls though. With async dirops disabled:

1758  08:23:12.255393 openat(AT_FDCWD, "/mnt/cephfs/test-dirops.1758/9", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3 <0.000800>
1758  08:23:12.256265 fcntl(1, F_GETFD) = 0 <0.000020>
1758  08:23:12.256309 fcntl(1, F_DUPFD, 10) = 10 <0.000011>
1758  08:23:12.256338 fcntl(1, F_GETFD) = 0 <0.000010>
1758  08:23:12.256365 fcntl(10, F_SETFD, FD_CLOEXEC) = 0 <0.000010>
1758  08:23:12.256393 dup2(3, 1)        = 1 <0.000011>
1758  08:23:12.256420 close(3)          = 0 <0.000015>
1758  08:23:12.256453 write(1, "foobarbaz\n", 10) = 10 <0.000022>
1758  08:23:12.256496 dup2(10, 1)       = 1 <0.000011>
1758  08:23:12.256525 fcntl(10, F_GETFD) = 0x1 (flags FD_CLOEXEC) <0.000011>
1758  08:23:12.256554 close(10)         = 0 <0.000011>
1758  08:23:12.256585 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 <0.000011>
1758  08:23:12.256618 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000011>

...with them enabled:

1766  08:23:22.839718 openat(AT_FDCWD, "/mnt/cephfs/test-dirops.1766/9", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3 <0.000086>
1766  08:23:22.839832 fcntl(1, F_GETFD) = 0 <0.000011>
1766  08:23:22.839862 fcntl(1, F_DUPFD, 10) = 10 <0.000010>
1766  08:23:22.839889 fcntl(1, F_GETFD) = 0 <0.000010>
1766  08:23:22.839917 fcntl(10, F_SETFD, FD_CLOEXEC) = 0 <0.000010>
1766  08:23:22.839945 dup2(3, 1)        = 1 <0.000010>
1766  08:23:22.839972 close(3)          = 0 <0.000010>
1766  08:23:22.839999 write(1, "foobarbaz\n", 10) = 10 <0.000728>
1766  08:23:22.840775 dup2(10, 1)       = 1 <0.000019>
1766  08:23:22.840818 fcntl(10, F_GETFD) = 0x1 (flags FD_CLOEXEC) <0.000010>
1766  08:23:22.840847 close(10)         = 0 <0.000010>
1766  08:23:22.840880 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 <0.000011>
1766  08:23:22.840915 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000011>

...so the open is faster by ~10x, but the follow-on write to the file is slower by ~10x.

#52 Updated by Jeff Layton about 1 month ago

dynamic debugging from the client, with async dirops disabled. This is during the write calls:

[ 6538.454077] ceph:  check_caps 000000009b49ae65 file_want pAsxXsxFxwb used Fcb dirty Fw flushing - issued pAsxLsXsxFsxcrwb revoking - retain pAsxLsxXsxFsxcrwbl
[ 6538.454079] ceph:   mds0 cap 000000009ef327b2 used Fcb issued pAsxLsXsxFsxcrwb implemented pAsxLsXsxFsxcrwb revoking -
[ 6538.454084] ceph:  release inode 000000009b49ae65 regular file 000000001c161912
[ 6538.454086] ceph:  put_fmode 000000009b49ae65 fmode 2 {0,0,0,0}
[ 6538.454086] ceph:  __cap_delay_cancel 000000009b49ae65
[ 6538.454088] ceph:  __ceph_caps_issued 000000009b49ae65 cap 000000009ef327b2 issued pAsxLsXsxFsxcrwb
[ 6538.454091] ceph:  check_caps 000000009b49ae65 file_want - used Fcb dirty Fw flushing - issued pAsxLsXsxFsxcrwb revoking - retain pAsLsXsFscb
[ 6538.454092] ceph:   mds0 cap 000000009ef327b2 used Fcb issued pAsxLsXsxFsxcrwb implemented pAsxLsXsxFsxcrwb revoking -
[ 6538.454094] ceph:   delaying issued pAsxLsXsxFsxcrwb -> pAsLsXsFscb, wanted pAsxXsxFxwb -> -
[ 6538.454096] ceph:  __cap_delay_requeue 000000009b49ae65 flags 56 at 4301265988
[ 6538.454127] ceph:  do_getattr inode 00000000902466a6 mask As mode 040755
[ 6538.454128] ceph:  __ceph_caps_issued_mask ino 0x1 cap 00000000fe3c061b issued pAsLsXs (mask As)
[ 6538.454130] ceph:  __touch_cap 00000000902466a6 cap 00000000fe3c061b mds0
[ 6538.454132] ceph:  d_revalidate 000000004dbe306e 'test-dirops.1881' inode 00000000287f9981 offset 0x0
[ 6538.454134] ceph:  dentry_lease_is_valid - dentry 000000004dbe306e = 1
[ 6538.454135] ceph:  d_revalidate 000000004dbe306e valid
[ 6538.454136] ceph:  do_getattr inode 00000000287f9981 mask As mode 040775
[ 6538.454138] ceph:  __ceph_caps_issued_mask ino 0x10000004f97 cap 00000000a64e2655 issued pAsLsXsFsxc (mask As)
[ 6538.454139] ceph:  __touch_cap 00000000287f9981 cap 00000000a64e2655 mds0
[ 6538.454142] ceph:  do_getattr inode 00000000287f9981 mask As mode 040775
[ 6538.454144] ceph:  __ceph_caps_issued_mask ino 0x10000004f97 cap 00000000a64e2655 issued pAsLsXsFsxc (mask As)
[ 6538.454145] ceph:  __touch_cap 00000000287f9981 cap 00000000a64e2655 mds0

...and with them enabled:

[ 6570.290082] ceph:  check_caps 00000000e0ef23c9 file_want pAsxXsxFxwb used Fcb dirty Fw flushing - issued pAsxLsXsxFsxcrwb revoking - retain pAsxLsxXsxFsxcrwbl
[ 6570.290084] ceph:   mds0 cap 00000000314ddffb used Fcb issued pAsxLsXsxFsxcrwb implemented pAsxLsXsxFsxcrwb revoking -
[ 6570.290092] ceph:  release inode 00000000e0ef23c9 regular file 00000000a098993b
[ 6570.290094] ceph:  put_fmode 00000000e0ef23c9 fmode 2 {1,0,1,0}
[ 6570.290133] ceph:  do_getattr inode 00000000902466a6 mask As mode 040755
[ 6570.290135] ceph:  __ceph_caps_issued_mask ino 0x1 cap 00000000fe3c061b issued pAsLsXs (mask As)
[ 6570.290137] ceph:  __touch_cap 00000000902466a6 cap 00000000fe3c061b mds0
[ 6570.290140] ceph:  d_revalidate 0000000041af3820 'test-dirops.1891' inode 00000000287f9981 offset 0x0
[ 6570.290141] ceph:  dentry_lease_is_valid - dentry 0000000041af3820 = 1
[ 6570.290142] ceph:  d_revalidate 0000000041af3820 valid
[ 6570.290144] ceph:  do_getattr inode 00000000287f9981 mask As mode 040775
[ 6570.290146] ceph:  __ceph_caps_issued_mask ino 0x10000004f9d cap 00000000a64e2655 issued pAsLsXsFsxc (mask As)
[ 6570.290147] ceph:  __touch_cap 00000000287f9981 cap 00000000a64e2655 mds0
[ 6570.290150] ceph:  do_getattr inode 00000000287f9981 mask As mode 040775
[ 6570.290152] ceph:  __ceph_caps_issued_mask ino 0x10000004f9d cap 00000000a64e2655 issued pAsLsXsFsxc (mask As)
[ 6570.290153] ceph:  __touch_cap 00000000287f9981 cap 00000000a64e2655 mds0

With them disabled, the client does this between the put_fmode and do_getattr:

[ 6538.454086] ceph:  __cap_delay_cancel 000000009b49ae65
[ 6538.454088] ceph:  __ceph_caps_issued 000000009b49ae65 cap 000000009ef327b2 issued pAsxLsXsxFsxcrwb
[ 6538.454091] ceph:  check_caps 000000009b49ae65 file_want - used Fcb dirty Fw flushing - issued pAsxLsXsxFsxcrwb revoking - retain pAsLsXsFscb
[ 6538.454092] ceph:   mds0 cap 000000009ef327b2 used Fcb issued pAsxLsXsxFsxcrwb implemented pAsxLsXsxFsxcrwb revoking -
[ 6538.454094] ceph:   delaying issued pAsxLsXsxFsxcrwb -> pAsLsXsFscb, wanted pAsxXsxFxwb -> -
[ 6538.454096] ceph:  __cap_delay_requeue 000000009b49ae65 flags 56 at 4301265988

I suspect this may account for the difference, but I'm not clear on why it's happening yet.

#53 Updated by Zheng Yan about 1 month ago

For current async create code. ceph_mds_reply_inode::max_size is 0. client can't write to the new file until it gets unsafe reply. Also see https://github.com/ceph/ceph/blob/master/src/mds/Server.cc#L4355

#55 Updated by Jeff Layton about 1 month ago

I threw in a hack to do this:

+       in.truncate_size = cpu_to_le64(-1);
+       in.max_size = cpu_to_le64(4194304);

I suspect we'd want to save these from the first sync create as well?

Still, this doesn't seem to universally help. The write to the first async created file is fast, but then all of the others still go slow again. I'll do some more digging to see what's going on.

#56 Updated by Jeff Layton about 1 month ago

At this point, I'm 90% sure the problem is in xattrs. Basically, after creating the file async we're leaving the i_xattr field zeroed out, which cues the client to believe that it doesn't know the status of the xattrs. Looking at how to fake that up now.

#57 Updated by Jeff Layton about 1 month ago

Yes, setting a zero length i_xattrs buffer on the new inode seems to have corrected the problem. I believe what was happening was that the kernel was doing a getxattr on "security.capability" on every write to see if it needed to zero out the field, so we'd end up doing a getattr before the first write.

Since we know a newly-created inode won't have this set, we can just ensure that the new inode is set up with a zero-length set of cached xattrs.

The latest branch has this fixed. I've also incorporated Zheng's suggestion to set the max_size and truncate_size parameters as well. I'm copying those from the values that the MDS sent in the first create in the dir.

This is my current working branch, though it still needs some more polish before it's ready to post:

https://github.com/ceph/ceph-client/commits/wip-async-dirops

#58 Updated by Zheng Yan about 1 month ago

#59 Updated by Jeff Layton about 1 month ago

Zheng Yan wrote:

please add a flag that tell if a request is async.
https://github.com/ukernel/ceph/commit/54f6bbdc85505ddea21583e9c237f7036393a125

Can you explain why that's needed? The only real difference to the MDS is that the create request is specifying the inode number. Why should it care whether the client has decided to wait on the reply or not?

#60 Updated by Zheng Yan about 1 month ago

mainly for wait_for_create_inode() function in MDS. Also make mds print error if it failed to handle async request.

#61 Updated by Jeff Layton about 1 month ago

Zheng Yan wrote:

mainly for wait_for_create_inode() function in MDS. Also make mds print error if it failed to handle async request.

Why exactly do we need to wait_for_create_inode()? Is the concern that a subsequent request (maybe a SETATTR) could be done while we're still in the middle of handling the async create? I thought the MDS' guaranteed ordering of requests would ensure that that wouldn't be an issue?

#62 Updated by Jeff Layton about 1 month ago

There's a potential problem I spotted today with copying the layouts from the first synchronous create.

Suppose we do a synchronous create in a directory that inherits its layout from its parent. The client gets back DIR_CREATE caps, and some inode numbers, and copies off the layout. Another client then changes the layout on the directory's parent such that new inodes will get a new layout.

The client however has already saved off a different layout and the new files will end up being created by the MDS with a mismatched layout. I think we probably need to have the client vet that the layout is as expected when the create reply comes in, and set a writeback error on the inode (and maybe ensure that further reads and writes on the fd fail).

We should probably have the MDS revoke Dc caps for the entire subtree under a directory when its layout is changed. That sounds a little ugly, but I think it's probably necessary.

That last bit sounds a little ugly, tbh, but I think it's probably necessary. The lat

#63 Updated by Patrick Donnelly about 1 month ago

  • Status changed from New to In Progress

Jeff Layton wrote:

There's a potential problem I spotted today with copying the layouts from the first synchronous create.

Suppose we do a synchronous create in a directory that inherits its layout from its parent. The client gets back DIR_CREATE caps, and some inode numbers, and copies off the layout. Another client then changes the layout on the directory's parent such that new inodes will get a new layout.

The client however has already saved off a different layout and the new files will end up being created by the MDS with a mismatched layout. I think we probably need to have the client vet that the layout is as expected when the create reply comes in, and set a writeback error on the inode (and maybe ensure that further reads and writes on the fd fail).

We should probably have the MDS revoke Dc caps for the entire subtree under a directory when its layout is changed. That sounds a little ugly, but I think it's probably necessary.

Zheng explained to me in some prior thread that the MDS does revoke the async dir caps when the layout is changed. In that way, when the kernel client tries to reacquire the caps, it should see the new layout.

#64 Updated by Jeff Layton about 1 month ago

Great! I'll still plan to add in a sanity check for this in the client too.

#65 Updated by Zheng Yan about 1 month ago

Jeff Layton wrote:

Zheng Yan wrote:

mainly for wait_for_create_inode() function in MDS. Also make mds print error if it failed to handle async request.

Why exactly do we need to wait_for_create_inode()? Is the concern that a subsequent request (maybe a SETATTR) could be done while we're still in the middle of handling the async create?

yes, for subsequent setattr and cap message.

I thought the MDS' guaranteed ordering of requests would ensure that that wouldn't be an issue?

with lock cache for create, mds still needs to xlock a null dentry. If other client has lease on the null dentry, mds has to revoke it before creating new inode. So the ordering is not guaranteed.

#66 Updated by Zheng Yan about 1 month ago

Jeff Layton wrote:

Great! I'll still plan to add in a sanity check for this in the client too.

Patrick is right. Also see commit 'client: enable async create if have sufficient caps' in https://github.com/ceph/ceph/pull/32576

#67 Updated by Jeff Layton about 1 month ago

I still don't understand what value this flag adds. Why not just always have requests involving an inode wait on the create to be complete? In the synchronous case, that wait would (almost) always be a no-op anyway.

Also, it seems like it could happen with a synchronous create too. A client could issue a create for an inode nearly simultaneously with another client issuing a setattr to the same inode (maybe via NFS file-handle guessing attack, or something else).

#68 Updated by Zheng Yan about 1 month ago

OK, I move the wait to client side. See commit "client: wait for async creating before sending request or cap message" in https://github.com/ceph/ceph/pull/32576

#69 Updated by Jeff Layton 24 days ago

What I have is mostly working now, but I'm occasionally seeing an async create come back with -EEXIST when running xfstest generic/531. That test just creates and unlinks the same dentry repeatedly, and I suspect that sometimes we end up with an async create getting reordered in front of a delete. I have found that if I disable async unlinks, then the test runs OK, so that sort of confirms the theory.

I've tried a couple of different approaches to address the problem, but none seem to have worked so far. Most recently, I added some code to ensure that only one async op per dentry is in flight at a time, but it still didn't help. I suspect we may be ending up with some dentries being d_drop'ed, so that we end up with two dentries for the same parent+name -- one hashed and one not.

#70 Updated by Patrick Donnelly 23 days ago

  • Target version changed from v15.0.0 to v16.0.0

Also available in: Atom PDF