Feature #24461

cephfs: improve file create performance buffering file create operations

Added by Patrick Donnelly over 1 year ago. Updated 7 months ago.

Status: New
Priority: High
Assignee:
Category: Performance/Resource Usage
Target version:
Start date: 06/08/2018
Due date:
% Done: 0%
Source: Development
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS): Client, MDS, kceph
Labels (FS): task(hard), task(intern)
Pull request ID:

Description

Serialized single-client file creation (e.g. untar/rsync) is an area where CephFS (and most distributed file systems) remains weak. Improving this is difficult without eliminating the round-trip to the MDS on each create. One possibility is to allocate a block of inodes to the client for creating new files; the client may then asynchronously commit the creation of those files. To do this, the client should have a new cap for directories (can we reuse CEPH_CAP_GWR?) which guarantees exclusive access to the directory.
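A minimal sketch of the client-side fast path this would enable (all types and helpers below are hypothetical; nothing like this exists in the client today):

    // Hypothetical client-side fast path: create a file entirely in the
    // client cache using a delegated inode number, and commit to the MDS
    // asynchronously. All names here are illustrative, not real CephFS APIs.
    #include <cstdint>
    #include <deque>
    #include <string>
    #include <vector>

    using inodeno_t = uint64_t;

    struct BufferedCreate {
      inodeno_t ino;     // drawn from the range the MDS delegated to us
      std::string name;  // dentry name within the parent directory
      uint32_t mode;     // initial attributes to replay to the MDS later
    };

    struct DirState {
      bool exclusive_caps = false;           // the proposed exclusive dir cap
      std::deque<inodeno_t> delegated_inos;  // inode numbers we may consume
      std::vector<BufferedCreate> pending;   // creates not yet sent to the MDS
    };

    // Returns true if the create was buffered locally; false means the
    // caller must fall back to the usual synchronous MDS round-trip.
    bool try_buffered_create(DirState& dir, const std::string& name,
                             uint32_t mode) {
      if (!dir.exclusive_caps || dir.delegated_inos.empty())
        return false;
      inodeno_t ino = dir.delegated_inos.front();
      dir.delegated_inos.pop_front();
      dir.pending.push_back({ino, name, mode});  // flushed asynchronously
      return true;
    }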


Related issues

Related to fs - Feature #18477: O_TMPFILE support in libcephfs New 01/10/2017
Related to fs - Feature #38951: implement buffered unlink in libcephfs New
Related to fs - Feature #39129: create mechanism to delegate ranges of inode numbers to client New

History

#1 Updated by Jeff Layton over 1 year ago

Neat. NFS and SMB have directory delegations/leases, but I haven't studied the topic in detail.

So the idea is to preallocate anonymous inodes and grant them to the client, and then the client can just fill them out and add links for them in a directory where it has the appropriate caps? Done correctly, this might also be helpful for O_TMPFILE-style anonymous creates, which would be a nice-to-have.

How will you handle the case where the client starts to fill out an anonymous inode before linking it into the directory, but then loses the GWR caps on the directory before it can link it in?

#2 Updated by Greg Farnum over 1 year ago

We've talked about this quite a lot in the past. I thought we had a tracker ticket for it, but on searching the most relevant thing I see is an old email archived at https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg27317.html

I think you'll find that file creates are just about the least scalable thing you can do on CephFS right now, though, so there is some easier ground to cover first. One obvious approach is to extend the current inode preallocation: it already allocates inodes per-client and has a fast path inside the MDS for handing them back. It'd be great if clients were aware of that preallocation and could create files without waiting for the MDS to talk back to them! The issue with this is twofold:

1) need to update the cap flushing protocol to deal with files newly created by the client
2) need to handle all the backtrace stuff normally performed by the MDS on file create (which still needs to happen, on either the client or the server)

There's also cleanup in case of a client failure, but we've already got a model for that in how we figure out real file sizes and such based on the max size.

#3 Updated by Patrick Donnelly over 1 year ago

  • Tracker changed from Bug to Feature

#4 Updated by Patrick Donnelly 8 months ago

  • Assignee set to Jeff Layton

#5 Updated by Jeff Layton 8 months ago

  • Target version changed from v14.0.0 to v15.0.0

#6 Updated by Jeff Layton 8 months ago

I think this is not really a single project, but a set of them. At a high level:

  • ensure that the MDS can hand out exclusive caps on directories, sufficient to allow dentries to be linked into it without involvement from the MDS (Patrick seemed to think it currently doesn't hand those caps out)
  • have MDS hand a range of inode numbers to a client when it has exclusive caps on the directory. Greg mentions above that we already preallocate ranges of inodes per-client, but the clients aren't aware of it. We'd need a way to inform the client of the range it is currently allowed to use (maybe extend MClientCaps?).
  • add a way for the client to send buffered creates in batches. We'll probably need a CEPH_MDS_OP_CREATE_BATCH call for this (at a minimum)? If the client is sending a batched create, we'll probably want to assume that it starts out with full exclusive caps for any inode that it is creating (to better handle the untar use-case). Still, we'll have to handle the situation where the client crashes before it can send back the caps, so we will need to send full inode info in the batched create call.
  • a batched UNLINK call could be useful too, and may be an easier place to get started here.
  • It might be nice to consider the O_TMPFILE case here as well. We could allow the client to create unlinked, but open inodes, and then link them in after the fact. Possibly consider a batched LINK call as well?

A lot of questions still need to be resolved:

  • what should trigger the client to flush the buffered creates back to the server? An fsync on the directory and syncfs, obviously, and maybe when we exhaust our preallocated inode number range?
  • We'll also need to think about how many creates we can reasonably allow the client to buffer at a time. 10? 100? 1000? Maybe we'll want to use a window that increases exponentially as the client exhausts its range of numbers (suitably capped of course). Ideally, the reply to a batched create call would inform the client of the most current inode number range(s).
  • Do we need separate batched MKDIR and MKNOD calls too, or can we get away with a single, generic CREATE call that sends the type?

#7 Updated by Patrick Donnelly 8 months ago

Jeff Layton wrote:

I think this is not really a single project, but a set of them. At a high level:

ensure that the MDS can hand out exclusive caps on directories, sufficient to allow dentries to be linked into it without involvement from the MDS (Patrick seemed to think it currently doesn't hand those caps out)

EXCL and WR. I don't think the MDS ever considers handing out WR to clients.

have MDS hand a range of inode numbers to a client when it has exclusive caps on the directory. Greg mentions above that we already preallocate ranges of inodes per-client, but the clients aren't aware of it. We'd need a way to inform the client of the range it is currently allowed to use (maybe extend MClientCaps?).
add a way for the client to send buffered creates in batches. We'll probably need a CEPH_MDS_OP_CREATE_BATCH call for this (at a minimum)? If the client is sending a batched create, we'll probably want to assume that it starts out with full exclusive caps for any inode that it is creating (to better handle the untar use-case). Still, we'll have to handle the situation where the client crashes before it can send back the caps, so we will need to send full inode info in the batched create call.

Let's also be clear about the advantages of batched create: we obtain the necessary locks once (!) and send fewer messages (anything else?). If the client has exclusive caps for the directory inode, the batched create should trivially obtain all the locks too.

I think it's reasonable that the batched file create should be per-directory to simplify locking. We should probably also require the client has WR|EXCL in order to use it.

Also, I think it should behave like openat, taking the directory inode, dentry name, and the inode #.

a batched UNLINK call could be useful too, and may be an easier place to get started here.

indeed!

It might be nice to consider the O_TMPFILE case here as well. We could allow the client to create unlinked, but open inodes, and then link them in after the fact. Possibly consider a batched LINK call as well?

Yes, that should be easy to fit in.

A lot of questions still need to be resolved:

what should trigger the client to flush the buffered creates back to the server? An fsync on the directory and syncfs, obviously, and maybe when we exhaust our preallocated inode number range?

Some tick in the client every ~5 seconds or preallocated inode pressure.

We'll also need to think about how many creates we can reasonably allow the client to buffer at a time. 10? 100? 1000? Maybe we'll want to use a window that increases exponentially as the client exhausts its range of numbers (suitably capped of course). Ideally, the reply to a batched create call would inform the client of the most current inode number range(s).

Whatever the limit is, we'll need to figure it out through experimentation. Obviously, there should be one but I think it should just be dictated by the MDS and not the client.

Do we need separate batched MKDIR and MKNOD calls too, or can we get away with a single, generic CREATE call that sends the type?

Single generic CREATE is attractive.

#8 Updated by Jeff Layton 8 months ago

Patrick Donnelly wrote:

EXCL and WR. I don't think the MDS ever considers handing out WR to clients.

This is something that would be very nice to have clearly documented somewhere. The caps system is great, but it's somewhat difficult to tell what caps you actually need for particular operations.

Let's also be clear about the advantages of batched create: we obtain the necessary locks once (!) and send fewer messages (anything else?). If the client has exclusive caps for the directory inode, the batched create should trivially obtain all the locks too.

I think it's reasonable that the batched file create should be per-directory to simplify locking. We should probably also require the client has WR|EXCL in order to use it.

Yes to both. I see no benefit to batching up creates across different parent directories.

Also, I think it should behave like openat, taking the directory inode, dentry name, and the inode #.

Not just the inode number. The MDS has to fully instantiate these inodes, so we'll need to send full inode info -- current size, c/mtime, layout info, etc.
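To make that concrete, here is one possible shape for a batched create payload (purely illustrative; neither CEPH_MDS_OP_CREATE_BATCH nor these structs exist yet). It carries the full inode info the MDS needs, plus a type field so a single generic CREATE can cover mkdir/mknod as discussed above:

    #include <cstdint>
    #include <string>
    #include <vector>

    // One entry in a hypothetical batched create request. The MDS must be
    // able to fully instantiate the inode from this, hence the complete
    // attribute set rather than just an inode number.
    struct batched_create_entry {
      uint64_t ino;      // from the client's delegated range
      std::string name;  // dentry name within the parent directory
      uint8_t type;      // file/dir/device, so one generic CREATE suffices
      uint32_t mode;
      uint32_t uid, gid;
      uint64_t size;
      int64_t ctime, mtime;  // seconds since epoch (simplified)
      // ...plus file layout, xattrs, etc.
    };

    // The batch is per-directory to simplify locking (see above).
    struct batched_create_request {
      uint64_t parent_dir_ino;
      std::vector<batched_create_entry> entries;
    };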

  • It might be nice to consider the O_TMPFILE case here as well. We could allow the client to create unlinked, but open inodes, and then link them in after the fact. Possibly consider a batched LINK call as well?

Yes, that should be easy to fit in.

Note that we'd also need to handle O_TMPFILE in the case where the client can't get exclusive caps on the parent dir. That may complicate things a bit, but I think we can still do it.

A lot of questions still need to be resolved:

  • what should trigger the client to flush the buffered creates back to the server? An fsync on the directory and syncfs, obviously, and maybe when we exhaust our preallocated inode number range?

Some tick in the client every ~5 seconds or preallocated inode pressure.

Makes sense. We'll probably need some tunables until we get a better feel for this.

  • We'll also need to think about how many creates we can reasonably allow the client to buffer at a time. 10? 100? 1000? Maybe we'll want to use a window that increases exponentially as the client exhausts its range of numbers (suitably capped of course). Ideally, the reply to a batched create call would inform the client of the most current inode number range(s).

Whatever the limit is, we'll need to figure it out through experimentation. Obviously, there should be one but I think it should just be dictated by the MDS and not the client.

Absolutely.

  • Do we need separate batched MKDIR and MKNOD calls too, or can we get away with a single, generic CREATE call that sends the type?

Single generic CREATE is attractive.

Agreed.

Another thing to consider:

What about nesting? If I mkdir -p foo/bar/baz, do we try to buffer them all, or do we require flushing a parent dir back to the MDS before new dentries can be created in it? I'm inclined to allow buffering everything, but that means we will always need to flush the parents back to the MDS before their children (which I think is OK).
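A sketch of that ordering constraint (hypothetical types; the point is just that a buffered parent must reach the MDS before its children):

    // Flush buffered creates for a directory, committing any still-buffered
    // ancestor first so the MDS never sees a child of a nonexistent parent.
    struct Dir {
      Dir* parent = nullptr;
      bool buffered = false;  // does this dir itself await creation on the MDS?
    };

    void send_pending_to_mds(Dir* d) { /* stand-in for the real flush */ }

    void flush_buffered(Dir* d) {
      if (d->parent && d->parent->buffered)
        flush_buffered(d->parent);  // parents before children
      send_pending_to_mds(d);
      d->buffered = false;
    }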

#9 Updated by Jeff Layton 8 months ago

Some notes about the preallocation piece:

The prealloc_inos interval set is tracked per-session in session_info_t. It gets encoded when that object is encoded, but that seems to only occur between MDSs. That structure doesn't ever seem to be transmitted to the client.

I think we're going to have to assign the clients a different interval_set, in any case. The prealloc_inos interval set is for use by the MDS and we don't have a way to coordinate access to it with the client. I think we will need to allocate and track a separate range on a per-Session basis for this.

The MDS can't shrink this set unilaterally, so I think we'll want to dribble them out to the client in small chunks (a hundred or so at a time, at most?).

To communicate this range to the client, we could use an MClientSession message. Maybe version (somehow) and extend struct ceph_mds_session_head with the interval_set currently granted to the client? The MDS could then push updates to the client in any CEPH_SESSION_* message, and maybe we could add new calls: one the client can use to request a new set, and maybe an MDS->client revoke call that tells it to flush everything and stop doing buffered creates.
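Roughly, the split I have in mind looks like this (std::set standing in for Ceph's interval_set<inodeno_t>; the field names mirror this discussion, not committed code):

    #include <cstdint>
    #include <set>

    using inodeno_t = uint64_t;

    // Per-session inode number tracking on the MDS. prealloc_inos already
    // exists and stays MDS-private; delegated_inos would be the new subset
    // the client has been told it may consume for buffered creates.
    struct session_info_sketch {
      std::set<inodeno_t> prealloc_inos;   // existing MDS-side preallocation
      std::set<inodeno_t> delegated_inos;  // new: handed to the client
    };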

#10 Updated by Patrick Donnelly 8 months ago

Jeff Layton wrote:

To communicate this range to the client, we could use an MClientSession message. Maybe version (somehow) and extend struct ceph_mds_session_head with the interval_set currently granted to the client? The MDS could then push updates to the client in any CEPH_SESSION_* message, and maybe we could add new calls: one the client can use to request a new set, and maybe an MDS->client revoke call that tells it to flush everything and stop doing buffered creates.

It may also be useful to communicate the currently allocated set in MClientReply. (It should not shrink or invalidate previous allocations!)

#11 Updated by Patrick Donnelly 8 months ago

#12 Updated by Jeff Layton 8 months ago

EXCL and WR. I don't think the MDS ever considers handing out WR to clients.

Just so I'm clear...why do we care about WR caps here? If I have Fx caps on a directory, then does Fw carry any extra significance?

#13 Updated by Patrick Donnelly 8 months ago

Jeff Layton wrote:

EXCL and WR. I don't think the MDS ever considers handing out WR to clients.

Just so I'm clear...why do we care about WR caps here? If I have Fx caps on a directory, then does Fw carry any extra significance?

Fx would indicate the directory contents can be cached by the client, yes? Perhaps the client doesn't necessarily get permission to write in some circumstances.

#14 Updated by Jeff Layton 8 months ago

Patrick Donnelly wrote:

Fx would indicate the directory contents can be cached by the client, yes? Perhaps the client doesn't necessarily get permission to write in some circumstances.

I think so? Today I think Fx on a directory is functionally equivalent to Fs. It seems like it ought to be ok to allow the client to buffer up creates if it has Fx caps as no one should be able to get Fs until Fx is returned.

That said, I don't quite understand the distinction between Fx and Fw on normal files, so I could be wrong here.

#15 Updated by Patrick Donnelly 8 months ago

Jeff Layton wrote:

Patrick Donnelly wrote:

Fx would indicate the directory contents can be cached by the client, yes? Perhaps the client doesn't necessarily get permission to write in some circumstances.

I think so? Today I think Fx on a directory is functionally equivalent to Fs. It seems like it ought to be ok to allow the client to buffer up creates if it has Fx caps as no one should be able to get Fs until Fx is returned.

Okay, here's how I think it should work but it may not be how it actually works:

Fsrc: can readdir and cache results (dentries); multiple clients may have these caps
Fsr: can readdir; cannot cache results
Fsw: doesn't make sense
Fxrc: same as Fsrc; no reason to have Fx caps without w
(new) Fxw(b) == can create new dentries buffered

Now, I haven't learned exactly how a client/MDS knows that a client's view of a directory is "complete". For Fxwb, that's important since the client doesn't want to clobber files. However, that doesn't stop the client from getting these caps before the directory is complete. It just needs to go through the motions of loading all the dentries OR getattr()ing the dentries it wants to create ahead of time (important when the directory is large!).

That said, I don't quite understand the distinction between Fx and Fw on normal files, so I could be wrong here.

I think we always give Fxwb and not just Fxw. So Fxwb would indicate the client can buffer writes. Fswb is impossible (unless we're doing LAZYIO but that's indicated with Fl instead).

#16 Updated by Zheng Yan 8 months ago

Jeff Layton wrote:

Patrick Donnelly wrote:

Fx would indicate the directory contents can be cached by the client, yes? Perhaps the client doesn't necessarily get permission to write in some circumstances.

I think so? Today I think Fx on a directory is functionally equivalent to Fs. It seems like it ought to be ok to allow the client to buffer up creates if it has Fx caps as no one should be able to get Fs until Fx is returned.

That said, I don't quite understand the distinction between Fx and Fw on normal files, so I could be wrong here.

For a directory inode, the current MDS only issues Fsx caps (at most) to the client. It never issues Frwcb caps to the client.

Fsx caps (x implies s) on a directory are different from Fs caps when the client creates/unlinks a file. If a client only has Fs caps, it has to release the Fs caps to the MDS when creating/unlinking a file, and losing Fs caps means the dir lease becomes invalid. If a client has Fsx caps, it does not need to release Fs/Fx caps when creating/unlinking a file, so the dir lease is still valid when the create/unlink finishes.

#17 Updated by Jeff Layton 8 months ago

Zheng Yan wrote:

For a directory inode, the current MDS only issues Fsx caps (at most) to the client. It never issues Frwcb caps to the client.

Fsx caps (x implies s) on a directory are different from Fs caps when the client creates/unlinks a file. If a client only has Fs caps, it has to release the Fs caps to the MDS when creating/unlinking a file, and losing Fs caps means the dir lease becomes invalid. If a client has Fsx caps, it does not need to release Fs/Fx caps when creating/unlinking a file, so the dir lease is still valid when the create/unlink finishes.

Thanks Zheng. Yes, this interaction between dentry leases and caps is what I'm trying to sort out here, but this seems to be contrary to what Sage and Greg were saying on ceph-devel.

Maybe we need a concrete example:

Suppose we have a directory (/foo) with a file in it (/foo/bar). We have Fs caps on /foo and a dentry lease on /foo/bar. Another client then creates a new file in that directory (/foo/baz). At that point the MDS revokes Fs caps from the first client. Does the dentry lease on /foo/bar become invalid at that point?

If so, why? I thought the whole point of dentry leases was so that you could make changes to a parent dir without invalidating every dentry under it.

#18 Updated by Jeff Layton 8 months ago

Patrick Donnelly wrote:

Okay, here's how I think it should work but it may not be how it actually works:

Fsrc: can readdir and cache results (dentries); multiple clients may have these caps
Fsr: can readdir; cannot cache results
Fsw: doesn't make sense
Fxrc: same as Fsrc; no reason to have Fx caps without w
(new) Fxw(b) == can create new dentries buffered

At least for directories, there seems to be no functional difference between Fs and Fr caps -- ditto Fx and Fw. IOW, the Fr/w flags on a dir seem to be entirely superfluous. Maybe we should eschew the r/w flags here since they just add confusion?

Now, I haven't learned exactly how a client/MDS knows that a client's view of a directory is "complete". For Fxwb, that's important since the client doesn't want to clobber files. However, that doesn't stop the client from getting these caps before the directory is complete. It just needs to go through the motions of loading all the dentries OR getattr()ing the dentries it wants to create ahead of time (important when the directory is large!).

The client is what determines "completeness" on a directory, and it sets that flag after doing a (complete) readdir or when an inodestat indicates that a directory is empty.

In order to allow a buffered create, the client will need to ensure completeness on the directory, or it will need to have a valid negative dentry lease for the file being created (i.e. failed lookup where the MDS has granted a dentry lease on the result).

#19 Updated by Jeff Layton 8 months ago

Patrick Donnelly wrote:

Okay, here's how I think it should work but it may not be how it actually works:

Fsrc: can readdir and cache results (dentries); multiple clients may have these caps
Fsr: can readdir; cannot cache results
Fsw: doesn't make sense
Fxrc: same as Fsrc; no reason to have Fx caps without w
(new) Fxw(b) == can create new dentries buffered

Today, I'm fairly certain that the MDS only hands out Fsx caps (at most) on directories and there isn't a lot of difference between Fs and Fx on a directory in the current code (given that we can't buffer up any sort of directory level change currently).

Given that r/w caps don't really have much meaning on a directory, I move that we don't hand them out, period.

That just leaves Fbc -- given that there isn't a lot of difference between Fx and Fs, what benefit will we derive from adding Fbc into the mix here? In what situations would we give a client Fs, but not Fsc? Ditto for Fx and Fxb?

#20 Updated by Greg Farnum 7 months ago

Given that we already use non-cap flags, and directories are special anyway, I'm not sure extending the cap language to cover this is the way to go.
I mean, it might be! But I think if a directory is complete, and the client has Fx on it, you already know what you need in order to buffer creates on it. And adding in other caps in that case is just a recipe for confusion unless we figure out a need for more granularity.

#21 Updated by Jeff Layton 7 months ago

That's sort of my point here (though I didn't put it quite as succinctly). I don't think adding more cap flags really helps us here. If we've granted Fx on a directory, then that seems like it ought to be sufficient to allow the client to buffer creates.

That said, now that I've read over Sage's comment, maybe we should just add Fb, pro forma, and always grant and revoke it on directories together. Then if we ever did want to grant exclusive caps on the F metadata while denying buffered changes to a dir, we could start issuing them separately.

#22 Updated by Greg Farnum 7 months ago

Jeff Layton wrote:

That's sort of my point here (though I didn't put it quite as succinctly). I don't think adding more cap flags really helps us here. If we've granted Fx on a directory, then that seems like it ought to be sufficient to allow the client to buffer creates.

Well, as someone noted, you also need to make sure you aren't creating a dentry that already exists but that the client doesn't have cached. I believe that is separate from granting Fx caps.

That said, now that I've read over Sage's comment, maybe we should just add Fb, pro forma, and always grant and revoke it on directories together. Then if we ever did want to grant exclusive caps on the F metadata while denying buffered changes to a dir, we could start issuing them separately.

Should probably draw out the cases here. My concern with adding Fb and a meaning to it is, how does that interact with the COMPLETE+ORDERED flags that the client already maintains now? Can you have Fb without those flags being set? Does one necessarily imply the other once they're separate? What happens if they somehow disagree, like because the client elects to trim some dentries from cache but still has the Fb cap from the MDS?

#23 Updated by Jeff Layton 7 months ago

Greg Farnum wrote:

Well, as someone noted, you also need to make sure you aren't creating a dentry that already exists but that the client doesn't have cached. I believe that is separate from granting Fx caps.

Yep, separate thing, but necessary.

Should probably draw out the cases here. My concern with adding Fb and a meaning to it is, how does that interact with the COMPLETE+ORDERED flags that the client already maintains now? Can you have Fb without those flags being set? Does one necessarily imply the other once they're separate? What happens if they somehow disagree, like because the client elects to trim some dentries from cache but still has the Fb cap from the MDS?

In order to buffer creates the client will need:

  1. an unused ino_t from a range delegated by the MDS
  2. Fb caps on the parent directory
  3. either I_COMPLETE on the parent directory or a lease on a negative dentry with the same name

#24 Updated by Zheng Yan 7 months ago

Jeff Layton wrote:

Zheng Yan wrote:

For a directory inode, the current MDS only issues Fsx caps (at most) to the client. It never issues Frwcb caps to the client.

Fsx caps (x implies s) on a directory are different from Fs caps when the client creates/unlinks a file. If a client only has Fs caps, it has to release the Fs caps to the MDS when creating/unlinking a file, and losing Fs caps means the dir lease becomes invalid. If a client has Fsx caps, it does not need to release Fs/Fx caps when creating/unlinking a file, so the dir lease is still valid when the create/unlink finishes.

Thanks Zheng. Yes, this interaction between dentry leases and caps is what I'm trying to sort out here, but this seems to be contrary to what Sage and Greg were saying on ceph-devel.

Maybe we need a concrete example:

Suppose we have a directory (/foo) with a file in it (/foo/bar). We have Fs caps on /foo and a dentry lease on /foo/bar. Another client then creates a new file in that directory (/foo/baz). At that point the MDS revokes Fs caps from the first client. Does the dentry lease on /foo/bar become invalid at that point?

The lease on /foo/bar is still valid after the dir loses Fs caps.

If so, why? I thought the whole point of dentry leases was so that you could make changes to a parent dir without invalidating every dentry under it.

#25 Updated by Zheng Yan 7 months ago

Jeff Layton wrote:

Greg Farnum wrote:

Well, as someone noted, you also need to make sure you aren't creating a dentry that already exists but that the client doesn't have cached. I believe that is separate from granting Fx caps.

Yep, separate thing, but necessary.

Should probably draw out the cases here. My concern with adding Fb and a meaning to it is, how does that interact with the COMPLETE+ORDERED flags that the client already maintains now? Can you have Fb without those flags being set? Does one necessarily imply the other once they're separate? What happens if they somehow disagree, like because the client elects to trim some dentries from cache but still has the Fb cap from the MDS?

In order to buffer creates the client will need:

  1. an unused ino_t from a range delegated by the MDS
  2. Fb caps on the parent directory
  3. either I_COMPLETE on the parent directory or a lease on a negative dentry with the same name

good point.

#26 Updated by Jeff Layton 7 months ago

Some notes and status:

I've been going over the code and playing with cephfs-shell to create different cap handling scenarios. I have patches that teach the MDS to hand out Fb caps on directories (particularly newly-created ones), and a client patch that helps prevent it from giving up Fb too quickly.

Simple test: with cephfs-shell create a directory, and then from a different cephfs-shell (different client) make a directory inside that directory. The first client will get full caps for the first directory created (caps=pAsxLsXsxFsxb). The second mkdir involves 3 separate (and synchronous) cap revoke messages to the first client. First for Ax caps, then for Fx, and finally for Fs.

My guess is that the first is for authentication on the parent, the second to allow a change to the directory and the last one to ensure that the client sees the changes. I wonder if we could improve performance here by consolidating some of those cap revokes (particularly Fsx)?

On another note, I do have at least one minor concern about using Fb to indicate that buffered creates are allowed. We currently use Fs to indicate that the client is allowed to cache directory entries. Shouldn't that be based on Fc instead? Otherwise we are sort of breaking the semantic parity between Fc and Fb on directories.

#27 Updated by Zheng Yan 7 months ago

Jeff Layton wrote:

Some notes and status:

I've been going over the code and playing with cephfs-shell to create different cap handling scenarios. I have patches that teach the MDS to hand out Fb caps on directories (particularly newly-created ones), and a client patch that helps prevent it from giving up Fb too quickly.

Simple test: with cephfs-shell create a directory, and then from a different cephfs-shell (different client) make a directory inside that directory. The first client will get full caps for the first directory created (caps=pAsxLsXsxFsxb). The second mkdir involves 3 separate (and synchronous) cap revoke messages to the first client. First for Ax caps, then for Fx, and finally for Fs.

My guess is that the first is for authentication on the parent, the second to allow a change to the directory and the last one to ensure that the client sees the changes. I wonder if we could improve performance here by consolidating some of those cap revokes (particularly Fsx)?

Yes, but these caps are controlled by different locks.

On another note, I do have at least one minor concern about using Fb to indicate that buffered creates are allowed. We currently use Fs to indicate that the client is allowed to cache directory entries. Shouldn't that be based on Fc instead? Otherwise we are sort of breaking the semantic parity between Fc and Fb on directories.

I think Fx is enough for buffered creates. Fx implies Fscbrw; Fs implies Fcr. When the filelock is xlocked (for truncate/setlayout), only Fcb is allowed. For a directory inode, there is no operation that requires xlocking the filelock.
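Read as bitmask logic, the implication is roughly the following (a simplified sketch with made-up bit values; the MDS really derives issued caps from its lock states, not a flat table):

    // Hypothetical F-cap bits, for illustration only.
    enum : unsigned {
      F_S = 1, F_X = 2, F_R = 4, F_C = 8, F_W = 16, F_B = 32,
    };

    // Fx implies Fscbrw; Fs implies Fcr.
    unsigned effective_fcaps(unsigned issued) {
      unsigned eff = issued;
      if (issued & F_X) eff |= F_S | F_C | F_B | F_R | F_W;
      if (issued & F_S) eff |= F_C | F_R;
      return eff;
    }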

#28 Updated by Jeff Layton 7 months ago

Zheng Yan wrote:

On another note, I do have at least one minor concern about using Fb to indicate that buffered creates are allowed. We currently use Fs to indicate that the client is allowed to cache directory entries. Shouldn't that be based on Fc instead? Otherwise we are sort of breaking the semantic parity between Fc and Fb on directories.

I think Fx is enough for buffered creates. Fx implies Fscbrw; Fs implies Fcr. When the filelock is xlocked (for truncate/setlayout), only Fcb is allowed. For a directory inode, there is no operation that requires xlocking the filelock.

That's what I would think too.

Sage made the point on the mailing list that Fsx pertains to the metadata and Fcbrw is all about the data. That may be correct, but as a practical matter it makes no difference. You can't change the data without altering the metadata in some fashion (mtime/iversion at the very least), so just working with Fsx on directories is simpler.

#29 Updated by Patrick Donnelly 7 months ago

Jeff Layton wrote:

Zheng Yan wrote:

On another note, I do have at least one minor concern about using Fb to indicate that buffered creates are allowed. We currently use Fs to indicate that the client is allowed to cache directory entries. Shouldn't that be based on Fc instead? Otherwise we are sort of breaking the semantic parity between Fc and Fb on directories.

I think Fx is enough for buffered creates. Fx implies Fscbrw; Fs implies Fcr. When the filelock is xlocked (for truncate/setlayout), only Fcb is allowed. For a directory inode, there is no operation that requires xlocking the filelock.

That's what I would think too.

Sage made the point on the mailing list that Fsx pertains to the metadata and Fcbrw is all about the data. That may be correct, but as a practical matter it makes no difference. You can't change the data without altering the metadata in some fashion (mtime/iversion at the very least), so just working with Fsx on directories is simpler.

Agreed.

#30 Updated by Jeff Layton 7 months ago

Great, so let's do a minor revision on the rules above. In order to buffer creates, the client will need (see the sketch at the end of this comment):

  1. an unused ino_t from a range delegated by the MDS
  2. Fx caps on the parent directory
  3. either I_COMPLETE on the parent directory or a valid negative dentry with the same name

I'll drop the cap handling patches on directories that I've been playing with, as I think we probably already have the correct behavior there with Fsx today.
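As a sketch, the three rules boil down to a predicate like this on the client (hypothetical helper names; I_COMPLETE and negative dentry leases are real client concepts, but this code is illustrative only):

    #include <string>

    // Stand-in for the relevant state the client tracks on a directory inode.
    struct DirInode {
      bool has_delegated_ino;  // rule 1: an unused ino from the MDS range
      bool has_fx_caps;        // rule 2: Fx caps on this directory
      bool i_complete;         // rule 3a: directory contents fully cached
      // rule 3b: a valid negative dentry for this name (stub here)
      bool neg_dentry(const std::string& name) const { return false; }
    };

    bool can_buffer_create(const DirInode& d, const std::string& name) {
      return d.has_delegated_ino && d.has_fx_caps &&
             (d.i_complete || d.neg_dentry(name));
    }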

#31 Updated by Jeff Layton 7 months ago

I have a couple of starter patches that add a new delegated_inos interval_set to session_info_t.

Questions at this point:

  1. How many to hand the client at a time? The default for mds_client_prealloc_inos is 1000. What I think we'll probably want to do is to move sets of 10-100 at a time from prealloc_inos into delegated_inos, and hand those off to the client. More than that would probably allow the client to buffer up too much at a time. We may need a tunable for this quantity until we get a feel for how best to do this.
  2. What messages should contain updated delegated_inos interval sets? I had originally thought MClientSession, but these things won't be useful until you have Fx caps on a directory. Maybe we should do this via MClientCaps and MClientReply? The MDS could allocate a set to the client the first time it hands out Fx caps on a directory. The MDS would then update that set periodically as the client issues creates that shrink the set.
  3. Should we allow pushing out buffered inodes using the traditional CREATE/MKDIR MDS ops? I think we probably should, as that's a nice interim step before we have to implement batched creation. It should also speed up file creation on clients as long as it isn't doing it too quickly.
  4. Do we also need some mechanism to return unused inos? My thinking is not at first, and that we'd just release them when a session is torn down.
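For question 1, the handout could look something like this on the MDS side (std::set again standing in for interval_set<inodeno_t>; the chunk size would be the tunable mentioned above):

    #include <cstdint>
    #include <set>

    using inodeno_t = uint64_t;

    // Move a small chunk of the session's preallocated inode numbers into
    // the set delegated to the client. Delegated inos stay assigned until
    // the session is torn down (per question 4).
    void delegate_ino_chunk(std::set<inodeno_t>& prealloc_inos,
                            std::set<inodeno_t>& delegated_inos,
                            size_t chunk = 100) {
      for (size_t i = 0; i < chunk && !prealloc_inos.empty(); ++i) {
        auto it = prealloc_inos.begin();
        delegated_inos.insert(*it);
        prealloc_inos.erase(it);
      }
    }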

...and now that I've gone over this list, I think I probably ought to start by teaching the client how to buffer unlinks when it has Fx caps on the containing directory. That doesn't require a delegated inode range, and should be simpler to implement, particularly if we start by just having it dribble out the unlinks with CEPH_MDS_OP_UNLINK/CEPH_MDS_OP_RMDIR.

#32 Updated by Zheng Yan 7 months ago

Jeff Layton wrote:

I have a couple of starter patches that add a new delegated_inos interval_set to session_info_t.

Questions at this point:

  1. How many to hand the client at a time? The default for mds_client_prealloc_inos is 1000. What I think we'll probably want to do is to move sets of 10-100 at a time from prealloc_inos into delegated_inos, and hand those off to the client. More than that would probably allow the client to buffer up too much at a time. We may need a tunable for this quantity until we get a feel for how best to do this.

agree

  2. What messages should contain updated delegated_inos interval sets? I had originally thought MClientSession, but these things won't be useful until you have Fx caps on a directory. Maybe we should do this via MClientCaps and MClientReply? The MDS could allocate a set to the client the first time it hands out Fx caps on a directory. The MDS would then update that set periodically as the client issues creates that shrink the set.

Maybe MClientReply of requests that allocated new inodes. delegated_inos should be per-session, not per-directory.

  3. Should we allow pushing out buffered inodes using the traditional CREATE/MKDIR MDS ops? I think we probably should, as that's a nice interim step before we have to implement batched creation. It should also speed up file creation on clients as long as it isn't doing it too quickly.

agree

  4. Do we also need some mechanism to return unused inos? My thinking is not at first, and that we'd just release them when a session is torn down.

agree

...and now that I've gone over this list, I think I probably ought to start by teaching the client how to buffer unlinks when it has Fx caps on the containing directory. That doesn't require a delegated inode range, and should be simpler to implement, particularly if we start by just having it dribble out the unlinks with CEPH_MDS_OP_UNLINK/CEPH_MDS_OP_RMDIR.

#33 Updated by Jeff Layton 7 months ago

  • Priority changed from Urgent to High

#34 Updated by Jeff Layton 7 months ago

  • Related to Feature #38951: implement buffered unlink in libcephfs added

#35 Updated by Jeff Layton 7 months ago

  • Subject changed from cephfs: improve file create performance by allocating inodes to clients to cephfs: improve file create performance buffering file create operations

#36 Updated by Jeff Layton 7 months ago

  • Related to Feature #39129: create mechanism to delegate ranges of inode numbers to client added
