cephfs: improve file create performance by allocating inodes to clients
Serialized single-client file creation (e.g. untar/rsync) is an area where CephFS (like most distributed file systems) remains weak. Improving this is difficult without removing the round-trip to the MDS on every create. One way to do that is to allocate a block of inodes to the client to create new files with. The client may then asynchronously finalize the creation of those files. To do this, the client would need a new cap for directories (can we reuse CEPH_CAP_GWR?) that guarantees exclusive access to the directory.
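A minimal sketch of the client side of this idea, assuming a delegated inode range and an exclusive directory cap. All names here (InodeRange, Client, exclusive_dirs, flush) are hypothetical and only illustrate the flow; none of this is existing CephFS code.

```python
from dataclasses import dataclass, field


@dataclass
class InodeRange:
    """A contiguous block of inode numbers delegated by the MDS (hypothetical)."""
    start: int
    count: int
    next_free: int = 0

    def take(self) -> int:
        # Hand out the next unused inode number from the delegated block.
        if self.next_free >= self.count:
            raise RuntimeError("delegated inode range exhausted")
        ino = self.start + self.next_free
        self.next_free += 1
        return ino


@dataclass
class Client:
    inos: InodeRange
    # Directories on which this client holds the (assumed) exclusive cap.
    exclusive_dirs: set = field(default_factory=set)
    # Creates performed locally but not yet sent to the MDS.
    pending: list = field(default_factory=list)

    def create(self, dir_path: str, name: str) -> int:
        # Without exclusive access the client cannot know the name is free,
        # so the caller would have to fall back to a synchronous MDS request.
        if dir_path not in self.exclusive_dirs:
            raise PermissionError("no exclusive cap on " + dir_path)
        ino = self.inos.take()
        self.pending.append((dir_path, name, ino))
        return ino  # returns without an MDS round-trip

    def flush(self, send) -> int:
        """Pass queued creates to `send` in order; return how many were sent."""
        sent = 0
        while self.pending:
            send(self.pending.pop(0))
            sent += 1
        return sent
```

In this sketch a create costs only a local allocation and a queue append; the MDS learns about the new files when flush() runs, for example before the directory cap is released.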
#1 Updated by Jeff Layton 6 months ago
Neat. NFS and SMB have directory delegations/leases, but I haven't studied the topic in detail.
So the idea is to preallocate anonymous inodes and grant them to the client, so that the client can just fill them out and add links for them in a directory where it has the appropriate caps? Done correctly, this might also help with O_TMPFILE-style anonymous creates, which would be a nice-to-have.
How will you handle the case where the client starts to fill out an anonymous inode, but then loses the GWR caps on the directory before it can link the inode in?
#2 Updated by Greg Farnum 6 months ago
We've talked about this quite a lot in the past. I thought we had a tracker ticket for it, but on searching the most relevant thing I see is an old email archived at https://firstname.lastname@example.org/msg27317.html
I think you'll find that file creates are just about the least scalable thing you can do on CephFS right now, though, so there is some easier ground. One obvious approach is to extend the current inode preallocation: it already allocates inodes per-client and has a fast path inside the MDS for handing them back. It'd be great if clients were aware of that preallocation and could create files without waiting for the MDS to talk back to them! The issue with this is two-fold:
1) need to update the cap flushing protocol to deal with files newly created by the client
2) need to handle all the backtrace stuff normally performed by the MDS on file create (which still needs to happen, on either the client or the server)
There's also cleanup in case of a client failure, but we've already got a model for that in how we figure out real file sizes and things based on max size.
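To make the bookkeeping concrete, here is a rough sketch of what the MDS side might track if clients knew about their preallocated inodes. It is an assumption-laden toy: the Mds class and its methods are hypothetical, and reclaiming every unflushed inode on client failure is just one possible policy, not how the MDS recovers state today.

```python
class Mds:
    """Toy model of per-client inode delegation (hypothetical, not Ceph code)."""

    def __init__(self, first_ino: int):
        self.next_ino = first_ino
        self.delegated = {}  # client id -> inos granted but not yet committed
        self.committed = {}  # ino -> (dir_path, name) for creates the MDS has seen
        self.free = []       # reclaimed inos available for reuse

    def delegate(self, client: str, count: int) -> range:
        # Grant a contiguous block of inode numbers to one client.
        block = range(self.next_ino, self.next_ino + count)
        self.next_ino += count
        self.delegated.setdefault(client, set()).update(block)
        return block

    def commit_create(self, client: str, dir_path: str, name: str, ino: int) -> None:
        # Accept a create flushed by the client, but only for an ino it was given.
        if ino not in self.delegated.get(client, ()):
            raise ValueError("ino was not delegated to this client")
        self.delegated[client].remove(ino)
        self.committed[ino] = (dir_path, name)

    def client_failed(self, client: str) -> list:
        # Assumed policy: anything the client never flushed is reclaimed.
        reclaimed = sorted(self.delegated.pop(client, set()))
        self.free.extend(reclaimed)
        return reclaimed
```

The sketch leaves out both problems listed above: it says nothing about how the flush is carried in the cap protocol, and it does not write backtraces for the new files.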