Feature #39129

create mechanism to delegate ranges of inode numbers to client

Added by Jeff Layton 5 months ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS): MDS
Labels (FS):
Pull request ID:

Description

Create a mechanism by which we can hand out ranges of inode numbers to MDS clients. The clients can then use those to fully instantiate inodes in memory and then flush them back to the server.

We already allocate a range of inode numbers for each client in the prealloc_inos interval set inside the MDS. What may be easiest is to just hand out smaller pieces of that range to the client via some new mechanism.
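As a rough illustration of handing out a smaller piece of the preallocated range, here is a minimal sketch. It assumes a single contiguous range rather than a full interval set, and `struct ino_range` and `delegate_inos()` are hypothetical names, not existing MDS code.

```c
#include <stdint.h>

/* Hypothetical half-open range of inode numbers: [start, start + len). */
struct ino_range {
    uint64_t start;
    uint64_t len;
};

/*
 * Carve up to `want` inode numbers off the front of a session's
 * preallocated range and return them as the piece to delegate.
 * The returned range has len == 0 when nothing is left to hand out.
 */
static struct ino_range delegate_inos(struct ino_range *prealloc, uint64_t want)
{
    struct ino_range out = { prealloc->start, 0 };
    uint64_t n = want < prealloc->len ? want : prealloc->len;

    out.len = n;
    prealloc->start += n;   /* the delegated numbers leave the prealloc pool */
    prealloc->len -= n;
    return out;
}
```

A caller would send the returned range to the client and keep the shrunken `prealloc` for later requests.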

I'm not sure if we need new messages for this, or whether we could extend some existing messages to contain the set. We probably would want these in MClientReply (so we could replenish the client when we are responding to a create). Maybe we could update the client via other mechanisms too? I'm not sure what would work best here yet.

For now, the client can just ignore these ranges.


Related issues

Related to fs - Feature #24461: cephfs: improve file create performance buffering file create operations New 06/08/2018
Related to fs - Feature #38951: implement buffered unlink in libcephfs New

History

#1 Updated by Jeff Layton 5 months ago

  • Related to Feature #24461: cephfs: improve file create performance buffering file create operations added

#2 Updated by Patrick Donnelly 4 months ago

  • Assignee set to Jeff Layton
  • Target version set to v15.0.0
  • Start date deleted (04/05/2019)

#3 Updated by Jeff Layton 4 months ago

  • Related to Feature #38951: implement buffered unlink in libcephfs added

#4 Updated by Jeff Layton 3 months ago

We may not need this after all. The kernel client at least doesn't care a lot about the inode number. We can do pretty much anything we want with the inode in memory, and leave inode->i_ino set to 0 initially. When we get the CREATE reply, we can then fill in the inode number.

This does mean that we'll have to wait on the CREATE reply in order to do a stat(), or a statx() with STATX_INO, but that's probably fine. We'll also need to wait on that reply before we can flush dirty inode data to the OSDs, as we need to know the inode number in order to write to the objects. That said, we should be fine to write to the pagecache until that point.

#5 Updated by Patrick Donnelly 3 months ago

Jeff Layton wrote:

We may not need this after all. The kernel client at least doesn't care a lot about the inode number. We can do pretty much anything we want with the inode in memory, and leave inode->i_ino set to 0 initially. When we get the CREATE reply, we can then fill in the inode number.

This does mean that we'll have to wait on the CREATE reply in order to do a stat(), or a statx() with STATX_INO, but that's probably fine. We'll also need to wait on that reply before we can flush dirty inode data to the OSDs, as we need to know the inode number in order to write to the objects. That said, we should be fine to write to the pagecache until that point.

That may actually be a better approach, since the MDS then doesn't need to clean up after us if the client fails.

#6 Updated by Jeff Layton 2 months ago

I think we are going to need this after all. If we don't do this, we'll have to delay writing to newly-created files until the create response comes in. We won't know the inode number until then and therefore we won't know what objects to write.
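To make the dependency concrete: a file's data objects are named after its inode number, so no object name can be built before the number is known. A minimal sketch, assuming the usual "<ino in hex>.<object index as 8 hex digits>" naming; `data_object_name()` is a hypothetical helper, not a function from the client code.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format the name of the data object holding object index `objno`
 * of inode `ino` into `buf`. Returns the snprintf() result.
 * Assumed layout: hex inode number, a dot, 8-digit hex object index.
 */
static int data_object_name(char *buf, size_t len, uint64_t ino, uint64_t objno)
{
    return snprintf(buf, len, "%" PRIx64 ".%08" PRIx64, ino, objno);
}
```

With a placeholder inode number of 0, every new file would map to the same object names, which is why writeback has to wait for a real number.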

So I think we're back to figuring out which calls should carry inode range updates. Mostly we'll want to hand these out when the MDS grants create caps on a directory, so maybe in MClientReply and MClientCaps messages?

I still don't have a good enough feel for the MDS session management code, so I'm happy to take advice here.

#7 Updated by Jeff Layton about 2 months ago

Going over the userland code today to see what's there and what can be reused. Some notes:

struct ceph_mds_request_head has this field:

   __le64 ino;                    /* use this ino for openc, mkdir, mknod, 
                                     etc. (if replaying) */                

Can we use this field to send the inode number for new creates as well? We will need the MDS to recognize when we're specifying the inode number on a create, so we may need some sort of capability bit or something so that it knows to do that.

Alternately, if older clients always set that field to 0 when it's not used, then we may be able to rely on that, and use 0 as an "allocate me an inode number during the create" indicator.
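A sketch of how the MDS side might interpret the field under that second scheme. Everything here is hypothetical (`struct ino_range`, `enum ino_choice`, `classify_create_ino()`), and rejecting a nonzero number that was never delegated to the session is an added assumption, not something decided above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open range of delegated inode numbers: [start, start + len). */
struct ino_range {
    uint64_t start;
    uint64_t len;
};

/* Hypothetical outcomes for the ino field of a create request. */
enum ino_choice {
    INO_MDS_ALLOCATES,   /* field is 0: the MDS picks the number */
    INO_CLIENT_SUPPLIED, /* nonzero and acceptable: use the client's number */
    INO_REJECT           /* nonzero but not delegated to this session */
};

/* True if `ino` falls inside any of the `n` ranges in `r`. */
static bool ino_is_delegated(const struct ino_range *r, size_t n, uint64_t ino)
{
    for (size_t i = 0; i < n; i++) {
        if (ino >= r[i].start && ino - r[i].start < r[i].len)
            return true;
    }
    return false;
}

static enum ino_choice classify_create_ino(uint64_t head_ino, bool replaying,
                                           const struct ino_range *r, size_t n)
{
    if (head_ino == 0)
        return INO_MDS_ALLOCATES;
    if (replaying)
        return INO_CLIENT_SUPPLIED; /* the field's documented replay use */
    return ino_is_delegated(r, n, head_ino) ? INO_CLIENT_SUPPLIED : INO_REJECT;
}
```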

Adding an interval_set to MClientCaps and MClientReply does seem like the most efficient way to hand the set to the client since we'll presumably need the MDS to regularly replenish it. But...it's a little weird in that the interval_set of inode numbers will be a client-wide property and those calls generally deal with a specific inode.

I'm still wondering if a new CEPH_SESSION_DELEGATED_INOS op would be a better choice, with the MDS just pushing those out to clients that hold at least one CEPH_CAP_DIR_CREATE cap when the client's set drops below a particular threshold. If the client can't get one because we're temporarily out, then we can always block and have it do a synchronous create.
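The threshold and fallback logic could look something like the sketch below. It models the delegated set as a single contiguous range for brevity, and `struct ino_pool`, `ino_pool_take()` and `ino_pool_is_low()` are hypothetical names; the low-water value is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pool of delegated inode numbers (one contiguous range). */
struct ino_pool {
    uint64_t next;      /* next unused inode number */
    uint64_t remaining; /* how many numbers are left */
};

/*
 * Take one inode number for a new create. Returns false when the pool
 * is empty, in which case the caller would fall back to a synchronous
 * create and leave *ino untouched.
 */
static bool ino_pool_take(struct ino_pool *p, uint64_t *ino)
{
    if (p->remaining == 0)
        return false;
    *ino = p->next++;
    p->remaining--;
    return true;
}

/* True once the pool has dropped below the replenish threshold. */
static bool ino_pool_is_low(const struct ino_pool *p, uint64_t low_water)
{
    return p->remaining < low_water;
}
```

Whichever side tracks the count (the proposal above has the MDS doing it) would check `ino_pool_is_low()` after each create and trigger a replenish when it returns true.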

Also, will we ever need to worry about recalling the set that we've handed out? I'm going to assume (for now) that the answer is no, and that we'll just release them when the session is torn down.
