Feature #39129

create mechanism to delegate ranges of inode numbers to client

Added by Jeff Layton 7 months ago. Updated 28 days ago.

Status: New
Priority: Normal
% Done: 0%
Component(FS): MDS

Description

Create a mechanism by which we can hand out ranges of inode numbers to MDS clients. The clients can then use those to fully instantiate inodes in memory and then flush them back to the server.

We already allocate a range of inode numbers for each client in the prealloc_inos interval set inside the MDS. What may be easiest is to just hand out smaller pieces of that range to the client via some new mechanism.
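For illustration, here's a standalone sketch of carving a small piece off a larger preallocated range. ino_range and delegate_inos are made-up names for this sketch, not the MDS's actual types; the real code would operate on the prealloc_inos interval_set:

    #include <algorithm>
    #include <cstdint>
    #include <deque>
    #include <iostream>

    // Hypothetical stand-in for one (start, length) extent of inode numbers.
    struct ino_range {
        uint64_t start;
        uint64_t len;
    };

    // Carve up to `want` inodes off the front of `pool` (assumes pool is
    // non-empty). In the real MDS, `pool` would correspond to the session's
    // prealloc_inos interval set; this only sketches the delegation idea.
    ino_range delegate_inos(std::deque<ino_range>& pool, uint64_t want) {
        ino_range& front = pool.front();
        uint64_t n = std::min(want, front.len);
        ino_range out{front.start, n};
        front.start += n;
        front.len -= n;
        if (front.len == 0)
            pool.pop_front();
        return out;
    }

    int main() {
        std::deque<ino_range> prealloc{{0x10000000002, 0x1000}};
        ino_range d = delegate_inos(prealloc, 0x64);
        std::cout << std::hex << "[0x" << d.start << "~0x" << d.len << "]\n";
    }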

I'm not sure if we need new messages for this, or whether we could extend some existing messages to contain the set. We probably would want these in MClientReply (so we could replenish the client when we are responding to a create). Maybe we could update the client via other mechanisms too? I'm not sure what would work best here yet.

For now, the client can just ignore these ranges.


Related issues

Related to fs - Feature #24461: cephfs: improve file create performance buffering file create operations New 06/08/2018
Related to fs - Feature #38951: implement buffered unlink in libcephfs New

History

#1 Updated by Jeff Layton 7 months ago

  • Related to Feature #24461: cephfs: improve file create performance buffering file create operations added

#2 Updated by Patrick Donnelly 7 months ago

  • Assignee set to Jeff Layton
  • Target version set to v15.0.0
  • Start date deleted (04/05/2019)

#3 Updated by Jeff Layton 7 months ago

  • Related to Feature #38951: implement buffered unlink in libcephfs added

#4 Updated by Jeff Layton 6 months ago

We may not need this after all. The kernel client, at least, doesn't care much about the inode number. We can do pretty much anything we want with the inode in memory, and leave inode->i_ino set to 0 initially. When we get the CREATE reply, we can then fill in the inode number.

This does mean that we'll have to wait on the CREATE reply in order to do a stat(), or a statx() with STATX_INO, but that's probably fine. We'll also need to wait on that reply before we can flush dirty inode data to the OSDs, as we need to know the inode number in order to write to the objects. That said, we should be fine to write to the pagecache until that point.

#5 Updated by Patrick Donnelly 6 months ago

Jeff Layton wrote:

We may not need this after all. The kernel client, at least, doesn't care much about the inode number. We can do pretty much anything we want with the inode in memory, and leave inode->i_ino set to 0 initially. When we get the CREATE reply, we can then fill in the inode number.

This does mean that we'll have to wait on the CREATE reply in order to do a stat(), or a statx() with STATX_INO, but that's probably fine. We'll also need to wait on that reply before we can flush dirty inode data to the OSDs, as we need to know the inode number in order to write to the objects. That said, we should be fine to write to the pagecache until that point.

That may actually be a better approach, so that the MDS doesn't need to clean up after us if the client fails.

#6 Updated by Jeff Layton 5 months ago

I think we are going to need this after all. If we don't do this, we'll have to delay writing to newly-created files until the create response comes in. We won't know the inode number until then, and therefore we won't know which objects to write to.

So I think we're back to figuring out which calls we'd want to add inode range updates to. Mostly we'll want to hand these out when the MDS grants create caps on a directory, so maybe in the MClientReply and MClientCaps messages?

I still don't have a good enough feel for the MDS session management code, so I'm happy to take advice here.

#7 Updated by Jeff Layton 5 months ago

Going over the userland code today to see what's there and what can be reused. Some notes:

struct ceph_mds_request_head has this field:

   __le64 ino;   /* use this ino for openc, mkdir, mknod, etc. (if replaying) */

Can we use this field to send the inode number for new creates as well? We'll need the MDS to recognize when we're specifying the inode number on a create, so we may need some sort of capability bit or something so that it knows to do that.

Alternately, if older clients always set that field to 0 when it's not used, then we may be able to rely on that and use 0 as an "allocate me an inode number during the create" indicator.

Adding an interval_set to MClientCaps and MClientReply does seem like the most efficient way to hand the set to the client since we'll presumably need the MDS to regularly replenish it. But...it's a little weird in that the interval_set of inode numbers will be a client-wide property and those calls generally deal with a specific inode.

I'm still wondering if a new CEPH_SESSION_DELEGATED_INOS op would be a better choice, and have the MDS just push those out to clients holding at least one CEPH_CAP_DIR_CREATE cap when the client's set drops below a particular threshold. If the client can't get one because we're temporarily out, then we can always block and have it do a synchronous create.
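To sketch what I mean by a threshold (all names here are hypothetical; CEPH_SESSION_DELEGATED_INOS doesn't exist yet), the MDS-side check might look like:

    #include <cstddef>

    // Hypothetical sketch of the replenish decision; none of these names
    // exist in the MDS today.
    struct SessionSketch {
        std::size_t delegated_inos_remaining; // delegated but not yet used
        bool        has_dir_create_caps;      // holds >= 1 CEPH_CAP_DIR_CREATE
    };

    constexpr std::size_t DELEGATED_INO_LOW_WATER = 16; // arbitrary threshold

    // True when the MDS should push a fresh range to this client, e.g. via
    // a new CEPH_SESSION_DELEGATED_INOS session message.
    bool should_replenish(const SessionSketch& s) {
        return s.has_dir_create_caps &&
               s.delegated_inos_remaining < DELEGATED_INO_LOW_WATER;
    }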

Also, will we ever need to worry about recalling the set that we've handed out? I'm going to assume (for now) that the answer is no, and that we'll just release them when the session is torn down.

#8 Updated by Jeff Layton about 2 months ago

At a high level, here's what I think we need to do:

Add a new delegated_inos field to session_info_t in the MDS code. This will be allocated from prealloc_inos (or could possibly just grab a range of its own from the global pile). These will be advertised globally in the SessionMap, which should be enough to tell the other MDSes about them.

From there, we'll need to communicate them to the clients. Probably we can just extend the "extra" field in the MDS request reply. CEPH_MDS_OP_CREATE already has an "extra" blob, and the fact that we transmit the length of the blob means that we should be able to just tack an interval_set onto the end. MKDIR, MKNOD, and SYMLINK replies will need to transmit an "extra" blob as well, one that just holds the interval_set.

The kernel will need to grow the ability to decode and deal with interval_sets. I looked briefly over the code there, but didn't see any places where the kclient deals with interval_sets today. The kernel doesn't have much in the way of infrastructure for interval_sets either, so we may need to add that or emulate it with an IDA or something.

As the client creates new inodes asynchronously, it can assign them out of that set. ceph_mds_request_head has an ino field that is used for replays. I propose we use that field to also inform the MDS which inode number it should use for new requests. That is contingent on existing clients setting it to "0" on a new request; if any current ones don't, then we may need to key that behavior off of a feature bit or something.
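Sketched from the client's side (std::set standing in for whatever structure the client ends up using), and assuming 0 in the ino field really does mean "allocate one for me":

    #include <cstdint>
    #include <set>

    // Hypothetical sketch: pick the inode number to put in
    // ceph_mds_request_head.ino for a new create.
    uint64_t pick_create_ino(std::set<uint64_t>& delegated) {
        if (delegated.empty())
            return 0; // fall back to a synchronous create; MDS allocates
        uint64_t ino = *delegated.begin();
        delegated.erase(delegated.begin());
        return ino;
    }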

#9 Updated by Jeff Layton about 1 month ago

I've started working on patches to add this, but I see a potential problem. The idea is to delegate a range of inodes to a session, and to hand those out to the clients (probably in openc/mkdir/mknod/symlink replies). My thinking was to track these interval sets in the SessionMap (similarly to how prealloc_inos is tracked), but now I'm not sure that's sufficient.

What happens if the client grabs an inode from this set, and does an async create, but the session gets torn down for some reason before the openc request is sent? If the old session is purged from the sessionmap before the new session is established, then we don't really have a way to ensure continuity between the interval sets for each session.

This sort of tells me that hanging the set of delegated inodes off the session may be the wrong thing to do. Unfortunately, I'm not sure what the right thing to do is.

#10 Updated by Patrick Donnelly about 1 month ago

Jeff Layton wrote:

I've started working on patches to add this, but I see a potential problem. The idea is to delegate a range of inodes to a session, and to hand those out to the clients (probably in openc/mkdir/mknod/symlink replies). My thinking was to track these interval sets in the SessionMap (similarly to how prealloc_inos is tracked), but now I'm not sure that's sufficient.

What happens if the client grabs an inode from this set, and does an async create, but the session gets torn down for some reason before the openc request is sent? If the old session is purged from the sessionmap before the new session is established, then we don't really have a way to ensure continuity between the interval sets for each session.

This sort of tells me that hanging the set of delegated inodes off the session may be the wrong thing to do. Unfortunately, I'm not sure what the right thing to do is.

I haven't studied this bit of code but why can't we keep a closed/dead session in the SessionMap until all of its allocated inodes have been purged?

#11 Updated by Jeff Layton about 1 month ago

Patrick Donnelly wrote:

I haven't studied this bit of code but why can't we keep a closed/dead session in the SessionMap until all of its allocated inodes have been purged?

How would we purge inodes from a closed/dead session? Suppose the client drops off the net and never comes back -- at what point do we give up on it? I suppose we could consider ino_t's that were delegated to blacklisted clients to be null and void, but that seems a bit extreme.

To make this work, we'd also need to somehow allow a new session to take over the inode range that was previously held by the old one. Maybe we could consider using the reclaim interfaces we added for ganesha for this? Allow the client to reclaim an inode range from a previous version of itself?

#12 Updated by Patrick Donnelly about 1 month ago

Jeff Layton wrote:

Patrick Donnelly wrote:

I haven't studied this bit of code but why can't we keep a closed/dead session in the SessionMap until all of its allocated inodes have been purged?

How would we purge inodes from a closed/dead session? Suppose the client drops off the net and never comes back -- at what point do we give up on it? I suppose we could consider ino_t's that were delegated to blacklisted clients to be null and void, but that seems a bit extreme.

When the session is closed/blacklisted/evicted (or: the session cannot be reclaimed) we should clean up all of the inodes allocated to the client. Why is that extreme?

#13 Updated by Jeff Layton about 1 month ago

Patrick Donnelly wrote:

When the session is closed/blacklisted/evicted (or: the session cannot be reclaimed) we should clean up all of the inodes allocated to the client. Why is that extreme?

The plan was to not worry about adding a mechanism to allow the client to voluntarily return an inode number range. A client may hold no caps, but still have a set of inode numbers delegated to it. If such a client drops off the net, do we want to blacklist it?

#14 Updated by Patrick Donnelly about 1 month ago

Jeff Layton wrote:

Patrick Donnelly wrote:

When the session is closed/blacklisted/evicted (or: the session cannot be reclaimed) we should clean up all of the inodes allocated to the client. Why is that extreme?

The plan was to not worry about adding a mechanism to allow the client to voluntarily return an inode number range. A client may hold no caps, but still have a set of inode numbers delegated to it. If such a client drops off the net, do we want to blacklist it?

Generally, yes (that's how it works now?). Also, most clients hold at least some capabilities even if just on the root inode.

#15 Updated by Jeff Layton about 1 month ago

Made some progress this week after I got over the hump of going through the journaling code. I now have a patchset that assigns an interval_set of inodeno_t's to a client session for delegation.

The next step is to make the mds hand these out to the clients. I'm still trying to settle on the best way to do that. Possibly in the "extra" reply info for create (and maybe mkdir)? I'll probably start there.

I think too that we're going to need to patch the userland client to track and use delegated inode numbers, if only to allow us to write tests for it. What we could do is teach the userland client to issue creates with delegated inode numbers when it has the caps, and just do them synchronously for now.

#16 Updated by Jeff Layton 28 days ago

I added a patch that makes the MDS encode it as part of the "extra" info that gets tacked onto the create reply. I then coded up a kernel patch that has it dump anything beyond the "ino" value in hex. This is what the interval_set looks like on the wire:

[ 3358.231129] delegated_inos:00000000: 01 00 00 00 02 00 00 00 00 01 00 00 64 00 00 00
[ 3358.231134] delegated_inos:00000010: 00 00 00 00

...and here's the corresponding range in the MDS log:

[0x10000000002~0x64]

...so, it looks like an interval_set is transmitted as an array of uint64_t pairs (start and length), preceded by a 32-bit array length. The catch is that most clients currently in the field will barf on an extra blob that is larger than they expect. I have a patch queued up to fix this, but we may need to make this conditional on a feature bit or something.
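As a sanity check on that layout, here's a quick standalone decoder (the helper names are my own, nothing from the kernel) run over the 20 bytes in the dump above:

    #include <cstdint>
    #include <cstdio>

    // Little-endian readers; helper names are mine.
    static uint32_t get_le32(const uint8_t *p) {
        return p[0] | p[1] << 8 | p[2] << 16 | (uint32_t)p[3] << 24;
    }
    static uint64_t get_le64(const uint8_t *p) {
        return get_le32(p) | (uint64_t)get_le32(p + 4) << 32;
    }

    int main() {
        const uint8_t buf[] = {
            0x01, 0x00, 0x00, 0x00,                         // 32-bit count = 1
            0x02, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, // 64-bit start
            0x64, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // 64-bit length
        };
        uint32_t count = get_le32(buf);
        const uint8_t *p = buf + 4;
        for (uint32_t i = 0; i < count; i++, p += 16)
            printf("[0x%llx~0x%llx]\n",
                   (unsigned long long)get_le64(p),
                   (unsigned long long)get_le64(p + 8));
        return 0; // prints [0x10000000002~0x64], matching the MDS log
    }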

We also don't have an interval_set data structure in the kernel, so I'm thinking what I may do is use an IDA to store the range. That should be reasonably efficient, but we'll need to deal with the fact that the MDS could send us a range that includes an inode we've already used, so we may need to keep a second hash with the values in the range that are in flight.
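Roughly what I have in mind, sketched here with userspace containers standing in for the kernel's IDA (the actual patch would use the real allocator; these names are made up):

    #include <cstdint>
    #include <set>

    // Two-structure sketch: delegated inos still available, plus inos
    // currently in flight on an async create.
    struct delegated_inos {
        std::set<uint64_t> avail;
        std::set<uint64_t> in_flight;

        // Merge a newly delegated range, skipping anything already in
        // flight in case the MDS re-sends part of a range.
        void add_range(uint64_t start, uint64_t len) {
            for (uint64_t i = start; i < start + len; i++)
                if (!in_flight.count(i))
                    avail.insert(i);
        }

        // Take the next available ino, or return 0 to mean "none left,
        // do a synchronous create instead".
        uint64_t take() {
            if (avail.empty())
                return 0;
            uint64_t ino = *avail.begin();
            avail.erase(avail.begin());
            in_flight.insert(ino);
            return ino;
        }

        // Called when the create reply for `ino` comes back.
        void complete(uint64_t ino) { in_flight.erase(ino); }
    };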

#17 Updated by Patrick Donnelly 28 days ago

Jeff Layton wrote:

I added a patch that makes the MDS encode it as part of the "extra" info that gets tacked onto the create reply. I then coded up a kernel patch that has it dump anything beyond the "ino" value in hex. This is what the interval_set looks like on the wire:

[ 3358.231129] delegated_inos:00000000: 01 00 00 00 02 00 00 00 00 01 00 00 64 00 00 00
[ 3358.231134] delegated_inos:00000010: 00 00 00 00

...and here's the corresponding range in the MDS log:

[0x10000000002~0x64]

...so, it looks like an interval_set is transmitted as an array of uint64_t pairs (start and length), preceded by a 32-bit array length. The catch is that most clients currently in the field will barf on an extra blob that is larger than they expect. I have a patch queued up to fix this, but we may need to make this conditional on a feature bit or something.

I would suggest that the MDS only supply a client with one contiguous range of delegated inodes. Then you only need a start/length.

#18 Updated by Jeff Layton 28 days ago

That would make it a little simpler, but it's not really that big a deal to track an arbitrary interval_set. We have an ID allocator facility in the kernel that is quite efficient (based on xarrays), especially when dealing with small contiguous ranges (as we'll mostly be doing here), and that also provides a way to keep track of which ones are still available for use.
