Feature #39129: create mechanism to delegate ranges of inode numbers to client - CephFS - Ceph

Actions

Copy link

Feature #39129

closed

create mechanism to delegate ranges of inode numbers to client

Added by Jeff Layton about 5 years ago. Updated over 4 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Jeff Layton

Category:

Target version:

Ceph - v15.0.0

% Done:

Source:

Tags:

Backport:

Reviewed:

Affected Versions:

Component(FS):

MDS

Labels (FS):

Pull request ID:

31817

Description

Create a mechanism by which we can hand out ranges of inode numbers to MDS clients. The clients can then use those to fully instantiate inodes in memory and then flush them back to the server.

We already allocate a range of inode numbers for each client in the prealloc_inos interval set inside the MDS. What may be easiest is to just hand out smaller pieces of that range to the client via some new mechanism.

I'm not sure if we need new messages for this, or whether we could extend some existing messages to contain the set. We probably would want these in MClientReply (so we could replenish the client when we are responding to a create). Maybe we could update the client via other mechanisms too? I'm not sure what would work best here yet.

For now, the client can just ignore these ranges.

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by Jeff Layton about 5 years ago

Related to Feature #24461: cephfs: improve file create performance buffering file unlink/create operations added

Actions

Copy link

Updated by Patrick Donnelly about 5 years ago

Assignee set to Jeff Layton
Target version set to v15.0.0
Start date deleted (~~04/05/2019~~)

Actions

Copy link

Updated by Jeff Layton almost 5 years ago

Related to Feature #38951: client: implement asynchronous unlink/create added

Actions

Copy link

Updated by Jeff Layton almost 5 years ago

We may not need this after all. The kernel client at least doesn't care a lot about the inode number. We can do pretty much anything we want with the inode in memory, and leave inode->i_ino it set to 0 initially. When we get the CREATE reply, we can then fill out the inode number.

This does mean that we'll have to wait on the CREATE reply in order to do a stat(), or a statx() with STATX_INO, but that's probably fine. We'll also need to wait on that reply before we can flush dirty inode data to the OSDs, as we need to know the inode number in order to write to the objects. That said, we should be fine to write to the pagecache until that point.

Actions

Copy link

Updated by Patrick Donnelly almost 5 years ago

Jeff Layton wrote:

We may not need this after all. The kernel client at least doesn't care a lot about the inode number. We can do pretty much anything we want with the inode in memory, and leave inode->i_ino it set to 0 initially. When we get the CREATE reply, we can then fill out the inode number.

This does mean that we'll have to wait on the CREATE reply in order to do a stat(), or a statx() with STATX_INO, but that's probably fine. We'll also need to wait on that reply before we can flush dirty inode data to the OSDs, as we need to know the inode number in order to write to the objects. That said, we should be fine to write to the pagecache until that point.

That may actually be a better approach so that the MDS doesn't need to cleanup after us if the client fails.

Actions

Copy link

Updated by Jeff Layton almost 5 years ago

I think we are going to need this after all. If we don't do this, we'll have to delay writing to newly-created files until the create response comes in. We won't know the inode number until then and therefore we won't know what objects to write.

So I think we're back to figuring out to what calls we'd want to add inode range updates. Mostly we'll want to allocate these out when the MDS grants create caps on a directory, so maybe in MClientReply and MClientCaps messages?

I still don't have a good enough feel for the MDS session management code, so I'm happy to take advice here.

Actions

Copy link

Updated by Jeff Layton almost 5 years ago

Going over the userland code today to see what's there and what can be reused. Some notes:

struct ceph_mds_request_head has this field:

   __le64 ino;                    /* use this ino for openc, mkdir, mknod, 
                                     etc. (if replaying) */

Can we use this field to send the inode number for new creates as well? We will need for the MDS to recognize when we're specifying the inode number on a create, so we may need some sort of capability bit of something so that it knows to do that.

Alternately, if older clients always set that field to 0 when it's not used, then we may be able to rely on that, and use that as a "allocate me an inode number during the create" indicator.

Adding an interval_set to MClientCaps and MClientReply does seem like the most efficient way to hand the set to the client since we'll presumably need the MDS to regularly replenish it. But...it's a little weird in that the interval_set of inode numbers will be a client-wide property and those calls generally deal with a specific inode.

I'm still wondering if a new CEPH_SESSION_DELEGATED_INOS op would be better choice, and have the MDS just push those out to clients with at least one CEPH_CAP_DIR_CREATE cap when the client's set drops below a particular threshold. If the client can't get one because we're temporarily out, then we can always block and have it do a synchronous create.

Also, will we ever need to worry about recalling the set that we've handed out? I'm going to assume (for now) that the answer is no, and that we'll just release them when the session is torn down.

Actions

Copy link

Updated by Jeff Layton over 4 years ago

At a high level, here's what I think we need to do:

Add a new delegated_inos field to session_info_t in the MDS code. This will allocate ranges from prealloc_inos (or possibly could just grab a range of its own from the global pile). These will be advertised globally in the SessionMap, which should be enough to tell other MDS's about them.

From there, we'll need to communicate them to the clients. Probably, we can just extend the "extra" field in the mds request reply. CEPH_MDS_OP_CREATE already has an "extra" blob, and the fact that we transmit the length of the blob means that we should just be able to tack an interval_set onto the end. MKDIR, MKNOD and SYMLINK replies will need to transmit an "extra" blob as well that will just have the interval_set.

The kernel will need to grow the ability to decode and deal with interval_sets. I looked briefly over the code there, but didn't see any places where the kclient deals with interval_sets today. The kernel doesn't seem to have much in the way of infrastructure for dealing with interval_sets either, so we may need to add that or emulate it with IDA or something.

As the client creates new inodes asynchronously, it can assign them out of that set. ceph_mds_request_head has an ino field that is used for replays. I propose we use that field to also inform the MDS which inode value it should use for new requests as well. That is contingent on existing clients setting that to "0" on a new request. If any current ones don't then we may need to key that behavior off of a feature bit or something.

Actions

Copy link

Updated by Jeff Layton over 4 years ago

I've started working on patches to add this, but I see a potential problem. The idea is to delegate a range of inodes to a session, and to hand those out to the clients (probably in openc/mkdir/mknod/symlink replies). My thinking was to track these interval sets in the SessionMap (similarly to how prealloc_inos is tracked), but now I'm not sure that's sufficient.

What happens if the client grabs an inode from this set, and does an async create, but the session gets torn down for some reason before the openc request is sent? If the old session is purged from the sessionmap before the new session is established, then we don't really have a way to ensure continuity between the interval sets for each session.

This sort of tells me that hanging the set of delegated inodes off the session may be the wrong thing to do. Unfortunately, I'm not sure what the right thing to do is.

Actions

Copy link

#10

Updated by Patrick Donnelly over 4 years ago

Jeff Layton wrote:

I've started working on patches to add this, but I see a potential problem. The idea is to delegate a range of inodes to a session, and to hand those out to the clients (probably in openc/mkdir/mknod/symlink replies). My thinking was to track these interval sets in the SessionMap (similarly to how prealloc_inos is tracked), but now I'm not sure that's sufficient.

What happens if the client grabs an inode from this set, and does an async create, but the session gets torn down for some reason before the openc request is sent? If the old session is purged from the sessionmap before the new session is established, then we don't really have a way to ensure continuity between the interval sets for each session.

This sort of tells me that hanging the set of delegated inodes off the session may be the wrong thing to do. Unfortunately, I'm not sure what the right thing to do is.

I haven't studied this bit of code but why can't we keep a closed/dead session in the SessionMap until all of its allocated inodes have been purged?

Actions

Copy link

#11

Updated by Jeff Layton over 4 years ago

Patrick Donnelly wrote:

I haven't studied this bit of code but why can't we keep a closed/dead session in the SessionMap until all of its allocated inodes have been purged?

How would we purge inodes from a closed/dead session? Suppose the client drops off the net and never comes back -- at what point do we give up on it? I suppose we could consider ino_t's that were delegated to blacklisted clients to be null and void, but that seems a bit extreme.

To make this work too, we'd need to somehow allow a new session to take over the inode range that was previously held by the old one. Maybe we could consider using the reclaim interfaces we added for ganesha for this? Allow the client to reclaim a inode range from a previous version of itself?

Actions

Copy link

#12

Updated by Patrick Donnelly over 4 years ago

Jeff Layton wrote:

Patrick Donnelly wrote:

I haven't studied this bit of code but why can't we keep a closed/dead session in the SessionMap until all of its allocated inodes have been purged?

How would we purge inodes from a closed/dead session? Suppose the client drops off the net and never comes back -- at what point do we give up on it? I suppose we could consider ino_t's that were delegated to blacklisted clients to be null and void, but that seems a bit extreme.

When the session is closed/blacklisted/evicted (or: the session cannot be reclaimed) we should clean up all of the inodes allocated to the client. Why is that extreme?

Actions

Copy link

#13

Updated by Jeff Layton over 4 years ago

Patrick Donnelly wrote:

When the session is closed/blacklisted/evicted (or: the session cannot be reclaimed) we should clean up all of the inodes allocated to the client. Why is that extreme?

The plan was to not worry about adding a mechanism to allow the client to voluntarily return an inode number range. A client may hold no caps, but still have a set of inode numbers delegated to it. If such a client drops off the net, do we want to blacklist it?

Actions

Copy link

#14

Updated by Patrick Donnelly over 4 years ago

Jeff Layton wrote:

Patrick Donnelly wrote:

When the session is closed/blacklisted/evicted (or: the session cannot be reclaimed) we should clean up all of the inodes allocated to the client. Why is that extreme?

The plan was to not worry about adding a mechanism to allow the client to voluntarily return an inode number range. A client may hold no caps, but still have a set of inode numbers delegated to it. If such a client drops off the net, do we want to blacklist it?

Generally, yes (that's how it works now?). Also, most clients hold at least some capabilities even if just on the root inode.

Actions

Copy link

#15

Updated by Jeff Layton over 4 years ago

Made some progress this week after I got over the hump of going through the journaling. I now have a patchset that assigns an interval_set of inodeno_t's to a client session for delegation.

The next step is to make the mds hand these out to the clients. I'm still trying to settle on the best way to do that. Possibly in the "extra" reply info for create (and maybe mkdir)? I'll probably start there.

I think too that we're going to need to patch the userland client to track and use delegated inode numbers, if only to allow us to write tests for it. What we could do is teach the userland client to issue creates with delegated inode numbers when it has the caps, and just do them synchronously for now.

Actions

Copy link

#16

Updated by Jeff Layton over 4 years ago

I added a patch that makes the MDS encode it as part of the "extra" info that gets tacked onto the create reply. I then coded up a kernel patch that has it dump anything beyond the "ino" value in hex. This is what the interval_set looks like on the wire:

[ 3358.231129] delegated_inos:00000000: 01 00 00 00 02 00 00 00 00 01 00 00 64 00 00 00
[ 3358.231134] delegated_inos:00000010: 00 00 00 00

...and here's the corresponding range in the MDS log:

[0x10000000002~0x64]

...so, it looks like an interval_set is transmitted as an array of uint64_t pairs (start and length), preceded by a 32-bit array length. The catch is that most clients currently in the field will barf on an extra blob that is larger than they expect. I have a patch queued up to fix this, but we may need to make this conditional on a feature bit or something.

We also don't have an interval_set data structure in the kernel, so I'm thinking what I may do is use an IDA hash to store the range. That should be reasonably efficient, but we'll need to deal with the fact that the MDS could send us a range that includes an inode we've already used, so we may need to keep a second hash with the values in the range that are in flight.

Actions

Copy link

#17

Updated by Patrick Donnelly over 4 years ago

Jeff Layton wrote:

I added a patch that makes the MDS encode it as part of the "extra" info that gets tacked onto the create reply. I then coded up a kernel patch that has it dump anything beyond the "ino" value in hex. This is what the interval_set looks like on the wire:

[ 3358.231129] delegated_inos:00000000: 01 00 00 00 02 00 00 00 00 01 00 00 64 00 00 00
[ 3358.231134] delegated_inos:00000010: 00 00 00 00

...and here's the corresponding range in the MDS log:

[0x10000000002~0x64]

...so, it looks like an interval_set is transmitted as an array of uint64_t pairs (start and length), preceded by a 32-bit array length. The catch is that most clients currently in the field will barf on an extra blob that is larger than they expect. I have a patch queued up to fix this, but we may need to make this conditional on a feature bit or something.

I would suggest that the MDS only supply to a client one continuous range of delegated inodes. Then you only need a start/length.

Actions

Copy link

#18

Updated by Jeff Layton over 4 years ago

That would make it a little simpler, but it's not really that big a deal to track an arbitrary interval_set. We have an ID allocator facility in the kernel that is quite efficient (based on xarrays), especially when dealing with small contiguous ranges (as we'll mostly be doing here), and that also provides a way to keep track of which ones are still available for use.

Actions

Copy link

#19

Updated by Jeff Layton over 4 years ago

I have patches for this for the MDS, and the kernel, but I keep hitting a race where the client adds an already-used ino_t back into the pile of delegated ino_t's. Presumably, we're applying a now stale update to the delegated_inos set.

The problem is figuring out when it's safe to remove these values from the xarray -- i.e. when can we reliably assume that the inode number cannot show up again in the delegated_inos pile in a create response.

I've tried a number of fixes that haven't helped:

1) keep track of the r_tid that last updated the deleg_inos cache, and only accept updates from newer tids

2) timestamp the inode value when we get the create reply, and only remove the ino_t from the xarray when the r_started timestamp is later the the timestamp of the create reply. Thus, we'll only remove entries where the reply came in before we started the call.

None of these seem to fully close the race window, which makes me wonder whether I'm misunderstanding it.

Actions

Copy link

#20

Updated by Patrick Donnelly over 4 years ago

Jeff Layton wrote:

I have patches for this for the MDS, and the kernel, but I keep hitting a race where the client adds an already-used ino_t back into the pile of delegated ino_t's. Presumably, we're applying a now stale update to the delegated_inos set.

The problem is figuring out when it's safe to remove these values from the xarray -- i.e. when can we reliably assume that the inode number cannot show up again in the delegated_inos pile in a create response.

I've tried a number of fixes that haven't helped:

1) keep track of the r_tid that last updated the deleg_inos cache, and only accept updates from newer tids

2) timestamp the inode value when we get the create reply, and only remove the ino_t from the xarray when the r_started timestamp is later the the timestamp of the create reply. Thus, we'll only remove entries where the reply came in before we started the call.

None of these seem to fully close the race window, which makes me wonder whether I'm misunderstanding it.

Seems to me the kernel client needs to maintain two interval sets: (1) the set the MDS communicates to the client, only updated by the MDS; (2) the set of inodes the kernel has used. Everytime you need an inode number, you walk (1) and throw out inodes already in (2). (And a few optimizations are available to make this faster but probably not worth it.) Would this not work?

Actions

Copy link

#21

Updated by Jeff Layton over 4 years ago

The problem there is that the second set would grow without bound. It's not a lot of info per inode, but it's enough that I don't think we want to do that. We need to keep the sets trimmed down to as small as we can get away with.

The big problem is that all of the creates are not necessarily processed in a strict order on the client or MDS. My suspicion is that we occasionally get some delays and the client ends up taking an old delegated inode set as an update after a newer one. I think we need to ensure that the client only takes newer ones.

I plan to see about adding in a server-side sequence that we'll bump whenever we add new delegated inodes. We can prefix the interval_set in the reply with that value and only allow the client to take updates that are newer (as processed by the MDS) than the one it has.

With that we should be able to stop tracking any that have been used and that aren't in the latest update.

Actions

Copy link

#22

Updated by Patrick Donnelly over 4 years ago

Jeff Layton wrote:

The problem there is that the second set would grow without bound. It's not a lot of info per inode, but it's enough that I don't think we want to do that. We need to keep the sets trimmed down to as small as we can get away with.

Assuming contiguous inodes, the second set should usually be a single interval, a start and length. Why would there be holes or growth without bound?

The big problem is that all of the creates are not necessarily processed in a strict order on the client or MDS. My suspicion is that we occasionally get some delays and the client ends up taking an old delegated inode set as an update after a newer one. I think we need to ensure that the client only takes newer ones.

This should not be possible. Client messages are always processed in order. The same should be true for reply handling in the kernel.

I plan to see about adding in a server-side sequence that we'll bump whenever we add new delegated inodes. We can prefix the interval_set in the reply with that value and only allow the client to take updates that are newer (as processed by the MDS) than the one it has.

With that we should be able to stop tracking any that have been used and that aren't in the latest update.

Actions

Copy link

#23

Updated by Jeff Layton over 4 years ago

Patrick Donnelly wrote:

Jeff Layton wrote:

The problem there is that the second set would grow without bound. It's not a lot of info per inode, but it's enough that I don't think we want to do that. We need to keep the sets trimmed down to as small as we can get away with.

Assuming contiguous inodes, the second set should usually be a single interval, a start and length. Why would there be holes or growth without bound?

You can't make that assumption. Given that we're divvying up ino_t's into separate pools (preallocated and delegated), fragmentation is a given.

Also, I'm not tracking these in an interval_set in the kernel, as there isn't such an abstraction there. I may eventually need to add one, but for now, I'm not going down that road.

I'm using an xarray, which is based on top of the kernel's radix tree code, as I also need to keep track of a little info per ino_t (though in principle we could use multiple interval_sets or something like it to do this too).

The big problem is that all of the creates are not necessarily processed in a strict order on the client or MDS. My suspicion is that we occasionally get some delays and the client ends up taking an old delegated inode set as an update after a newer one. I think we need to ensure that the client only takes newer ones.

This should not be possible. Client messages are always processed in order. The same should be true for reply handling in the kernel.

In what order? In order of the tid? Serializing every call on a session sounds terrible for performance.

Regardless, in the kernel, reply processing is handed off to workqueues, and there's no specific ordering there (nor do I think we want it).

Actions

Copy link

#24

Updated by Patrick Donnelly over 4 years ago

Jeff Layton wrote:

The big problem is that all of the creates are not necessarily processed in a strict order on the client or MDS. My suspicion is that we occasionally get some delays and the client ends up taking an old delegated inode set as an update after a newer one. I think we need to ensure that the client only takes newer ones.

This should not be possible. Client messages are always processed in order. The same should be true for reply handling in the kernel.

In what order? In order of the tid? Serializing every call on a session sounds terrible for performance.

In the order they come off the wire. The MDS must process client messages in order otherwise e.g. cap messages may release caps before a client's update is processed.

Regardless, in the kernel, reply processing is handed off to workqueues, and there's no specific ordering there (nor do I think we want it).

That's probably how you're seeing the asynchronous behavior. It's probably fine the kernel handles replies out of order but the MDS cannot.

Actions

Copy link

#25

Updated by Zheng Yan over 4 years ago

Jeff Layton wrote:

I have patches for this for the MDS, and the kernel, but I keep hitting a race where the client adds an already-used ino_t back into the pile of delegated ino_t's. Presumably, we're applying a now stale update to the delegated_inos set.

The problem is figuring out when it's safe to remove these values from the xarray -- i.e. when can we reliably assume that the inode number cannot show up again in the delegated_inos pile in a create response.

I've tried a number of fixes that haven't helped:

1) keep track of the r_tid that last updated the deleg_inos cache, and only accept updates from newer tids

2) timestamp the inode value when we get the create reply, and only remove the ino_t from the xarray when the r_started timestamp is later the the timestamp of the create reply. Thus, we'll only remove entries where the reply came in before we started the call.

None of these seem to fully close the race window, which makes me wonder whether I'm misunderstanding it.

how about only encoding newly delegated inode numbers into request reply. If mds crashes, client clears all of its delegated inode numbers.

Actions

Copy link

#26

Updated by Jeff Layton over 4 years ago

That's not a bad idea. We'd have to keep track of a separate set of newly-added ino_t's to send in the reply, but that's probably doable. Let me ponder it.

Actions

Copy link

#27

Updated by Zheng Yan over 4 years ago

Jeff Layton wrote:

That's not a bad idea. We'd have to keep track of a separate set of newly-added ino_t's to send in the reply, but that's probably doable. Let me ponder it.

why do we need to keep separate set? For client, just add delegated inos to s_delegated_inos when getting a reply. remove ino from s_delegated_inos when client uses a delegated ino. For mds, delegate new inos to client at the time it send a reply.

Actions

Copy link

#28

Updated by Jeff Layton over 4 years ago

Pull request ID set to 31817

You're right. I just pushed a patch to be squashed in on top of the existing series. I'm testing it now with the client-side code I have and it seems to be doing fine so far. Thanks for the suggestion!

Actions

Copy link

#29

Updated by Patrick Donnelly over 4 years ago

Status changed from New to Fix Under Review

Actions

Copy link

#30

Updated by Patrick Donnelly over 4 years ago

Status changed from Fix Under Review to Resolved

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » CephFS

Custom queries

Feature #39129

create mechanism to delegate ranges of inode numbers to client

Updated by Jeff Layton about 5 years ago

Updated by Patrick Donnelly about 5 years ago

Updated by Jeff Layton almost 5 years ago

Updated by Jeff Layton almost 5 years ago

Updated by Patrick Donnelly almost 5 years ago

Updated by Jeff Layton almost 5 years ago

Updated by Jeff Layton almost 5 years ago

Updated by Jeff Layton over 4 years ago

Updated by Jeff Layton over 4 years ago

Updated by Patrick Donnelly over 4 years ago

Updated by Jeff Layton over 4 years ago

Updated by Patrick Donnelly over 4 years ago

Updated by Jeff Layton over 4 years ago

Updated by Patrick Donnelly over 4 years ago

Updated by Jeff Layton over 4 years ago

Updated by Jeff Layton over 4 years ago

Updated by Patrick Donnelly over 4 years ago

Updated by Jeff Layton over 4 years ago

Updated by Jeff Layton over 4 years ago

Updated by Patrick Donnelly over 4 years ago

Updated by Jeff Layton over 4 years ago

Updated by Patrick Donnelly over 4 years ago

Updated by Jeff Layton over 4 years ago

Updated by Patrick Donnelly over 4 years ago

Updated by Zheng Yan over 4 years ago

Updated by Jeff Layton over 4 years ago

Updated by Zheng Yan over 4 years ago

Updated by Jeff Layton over 4 years ago

Updated by Patrick Donnelly over 4 years ago

Updated by Patrick Donnelly over 4 years ago