O_TMPFILE support in libcephfs
nfs-ganesha could make use of the ability to create a disconnected inode (pinned only by an open file descriptor) that can be linked into place later. If libcephfs supported that, it would allow closing some potential races that can occur when an OPEN RPC fails.
The kernel implements this with the O_TMPFILE open flag, and I think we'll want to do the same in libcephfs.
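For reference, the kernel behaviour being mirrored can be sketched on a local Linux filesystem as below. This uses the kernel's O_TMPFILE and linkat(2) through Python's os module; it is not a libcephfs API. The helper name create_then_link is made up for illustration, and the sketch assumes /proc is mounted and that the data is small enough for a single write.

```python
import errno
import os


def create_then_link(directory, name, data):
    """Write data to an unnamed inode, then link it into directory as name.

    Returns the final path, or None if O_TMPFILE is unavailable here.
    (Hypothetical helper, for illustration only.)
    """
    o_tmpfile = getattr(os, "O_TMPFILE", None)  # Linux-only flag
    if o_tmpfile is None:
        return None
    dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
    try:
        try:
            # No directory entry exists yet; only this descriptor pins the inode.
            fd = os.open(directory, o_tmpfile | os.O_WRONLY, 0o600)
        except OSError as exc:
            if exc.errno in (errno.EOPNOTSUPP, errno.EISDIR):
                return None  # filesystem or kernel lacks O_TMPFILE
            raise
        try:
            os.write(fd, data)
            # Passing dst_dir_fd routes this through linkat(2); with
            # follow_symlinks left at its default of True, the /proc magic
            # link is resolved to the unnamed inode itself.
            os.link("/proc/self/fd/%d" % fd, name, dst_dir_fd=dir_fd)
        finally:
            os.close(fd)
    finally:
        os.close(dir_fd)
    return os.path.join(directory, name)
```

If the link step fails or is never reached, closing the descriptor is all the cleanup needed: the inode never had a name, so nothing is left behind in the directory.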
#1 Updated by John Spray over 3 years ago
Main decision here is probably whether it should be a stray or some new mechanism.
Strays feel like overkill here as the temporary inode only lives until the client either links it or the client session ends, whereas strays are truly persistent things.
We could make it an entirely in-memory thing from the MDS's point of view if the client would include it in client replay.
Or I wonder if we could do something like giving the client an early reply for an openc(O_TMPFILE), but leaving the request effectively in flight until they link? That feels weird.
#2 Updated by Jeff Layton over 3 years ago
I think it makes sense to optimize for the success case here. In most cases, the link will be successful and it'll end up being a persistent inode. So, there may not be much benefit to doing it all in memory if doing the "rename" into the permanent location is relatively cheap? I guess it depends on how much extra overhead there is in having to track a stray entry.
#3 Updated by John Spray over 3 years ago
The stray would end up getting journaled, but probably never written to the backing store, as long as the link operation came along before the journal entry expired (basically certain). So it's more of an extra journal write than an extra IO, I suppose. If doing it with strays is code-simpler then it's probably an acceptable cost.
#4 Updated by Greg Farnum over 3 years ago
I'm pretty skeptical that doing it ephemerally (without initially setting it up as a journaled stray) is a feasible strategy. We'd need to handle cleanup in the case where the client writes to the OSDs and then closes the file without linking it into place; we'd need to handle allocating an inode without writing it down somewhere; etc.
If we just do it as a stray inode now, we can easily migrate it into the client-side inode allocation whenever we finally implement that. But hacking a special case in for it would be a lot of work with, I think, not much payoff, and another dimension of complexity to account for in recovery cases etc.
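To make the stray-based option concrete, here is a toy in-memory model of the flow discussed so far: journal the new inode as a stray, reintegrate it on link, purge it if the session goes away first. This is not MDS code. ToyMDS, its methods, and the journal event names are all invented for illustration, and real stray handling is far more involved.

```python
import itertools


class ToyMDS:
    """Toy stand-in for an MDS: a journal, a stray directory, a namespace."""

    def __init__(self):
        self._ino = itertools.count(0x1000)
        self.journal = []    # ordered log of metadata events
        self.strays = {}     # ino -> owning client session
        self.namespace = {}  # path -> ino

    def openc_tmpfile(self, session):
        # Allocate the inode and record it as a stray before replying,
        # so the allocation is written down somewhere.
        ino = next(self._ino)
        self.strays[ino] = session
        self.journal.append(("create_stray", ino))
        return ino

    def link(self, session, ino, path):
        # Success case: move the stray to its permanent name.
        if self.strays.get(ino) != session:
            raise KeyError("inode %#x is not an unlinked stray of this session" % ino)
        if path in self.namespace:
            raise FileExistsError(path)
        del self.strays[ino]
        self.namespace[path] = ino
        self.journal.append(("link", ino, path))

    def close_session(self, session):
        # Failure case: anything still unlinked is purged.
        for ino in [i for i, s in self.strays.items() if s == session]:
            del self.strays[ino]
            self.journal.append(("purge_stray", ino))
```

In this model the success path costs one extra journal event ("create_stray") compared with a plain create, which matches the "extra journal write" estimate in comment #3.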
#5 Updated by John Spray over 3 years ago
I was assuming that when doing it ephemerally we would not be allowing any data IO operations on the inode until it was linked somewhere, to avoid the cleanup -- the idea would be that there would be no recovery path at all other than clientreplay.
My aversion to strays in principle can definitely be overcome by code simplicity in practice.
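For comparison, the ephemeral variant described in this comment might be modelled as below: nothing is journaled, data IO is refused until the inode has a name, and the only recovery path is the client re-announcing what it holds open. Again a toy sketch, not MDS code. EphemeralTable and its methods are invented for illustration, and the inode number is passed in by the caller, which sidesteps the allocation question raised in comment #4.

```python
class EphemeralTable:
    """Toy model: unlinked inodes live only in MDS memory."""

    def __init__(self):
        self.unlinked = {}  # ino -> session, never journaled
        self.linked = {}    # ino -> path

    def openc_tmpfile(self, session, ino):
        self.unlinked[ino] = session

    def may_do_data_io(self, ino):
        # Data IO is refused until the inode has a name, so an abandoned
        # inode never leaves objects behind that would need cleaning up.
        return ino in self.linked

    def link(self, session, ino, path):
        if self.unlinked.get(ino) != session:
            raise KeyError(ino)
        del self.unlinked[ino]
        self.linked[ino] = path

    def restart(self):
        # An MDS restart forgets every unlinked inode...
        self.unlinked.clear()

    def client_replay(self, session, open_tmp_inos):
        # ...and the client re-announces the ones it still holds open.
        for ino in open_tmp_inos:
            self.unlinked[ino] = session
```

The may_do_data_io restriction is exactly where this model diverges from the kernel's O_TMPFILE, which allows IO before the link.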
#6 Updated by Jeff Layton over 3 years ago
Yeah, with Linux's O_TMPFILE you can definitely do I/O to the inode before it's linked, and I think it'd be good to mirror those semantics if we can. That said, we're not required to use O_TMPFILE here. We could create some other mechanism for doing this.
But, the more we discuss it, the more it sounds like just using the strays infrastructure is the right thing to do. Note that we can always change the internal implementation later if we need to, as long as we don't break existing apps that rely on it.