Project

General

Profile

Feature #39098

mds: lock caching for asynchronous unlink

Added by Jeff Layton 7 months ago. Updated 3 months ago.

Status:
New
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS):
MDS
Labels (FS):
Pull request ID:

Description

In order to allow the client to asynchronously delete files, we need Fx caps on the parent, and Lx caps on the inode of the dentry being unlinked (because the link count will change). The MDS does not seem to hand out Lx caps today. When I use cephfs-shell to create a file, I see this in the client debug log:

2019-04-03 15:43:18.276 7f0fc17fa700 10 client.4196 add_update_cap issued - -> pAsxLsXsxFsxcrwb from mds.0 on 0x10000000002.head(faked_ino=0 ref=0 ll_ref=0 cap_refs={} open={} mode=100666 size=0/0 nlink=1 btime=2019-04-03 15:43:18.276971 mtime=2019-04-03 15:43:18.276971 ctime=2019-04-03 15:43:18.276971 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) objectset[0x10000000002 ts 0/0 objects 0 dirty_or_tx 0] 0x7f0fb4006f40)

Notably absent from the caps= set is Lx. It would be good if the MDS would grant those by default, particularly when we're handing out exclusive caps of other flavors.

0001-MDS-only-take-linklock-in-unlink-if-client-doesn-t-h.patch View - PATCH: only acquire linklock if client doesn't hold Lx (1.46 KB) Jeff Layton, 04/05/2019 03:29 PM


Related issues

Related to fs - Feature #38951: implement buffered unlink in libcephfs New

History

#1 Updated by Jeff Layton 7 months ago

  • Related to Feature #38951: implement buffered unlink in libcephfs added

#2 Updated by Patrick Donnelly 7 months ago

  • Subject changed from allow the MDS to hand out Lx caps to mds: allow handing out Lx caps
  • Assignee set to Zheng Yan
  • Priority changed from Normal to High
  • Target version set to v15.0.0
  • Start date deleted (04/03/2019)
  • Source set to Development

#3 Updated by Zheng Yan 7 months ago

please try following patch

diff --git a/src/client/Client.cc b/src/client/Client.cc
index 2cf1a9e731..fc7a47ee23 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -3504,6 +3509,16 @@ void Client::check_caps(Inode *in, unsigned flags)
       // journal max_size=0.
       if (in->max_size == 0)
        retain |= CEPH_CAP_ANY_RD;
+
+      if (in->nlink == 1 &&
+         (issued & (CEPH_CAP_LINK_SHARED | CEPH_CAP_LINK_EXCL)) &&
+         in->dentries.size() == 1) {
+       Inode *pin = in->get_first_parent()->dir->parent_inode;
+       if (pin->caps_issued_mask(CEPH_CAP_FILE_EXCL)) {
+         wanted = CEPH_CAP_LINK_SHARED | CEPH_CAP_LINK_EXCL;
+         retain |= wanted;
+       }
+      }
     }
   }

#4 Updated by Jeff Layton 7 months ago

I added that to the userland client and then used cephfs-shell to create a directory and a file inside it. Neither create resulted in the inode getting Lx caps:

2019-04-04 06:52:49.679 7fe7a3fff700 10 client.4201 add_update_cap issued - -> pAsxLsXsxFsx from mds.0 on 0x10000000000.head(faked_ino=0 ref=0 ll_ref=0 cap_refs={} open={} mode=40777 size=0/0 nlink=1 btime=2019-04-04 06:52:49.639626 mtime=2019-04-04 06:52:49.639626 ctime=2019-04-04 06:52:49.639626 caps=pAsxLsXsxFsx(0=pAsxLsXsxFsx) 0x7fe79c005f50)

2019-04-04 06:53:16.700 7fe7a3fff700 10 client.4201 add_update_cap issued - -> pAsxLsXsxFsxcrwb from mds.0 on 0x10000000001.head(faked_ino=0 ref=0 ll_ref=0 cap_refs={} open={} mode=100666 size=0/0 nlink=1 btime=2019-04-04 06:53:16.700041 mtime=2019-04-04 06:53:16.700041 ctime=2019-04-04 06:53:16.700041 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) objectset[0x10000000001 ts 0/0 objects 0 dirty_or_tx 0] 0x7fe79c006a80)

#5 Updated by Zheng Yan 7 months ago

please following mds patch

diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 5f49917e12..daef063bf2 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -4155,6 +4155,7 @@ void Server::handle_client_openc(MDRequestRef& mdr)
   // do the open
   Capability *cap = mds->locker->issue_new_caps(in, cmode, mdr->session, realm, req->is_replay());
   in->authlock.set_state(LOCK_EXCL);
+  in->linklock.set_state(LOCK_EXCL);
   in->xattrlock.set_state(LOCK_EXCL);

   if (cap && (cmode & CEPH_FILE_MODE_WR)) {
@@ -5729,6 +5730,7 @@ void Server::handle_client_mknod(MDRequestRef& mdr)
       // put locks in excl mode
       newi->filelock.set_state(LOCK_EXCL);
       newi->authlock.set_state(LOCK_EXCL);
+      newi->linklock.set_state(LOCK_EXCL);
       newi->xattrlock.set_state(LOCK_EXCL);

       dout(15) << " setting a client_range too, since this is a regular file" << dendl;

#6 Updated by Zheng Yan 7 months ago

following path improves the chance that client get Fsx on directory.

diff --git a/src/client/Client.cc b/src/client/Client.cc
index 2cf1a9e731..9d189473a7 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -3489,9 +3489,7 @@ void Client::check_caps(Inode *in, unsigned flags)
   if (!unmounting && in->nlink > 0) {
     if (wanted) {
       retain |= CEPH_CAP_ANY;
-    } else if (in->is_dir() &&
-              (issued & CEPH_CAP_FILE_SHARED) &&
-              (in->flags & I_COMPLETE)) {
+    } else if (in->is_dir()) {
       // we do this here because we don't want to drop to Fs (and then
       // drop the Fs if we do a create!) if that alone makes us send lookups
       // to the MDS. Doing it in in->caps_wanted() has knock-on effects elsewhere

#7 Updated by Jeff Layton 7 months ago

Thanks Zheng, with both of those patches, I get Lx caps on newly-created files, but not on new directories. This is true via cephfs-shell and kcephfs (with my patches).

#8 Updated by Jeff Layton 7 months ago

Now that I think about it, Lx may not be required for rmdir anyway:

We only need that for files that can be hardlinked. A directory can't be hardlinked, so if we have Fx on the parent, then we know that the link count cannot go to 0.

What we do need to do for a directory though is to ensure that it's empty. So we need Fs caps on it, plus that it's complete and empty.

To remove a directory asynchronously, I think we need Fs caps on that inode and to know that it's complete and empty. So I think the MDS patch is probably sufficient here.

EDIT: Patrick pointed out that we track the link count on a directory as the number of subdirectories inside it, so we probably would need Lx caps on the parent for an rmdir. For the directory inode being removed, we'd just need to know that it's empty, and I don't think we'd need Lx caps for that.

At this point, I say let's not worry about trying to do rmdir asynchronously. We can always add that in later.

#9 Updated by Jeff Layton 7 months ago

This does help the MDS to hand out Lx caps on newly created files, but I'm having some trouble getting the cap handling right. What I'm seeing is that once we issue the UNLINK call to the MDS, it then tries to revoke Lx caps from the client. The client is holding references to those caps though, so we deadlock.

I tried adding this (in kernel client) when we go to issue an async unlink call, but it didn't seem to help.

req->r_inode_unless = CEPH_CAP_LINK_EXCL;

How to we prevent a Lx cap revoke in this case?

#10 Updated by Zheng Yan 7 months ago

mds takes xlock on linklock when unlinking file, which always revokes Lsx. why does client want to keep Lx after sending request to mds.

#11 Updated by Jeff Layton 7 months ago

Zheng Yan wrote:

mds takes xlock on linklock when unlinking file, which always revokes Lsx. why does client want to keep Lx after sending request to mds.

I don't think we care whether the client keeps Lx after the operation. What we do care about is preventing a race condition where Lx is revoked after we've checked for it, but before transmitting the request.

For instance, suppose I check and find that I have Fx on the parent and Lx on the inode of the dentry being unlinked, so I decide to do an unlink asynchronously. Before I can transmit the call though, Lx is revoked and the client returns it. At that point, we'd be doing an asynchronous unlink when we shouldn't.

My thinking was that we should hold references to those caps over the request, and that the MDS should avoid taking the linklock on the inode and filelock on the parent when the client holds those caps.

Note that I have some similar concerns about the permission checks that we have been aggressively delegating to the client. Any operation that relies on a client-side permission check really ought to require that the client hold As caps over the life of the request, or it too is going to be subject to ToC/ToU races.

#12 Updated by Jeff Layton 7 months ago

Continuing that thought, I see that handle_client_unlink does this unconditionally:

  lov.add_xlock(&in->linklock);                                                 

What I think we ought to do is have the MDS check whether the client has Lx caps on "in" already, and simply avoid doing the above call if it does. Basically, allow the MDS to assume that it already holds the exclusive linklock if the client that issued the request has Lx caps.

That doesn't look too difficult to determine, but what I'm not sure of is whether we also need to prevent those caps from being revoked until the unlink operation is complete. One would think so.

#13 Updated by Jeff Layton 7 months ago

This patch seems to do the right thing for me. I'm not sure whether this is the right approach though, and we will need to consider legacy client behavior too.

#14 Updated by Patrick Donnelly 7 months ago

Jeff Layton wrote:

Zheng Yan wrote:

mds takes xlock on linklock when unlinking file, which always revokes Lsx. why does client want to keep Lx after sending request to mds.

I don't think we care whether the client keeps Lx after the operation. What we do care about is preventing a race condition where Lx is revoked after we've checked for it, but before transmitting the request.

We care in the case of rmdir. It unnecessarily slows down asynchronous recursive `rm -rf` as the Lx on the parent dir gets revoked with each rmdir.

For instance, suppose I check and find that I have Fx on the parent and Lx on the inode of the dentry being unlinked, so I decide to do an unlink asynchronously. Before I can transmit the call though, Lx is revoked and the client returns it. At that point, we'd be doing an asynchronous unlink when we shouldn't.

The client would not return the Lx cap as long as you obtain a reference to it, as I understand things.

My thinking was that we should hold references to those caps over the request, and that the MDS should avoid taking the linklock on the inode and filelock on the parent when the client holds those caps.

This makes sense to me.

#15 Updated by Patrick Donnelly 7 months ago

Jeff Layton wrote:

This patch seems to do the right thing for me. I'm not sure whether this is the right approach though, and we will need to consider legacy client behavior too.

If it's an issue for older clients, we can use a cephfs feature bit for this.

#16 Updated by Zheng Yan 7 months ago

Jeff Layton wrote:

Zheng Yan wrote:

mds takes xlock on linklock when unlinking file, which always revokes Lsx. why does client want to keep Lx after sending request to mds.

I don't think we care whether the client keeps Lx after the operation. What we do care about is preventing a race condition where Lx is revoked after we've checked for it, but before transmitting the request.

For instance, suppose I check and find that I have Fx on the parent and Lx on the inode of the dentry being unlinked, so I decide to do an unlink asynchronously. Before I can transmit the call though, Lx is revoked and the client returns it. At that point, we'd be doing an asynchronous unlink when we shouldn't.

My thinking was that we should hold references to those caps over the request, and that the MDS should avoid taking the linklock on the inode and filelock on the parent when the client holds those caps.

If mds revokes Lx or Fx, client should flush the async unlinks immediately

Note that I have some similar concerns about the permission checks that we have been aggressively delegating to the client. Any operation that relies on a client-side permission check really ought to require that the client hold As caps over the life of the request, or it too is going to be subject to ToC/ToU races.

#17 Updated by Zheng Yan 7 months ago

Zheng Yan wrote:

Jeff Layton wrote:

Zheng Yan wrote:

mds takes xlock on linklock when unlinking file, which always revokes Lsx. why does client want to keep Lx after sending request to mds.

I don't think we care whether the client keeps Lx after the operation. What we do care about is preventing a race condition where Lx is revoked after we've checked for it, but before transmitting the request.

For instance, suppose I check and find that I have Fx on the parent and Lx on the inode of the dentry being unlinked, so I decide to do an unlink asynchronously. Before I can transmit the call though, Lx is revoked and the client returns it. At that point, we'd be doing an asynchronous unlink when we shouldn't.

My thinking was that we should hold references to those caps over the request, and that the MDS should avoid taking the linklock on the inode and filelock on the parent when the client holds those caps.

If mds revokes Lx or Fx, client should flush the async unlinks (on the file or under the directory) immediately

Note that I have some similar concerns about the permission checks that we have been aggressively delegating to the client. Any operation that relies on a client-side permission check really ought to require that the client hold As caps over the life of the request, or it too is going to be subject to ToC/ToU races.

#18 Updated by Jeff Layton 7 months ago

Zheng Yan wrote:

If mds revokes Lx or Fx, client should flush the async unlinks (on the file or under the directory) immediately

The patchset I have flushes them immediately either way. If the caps are revoked then what we'll do is stop taking extra references to those caps when someone issues an unlink and start waiting on the replies for new unlink calls.

#19 Updated by Zheng Yan 6 months ago

With change in http://tracker.ceph.com/issues/39354, Lx is not needed for async unlink. (Other request will release xlocks on the linklock when it detects the 'outoforder' wrlock on filelock)

#20 Updated by Patrick Donnelly 6 months ago

Zheng Yan wrote:

With change in http://tracker.ceph.com/issues/39354, Fx is not needed for async unlink. (Other request will release xlocks on the linklock when it detects the 'outoforder' wrlock)

You're saying we don't need Fx on the containing directory to do async unlink? Or you mean we don't need Fx on the file the client is unlinking (that I would agree with!).

#21 Updated by Zheng Yan 6 months ago

Sorry, I mean we don't need Lx

#22 Updated by Jeff Layton 6 months ago

Zheng Yan wrote:

Sorry, I mean we don't need Lx

I'm not sure I understand. What if other clients have Ls on the file being unlinked at the time? Don't we need to ensure those caps get recalled so that the client can locally change the link count on the file while the call is in flight?

The rules that Patrick and I worked out were:

To async unlink a normal file, you need:
Fx on the parent dir
Lx on the inode being unlinked

To async rmdir a directory, you need:
Fx+Lx on the parent dir

The reasons for the difference there is that:

1) in ceph, the link count on a directory is the number of subdirectories, so we need Lx on the parent in order to set that correctly

2) since directories can't be hardlinked, we know that there cannot be more than one dentry associated with one, so the link count on the directory being removed is not terribly important

#23 Updated by Zheng Yan 6 months ago

Jeff Layton wrote:

Zheng Yan wrote:

Sorry, I mean we don't need Lx

I'm not sure I understand. What if other clients have Ls on the file being unlinked at the time? Don't we need to ensure those caps get recalled so that the client can locally change the link count on the file while the call is in flight?

mds will recall the Ls cap as normal when handling the unlink request.

The rules that Patrick and I worked out were:

To async unlink a normal file, you need:
Fx on the parent dir
Lx on the inode being unlinked

As long as client holds Fx caps of parent dir, no one else can unlink/rename the file. Other request that tries unlink/renaming the file can't take wrlock on filelock of the parent dir. It will also release all locks when it detects that filelock has out-of-order wrlock. From this point of view, client also can do async unlink when it only holds Lx caps on the inode being unlinked.

To async rmdir a directory, you need:
Fx+Lx on the parent dir

The reasons for the difference there is that:

1) in ceph, the link count on a directory is the number of subdirectories, so we need Lx on the parent in order to set that correctly

We do this in client code (actually link count of dir is control by Fs caps). In mds internal, link count of directory inode is either 1 or 0.

2) since directories can't be hardlinked, we know that there cannot be more than one dentry associated with one, so the link count on the directory being removed is not terribly important

#24 Updated by Jeff Layton 6 months ago

Zheng Yan wrote:

Jeff Layton wrote:

Zheng Yan wrote:

Sorry, I mean we don't need Lx

I'm not sure I understand. What if other clients have Ls on the file being unlinked at the time? Don't we need to ensure those caps get recalled so that the client can locally change the link count on the file while the call is in flight?

mds will recall the Ls cap as normal when handling the unlink request.

I don't think that's enough.

That means that you'll have a window of indeterminate length after the unlink() call returns and the caps get recalled by the server. During that window of time, the client that does the unlink() will see a link count of N-1 whereas any other client will see a link count of N.

I think we do need Lx caps here, as that would cause a statx(..., STATX_NLINK, ...) call on another client from returning until Lx is returned.

The rules that Patrick and I worked out were:

To async unlink a normal file, you need:
Fx on the parent dir
Lx on the inode being unlinked

As long as client holds Fx caps of parent dir, no one else can unlink/rename the file. Other request that tries unlink/renaming the file can't take wrlock on filelock of the parent dir. It will also release all locks when it detects that filelock has out-of-order wrlock. From this point of view, client also can do async unlink when it only holds Lx caps on the inode being unlinked.

...unless there are hardlinks to the file in different directories. A second client could have opened the file via one of those hardlinks and gotten Ls caps on it. Nothing would prevent it from seeing an incorrect link count until the unlink is actually processed.

For instance:

client1$ mkdir foo/
client1$ mkdir bar/
client1$ touch foo/baz
client1$ ln foo/baz bar/baz

client2 opens bar/baz for read

client1 does async unlink of foo/baz because it has Fx caps on dir.

client2 does statx for STATX_NLINK on bar/baz fd. It has Ls caps on it, and returns a link count of 2.

I suppose we could say that client1 wouldn't have a link count on the inode being unlinked at that point, as it would have to return Ls caps to the MDS in order for the unlink to proceed there.

That might work if that's what you were thinking.

We do this in client code (actually link count of dir is control by Fs caps). In mds internal, link count of directory inode is either 1 or 0.

Ok, that makes sense I think.

#25 Updated by Zheng Yan 6 months ago

Jeff Layton wrote:

Zheng Yan wrote:

Jeff Layton wrote:

Zheng Yan wrote:

Sorry, I mean we don't need Lx

I'm not sure I understand. What if other clients have Ls on the file being unlinked at the time? Don't we need to ensure those caps get recalled so that the client can locally change the link count on the file while the call is in flight?

mds will recall the Ls cap as normal when handling the unlink request.

I don't think that's enough.

That means that you'll have a window of indeterminate length after the unlink() call returns and the caps get recalled by the server. During that window of time, the client that does the unlink() will see a link count of N-1 whereas any other client will see a link count of N.

I think we do need Lx caps here, as that would cause a statx(..., STATX_NLINK, ...) call on another client from returning until Lx is returned.

Yes, this can causes inconsistency. But it's not unique to link count. For example, one client does async create/unlink, another client do lookup. To be consistent, the client (that does create/unlink) needs to exclusively hold dentry.

I'd like to ignore these inconsistency for now. Later, we may do some trick on the directory caps.

#26 Updated by Greg Farnum 6 months ago

Zheng Yan wrote:

Yes, this can causes inconsistency. But it's not unique to link count. For example, one client does async create/unlink, another client do lookup. To be consistent, the client (that does create/unlink) needs to exclusively hold dentry.

I'd like to ignore these inconsistency for now. Later, we may do some trick on the directory caps.

I don't think that's plausible. In addition to just violating our "we are strictly consistent" statement, this behavior in particular will cause problems with some of the most common patterns in distributed FSes where clients use the existence or non-existence of files as coordination triggers. We'd have changed CephFS so rsync doesn't suck at the expense of breaking things as simple as some webservers!

#27 Updated by Jeff Layton 6 months ago

Zheng Yan wrote:

Yes, this can causes inconsistency. But it's not unique to link count. For example, one client does async create/unlink, another client do lookup. To be consistent, the client (that does create/unlink) needs to exclusively hold dentry.

I don't quite get this. How could another client do a lookup while this client holds Fx on the directory? Wouldn't that need to be recalled first? I would think that having Fx on a directory would preclude you from handing out dentry leases at all. The client with Fx caps wouldn't need them, and if you have other clients with a dentry lease then you can't hand out Fx in the first place.

I'd like to ignore these inconsistency for now. Later, we may do some trick on the directory caps.

For a prototype, sure, but we will need to sort out the rules before we can merge any of this.

So to be clear, what caps are needed to do unlink and link? Once we have that settled we can think about rename.

#28 Updated by Zheng Yan 5 months ago

Jeff Layton wrote:

Zheng Yan wrote:

Yes, this can causes inconsistency. But it's not unique to link count. For example, one client does async create/unlink, another client do lookup. To be consistent, the client (that does create/unlink) needs to exclusively hold dentry.

I don't quite get this. How could another client do a lookup while this client holds Fx on the directory? Wouldn't that need to be recalled first? I would think that having Fx on a directory would preclude you from handing out dentry leases at all. The client with Fx caps wouldn't need them, and if you have other clients with a dentry lease then you can't hand out Fx in the first place.

- To lookup a dentry, mds only need to rdlock corresponding dentry's lock. Having Fx on a directory does not prevent other client from rdlocking child dentry.
- To create/unlink a dentry, mds needs to wrlock parent inode's filelock/nestlock, xlock corresponding dentry's lock.

Filelock/nestlock of directory inode are more about statistics of directry/subtree. Having Fx on a directory just mean client may update the statistics. MDS may issue leases for any child dentries to any clients.

To handle this, we needs to introduce exclusive lease. if one client has exclusive lease, mds does not issue lease to other client.

I'd like to ignore these inconsistency for now. Later, we may do some trick on the directory caps.

For a prototype, sure, but we will need to sort out the rules before we can merge any of this.

So to be clear, what caps are needed to do unlink and link? Once we have that settled we can think about rename

#29 Updated by Jeff Layton 3 months ago

  • Subject changed from mds: allow handing out Lx caps to mds: lock caching for asynchronous unlink

Also available in: Atom PDF