Bug #54046
closed: unaccessible dentries after fsstress run with namespace-restricted caps
Added by Jeff Layton over 2 years ago. Updated over 1 year ago.
Description
I did an xfstests run in a subvolume with namespace restricted caps. Test generic/070 does an fsstress run, and then tries to delete the tree that it creates. Some of those deletes failed with permission denied errors:
+rm: cannot remove '/mnt/test/fsstress/p0/d64/d29e/d787/da0a/db79/db99/leae': Permission denied
+rm: cannot remove '/mnt/test/fsstress/p0/d64/d29e/d787/da0a/db79/db99/c297': Permission denied
+rm: cannot remove '/mnt/test/fsstress/p0/d64/d802/f976': Permission denied
+rm: cannot remove '/mnt/test/fsstress/p0/d64/f7d8': Permission denied
+rm: cannot remove '/mnt/test/fsstress/p0/d64/f3b5': Permission denied
...and I get the same error when I try to stat the dentries:
statx(AT_FDCWD, "c297", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, 0x7ffce516e6f0) = -1 EACCES (Permission denied)
Turning up debugging on the client shows the MDS rejecting the delete request with EACCES:
[11770.089390] ceph: do_request 000000008d72dbb7 done, result -13
...remounting the subvolume doesn't help, but if I unmount, reboot the entire cluster (thus restarting all of the MDS's), then I can get to them again:
statx(AT_FDCWD, "/mnt/test/fsstress/p0/d64/d29e/d787/da0a/db79/db99/c297", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL, stx_attributes=0, stx_mode=S_IFCHR|0444, stx_size=0, ...}) = 0
Updated by Jeff Layton over 2 years ago
The MDS I'm testing against has the latest fscrypt patchset based on top of this commit from ceph mainline:
9e8b2d8282e9 Merge pull request #44510 from rzarzynski/wip-cephadm-docfix
I doubt the fscrypt changes are the issue, but I can try testing a different cluster version if need be. I can collect debug logs, but there are 3 MDSs for this fs and fsstress generates a heavy workload. I might need some help targeting what you need.
Updated by Greg Farnum over 2 years ago
- Category set to Security Model
- Component(FS) MDS added
This issue should be pretty apparent from MDS logs where we can trace out the incoming client requests.
Jeff, I think you said in irc that this is run against a multimds setup? That may turn out to be important since I bet it's an issue with the path restriction and subtrees.
Updated by Jeff Layton over 2 years ago
Yes. I have max_mds set to 3, and I have 4 per fs (with 1 standby).
The good news is that this seems to be pretty easy to reproduce so far.
Updated by Dan van der Ster over 2 years ago
FWIW none of our users have noticed this -- all have path/namespace restricted caps. (octopus multi-mds, pacific single mds; and centos + fedora atomic clients)
Updated by Greg Farnum over 2 years ago
Jeff Layton wrote:
Yes. I have max_mds set to 3, and I have 4 per fs (with 1 standby).
The good news is that this seems to be pretty easy to reproduce so far.
Can you do so with "debug mds = 20" set and then post the logs, along with whatever failures you're getting on the client side so we know what files and requests to trace? :)
Updated by Jeff Layton over 2 years ago
$ cat /home/jlayton/git/xfstests-dev/results//generic/070.out.bad
QA output created by 070
rm: cannot remove '/mnt/test/fsstress/p0/dba/d35e/d7bc/d8fd/ff19': Permission denied
rm: cannot remove '/mnt/test/fsstress/p0/dba/d5ad/d65b/d7bb/d242/f98b': Permission denied
$ stat /mnt/test/fsstress/p0/dba/d35e/d7bc/d8fd/ff19
stat: cannot statx '/mnt/test/fsstress/p0/dba/d35e/d7bc/d8fd/ff19': Permission denied
Unfortunately, the logs are too large to attach here. Where would you like them?
Updated by Jeff Layton over 2 years ago
Logfiles are in a tarball at:
ceph-post-file: 20043459-1412-4880-8844-8f575228d1fb
Updated by Venky Shankar about 2 years ago
- Assignee set to Venky Shankar
- Target version set to v18.0.0
- Backport set to quincy, pacific
Updated by Venky Shankar about 2 years ago
Jeff,
For this unlink request
rm: cannot remove '/mnt/test/fsstress/p0/dba/d35e/d7bc/d8fd/ff19': Permission denied
there is no corresponding mds unlink request or a path lookup failure in the MDS. "ff19" is a hardlink which has a corresponding mds request:
Jan 31 09:07:43 cephadm1 ceph-mds[4168]: mds.0.server handle_client_request client_request(client.234387:1752 link #0x100000402a0/ff19 #0x1000004028c/fef4 2022-01-31T14:07:43.911857+0000 caller_uid=0, caller_gid=0{0,}) v4
and its reply:
Jan 31 09:07:43 cephadm1 ceph-mds[4168]: mds.0.server reply_client_request 0 ((0) Success) client_request(client.234387:1752 link #0x100000402a0/ff19 #0x1000004028c/fef4 2022-01-31T14:07:43.911857+0000 caller_uid=0, caller_gid=0{0,}) v4
But, I cannot see the unlink request coming in or a failure with -EACCES for the path in question. Failures with -EACCES are these:
mds.test.cephadm1.tsiqgp:Jan 31 09:03:26 cephadm1 ceph-mds[4117]: mds.2.server reply_client_request -13 ((13) Permission denied) client_request(client.234333:5345 lookupino #0x30000008465 2022-01-31T14:03:26.757258+0000 caller_uid=0, caller_gid=0{0,}) v4
mds.test.cephadm1.tsiqgp:Jan 31 09:03:26 cephadm1 ceph-mds[4117]: mds.2.1168 send_message_client client.234333 192.168.1.209:0/3132864089 client_reply(???:5345 = -13 (13) Permission denied) v1
mds.test.cephadm1.tsiqgp:Jan 31 09:03:35 cephadm1 ceph-mds[4117]: mds.2.server reply_client_request -13 ((13) Permission denied) client_request(client.234354:21 lookupino #0x30000008465 2022-01-31T14:03:35.239479+0000 caller_uid=0, caller_gid=0{0,}) v4
mds.test.cephadm1.tsiqgp:Jan 31 09:07:24 cephadm1 ceph-mds[4117]: mds.2.server reply_client_request -13 ((13) Permission denied) client_request(client.234387:1497 lookupino #0x30000008465 2022-01-31T14:07:24.478378+0000 caller_uid=0, caller_gid=0{0,}) v4
mds.test.cephadm1.tsiqgp:Jan 31 09:07:24 cephadm1 ceph-mds[4117]: mds.2.1168 send_message_client client.234387 192.168.1.209:0/4024835920 client_reply(???:1497 = -13 (13) Permission denied) v1
mds.test.cephadm1.xrdhbp:Jan 31 09:06:15 cephadm1 ceph-mds[4168]: mds.0.server reply_client_request -13 ((13) Permission denied) client_request(client.234387:364 lookupino #0x30000008465 2022-01-31T14:06:15.864663+0000 caller_uid=0, caller_gid=0{0,}) v4
mds.test.cephadm1.xrdhbp:Jan 31 09:06:15 cephadm1 ceph-mds[4168]: mds.0.1168 send_message_client client.234387 192.168.1.209:0/4024835920 client_reply(???:364 = -13 (13) Permission denied) v1
mds.test.cephadm3.tglhaq:Jan 31 09:05:14 cephadm3 ceph-mds[2921]: mds.1.server reply_client_request -13 ((13) Permission denied) client_request(client.234387:2 lookupino #0x30000008465 2022-01-31T14:05:14.865090+0000 caller_uid=0, caller_gid=0{0,}) v4
mds.test.cephadm3.tglhaq:Jan 31 09:05:14 cephadm3 ceph-mds[2921]: mds.1.1168 send_message_client client.234387 192.168.1.209:0/4024835920 client_reply(???:2 = -13 (13) Permission denied) v1
But they are well before the link request, and the inode number points to `/volumes/_nogroup/centos8/`.
Also, the below failure:
[11770.089390] ceph: do_request 000000008d72dbb7 done, result -13
is a synchronous dirop request, and one of the places where -EACCES is returned in the kclient is (mds_client.c):
if (session->s_state == CEPH_MDS_SESSION_REJECTED) {
	if (ceph_test_mount_opt(mdsc->fsc, CLEANRECOVER))
		list_add(&req->r_wait, &mdsc->waiting_for_map);
	else
		err = -EACCES;
	goto out_session;
}
But I don't think this client session was blocklisted since I cannot find the corresponding log entry in the mds log. So, I wonder where -EACCES is coming from.
Also, I could not reproduce it running generic/070. Would it be possible to share kernel buffer logs and the mds logs if you could reproduce it again? AFAICT, this requires dir metadata to be replicated. I'll try reducing `mds_bal_replicate_threshold` and see if I run into it.
Updated by Jeff Layton about 2 years ago
Venky Shankar wrote:
But, I cannot see the unlink request coming in or a failure with -EACCES for the path in question. Failures with -EACCES are these:
[...]
But they are well before the link request and the inode number points to `/volumes/_nogroup/centos8/`
It's interesting that they are all "lookupino" requests. We only really do that in the client from NFS export handling code (and also maybe from quotarealm handling). I wonder if fsstress might be doing some name_to_handle_at/open_by_handle_at calls that tickle this bug?
In any case, do path-restricted caps prevent the client from doing a lookupino? The client isn't providing a path in that case, so to properly enforce the caps you'd need to do subtree checking, which basically means ensuring that you can reach one of the inode's dentries from a path to which you have access. That's extremely expensive in an NFS server, and I imagine it'd be similar here.
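To make the enforcement question concrete, here is a minimal sketch (with invented names, not the actual MDSAuthCaps code) of the path-prefix check that a path-restricted cap implies. The point is that a lookupino request supplies no path at all, so there is nothing for a check like this to run against until the inode is first resolved back to a path:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of a path-restricted cap: access is granted only to
// paths at or under one of the allowed prefixes. A lookupino carries only
// an inode number, so this check cannot run without a path resolution step.
static bool path_allowed(const std::string &path,
                         const std::vector<std::string> &allowed_prefixes) {
    for (const auto &prefix : allowed_prefixes) {
        // Prefix must match and must end at a component boundary.
        if (path.compare(0, prefix.size(), prefix) == 0 &&
            (path.size() == prefix.size() || path[prefix.size()] == '/'))
            return true;
    }
    return false;
}
```

Under this model, a caller restricted to `/volumes/_nogroup/client1` passes for anything under that subtree and fails for `/` or sibling trees, which matches the -EACCES replies seen in the logs above.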
Updated by Jeff Layton about 2 years ago
Venky Shankar wrote:
Also, I could not reproduce it running generic/070. Would it be possible to share kernel buffer logs and the mds logs if you could reproduce it again? AFAICT, this requires a dir metadata to be replicated. I'll try reducing `mds_bal_replicate_threshold` and see if see if I run into it.
generic/013 reproduces it pretty much every time for me. I didn't do anything special to set up dir replication. I have max_mds set to 3 for both test and scratch filesystems. This is the script I use to blow away and recreate the subvols that I use for testing:
#!/bin/bash
umount -t ceph -a
ceph fs subvolume rm test ${HOSTNAME}
ceph fs subvolume rm scratch ${HOSTNAME}
ceph fs subvolume create test ${HOSTNAME} --namespace-isolated
ceph fs subvolume create scratch ${HOSTNAME} --namespace-isolated
ceph fs subvolume authorize test ${HOSTNAME} ${HOSTNAME} --allow-existing-id
ceph fs subvolume authorize scratch ${HOSTNAME} ${HOSTNAME} --allow-existing-id
ceph fs subvolume info test ${HOSTNAME} | jq -r '.path'
ceph fs subvolume info scratch ${HOSTNAME} | jq -r '.path'
Updated by Jeff Layton about 2 years ago
I reproduced this morning again and gathered the mds logs:
ceph-post-file: bf0318cc-3e34-4d61-8895-03dfdae86c25
Updated by Jeff Layton about 2 years ago
FWIW, I also turned up kernel debug logs and did this:
# stat /mnt/test/fsstress.1621.1/p0/d0XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d15XXXXXXX/f5dXXXXXXX
stat: cannot statx '/mnt/test/fsstress.1621.1/p0/d0XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d15XXXXXXX/f5dXXXXXXX': Permission denied
...that produced:
[ 651.823615] ceph: prepare_send_request 000000003409d9cc tid 6 lookup (attempt 1)
[ 651.823618] ceph: dentry 000000004096f2c2 10000037709/f5dXXXXXXX
[ 651.823622] ceph: r_parent = 000000004b91e0cd
[ 651.823627] ceph: do_request waiting
[ 651.824206] ceph: handle_reply 000000003409d9cc
[ 651.824212] ceph: __unregister_request 000000003409d9cc tid 6
[ 651.824214] ceph: handle_reply tid 6 result -13
[ 651.824217] ceph: fill_trace 000000003409d9cc is_dentry 0 is_target 0
[ 651.824220] ceph: fill_trace reply is empty!
[ 651.824221] ceph: unreserve caps ctx=00000000f49b2520 count=2
[ 651.824224] ceph: __ceph_unreserve_caps: caps 7 = 5 used + 0 resv + 2 avail
[ 651.824249] ceph: do_request waited, got 0
[ 651.824253] ceph: do_request 000000003409d9cc done, result -13
...so we sent a LOOKUP to the MDS (I think for the last component since I can stat the parent) and got back -EACCES.
Updated by Jeff Layton about 2 years ago
I see this in the current batch of logs (mds.test.cephadm1.xlapqu.log):
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker rdlock_start on (isnap sync) on [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=0 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker got rdlock on (isnap sync r=1) [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=1 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.cache path_traverse finish on snapid head
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.server ref is [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=1 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker acquire_locks request(client.284338:41 nref=3 cr=0x562a0b6d62c0)
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker must rdlock (iauth sync) [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iversion lock) | ptrwaiter=0 request=1 lock=1 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker must rdlock (ixattr sync) [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iversion lock) | ptrwaiter=0 request=1 lock=1 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker rdlock_start on (iauth sync) on [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iversion lock) | ptrwaiter=0 request=1 lock=1 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker got rdlock on (iauth sync r=1) [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iauth sync r=1) (iversion lock) | ptrwaiter=0 request=1 lock=2 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker rdlock_start on (ixattr sync) on [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iauth sync r=1) (iversion lock) | ptrwaiter=0 request=1 lock=2 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker got rdlock on (ixattr sync r=1) [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iauth sync r=1) (ixattr sync r=1) (iversion lock) | ptrwaiter=0 request=1 lock=3 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: Session check_access stray_prior_path
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: MDSAuthCap is_capable inode(path / owner 0:0 mode 0100666) by caller 4447:4447 mask 1 new 0:0 cap: MDSAuthCaps[allow rw path="/volumes/_nogroup/client1/ef0b8bdf-0e06-4c74-b523-e7e1cd6d61d8", allow rw path="/volumes/_nogroup/client1/8d9eebe8-c5bd-4d53-968b-25d1705f0d9a", allow rw path="/volumes/_nogroup/client1/81fd4273-03d1-43a9-9b9c-a3a1cd71626e", allow rw path="/volumes/_nogroup/client1/1a98f09f-c1fc-4b7d-bed9-c673b431bba1", allow rw path="/volumes/_nogroup/client1/f6589acf-b134-4b75-a222-bbc7197457f8", allow rw path="/volumes/_nogroup/client1/f4391bf8-865b-42c9-8541-103a6be949b0", allow rw path="/volumes/_nogroup/client1/45865a6d-7258-4184-9796-c922e787ebf2", allow rw path="/volumes/_nogroup/client1/5fe7c032-0b15-4b18-b634-2abab6b6ffc1", allow rw path="/volumes/_nogroup/client1/e994684e-a2bc-4270-8de1-7e461a77952a", allow rw path="/volumes/_nogroup/client1/7d04a246-bd3e-432d-b4ad-324fa31b4bfd", allow rw path="/volumes/_nogroup/client1/b071380c-d603-4008-b529-c132e53739ea", allow rw path="/volumes/_nogroup/client1/0650926a-4d4d-4332-9931-469a18c1b41b", allow rw path="/volumes/_nogroup/client1/670a813d-fcba-4fdf-b94f-ad154aba6833", allow rw path="/volumes/_nogroup/client1/38030a6e-b7d7-4613-9572-c919e7788641", allow rw path="/volumes/_nogroup/client1/79beb7ea-0713-42c4-909a-a55b7a4daf45", allow rw path="/volumes/_nogroup/client1/37e55b93-b98c-4ae6-a0f2-824b9c076a48", allow rw path="/volumes/_nogroup/client1/6658fab0-f31b-401e-9f8a-911a6e1f08af", allow rw path="/volumes/_nogroup/client1/6f1acc15-566e-4a0b-b494-c7e9b0d0b8a5", allow rw path="/volumes/_nogroup/client1/fe9abf2c-a7e2-4cf7-96e7-6ef207cc2f23", allow rw path="/volumes/_nogroup/client1/3b115ab7-4cc9-4bce-b3b9-9d98f9c3ca67", allow rw path="/volumes/_nogroup/client1/e5349eca-a30b-4345-a028-c2f8031b8dfe", allow rw path="/volumes/_nogroup/client1/44aaceca-faf7-4dd1-ad1c-c4caa874e0c0"]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.server respond_to_request batch head request(client.284338:41 nref=3 cr=0x562a0b6d62c0)
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: respond: responding to batch ops with result=-13: [batch front=request(client.284338:41 nref=3 cr=0x562a0b6d62c0)]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.server reply_client_request -13 ((13) Permission denied) client_request(client.284338:41 lookup #0x10000037709/f1fX 2022-02-22T13:50:38.539147+0000 caller_uid=4447, caller_gid=4447{10,4447,}) v4
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.server apply_allocated_inos 0x0 / [] / 0x0
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.server lat 0.000357
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.774 send_message_client client.284338 192.168.1.77:0/2833116500 client_reply(???:41 = -13 (13) Permission denied) v1
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.cache request_finish request(client.284338:41 nref=3 cr=0x562a0b6d62c0)
...so one interesting thing is that when you remove a subvolume, it apparently doesn't clean up caps associated with that path. That's possibly ugly after a long uptime. I'll plan to open a new bug for that.
Beyond that though, I suspect that MDSAuthCaps::is_capable is returning false and that makes Session::check_access return -EACCES. The path in this case is "/" though, so that return seems appropriate. Note too the "stray_prior_path" message in there, which comes from here. So the diri in this case is_stray:
if (!in->is_base())
	diri = in->get_projected_parent_dn()->get_dir()->get_inode();
if (diri && diri->is_stray()) {
	path = in->get_projected_inode()->stray_prior_path;
	dout(20) << __func__ << " stray_prior_path " << path << dendl;
} else {
	in->make_path_string(path, true);
	dout(20) << __func__ << " path " << path << dendl;
}
One thing I've come to understand recently is that the client often has to access inodes that are in the stray dirs. Perhaps path-restricted caps are preventing that?
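As a hedged illustration of the path-selection logic quoted above (a toy model with invented names, not the real Session::check_access or MDSAuthCaps code): for an inode whose parent is a stray directory, the access check runs against the pre-unlink path in stray_prior_path, so if that field were ever empty, no path-restricted cap could match:

```cpp
#include <cassert>
#include <string>

// Toy inode: either live (normal path) or sitting in a stray directory.
struct InodeModel {
    bool parent_is_stray;
    std::string path;              // current path (a stray path if unlinked)
    std::string stray_prior_path;  // pre-unlink path; may be empty
};

// Pick the path the capability check will see, mirroring the snippet above.
static std::string access_check_path(const InodeModel &in) {
    return in.parent_is_stray ? in.stray_prior_path : in.path;
}

// A path-restricted cap grants access only under this prefix; an empty
// candidate path can never match, so the check degenerates to a denial.
static bool check_access(const InodeModel &in, const std::string &allowed) {
    const std::string p = access_check_path(in);
    return !p.empty() && p.rfind(allowed, 0) == 0;
}
```

In this model a stray inode that recorded its prior path still passes, while one with an empty stray_prior_path is always denied, which matches the -EACCES behavior observed in the thread.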
Updated by Venky Shankar about 2 years ago
Jeff Layton wrote:
I see this in the current batch of logs (mds.test.cephadm1.xlapqu.log):
[...]
...so one interesting thing is that when you remove a subvolume, it apparently doesn't clean up caps associated with that path. That's possibly ugly after a long uptime. I'll plan to open a new bug for that.
Beyond that though, I suspect that MDSAuthCaps::is_capable is returning false and that makes Session::check_access return -EACCES. The path in this case is "/" though, so that return seems appropriate. Note too the "stray_prior_path" message in there, which comes from here. So the diri in this case is_stray:
[...]
One thing I've come to understand recently is that the client often has to access inodes that are in the stray dirs. Perhaps path-restricted caps are preventing that?

`stray_prior_path` looks empty, which is causing this bug. It should hold the path prior to the unlink.
I'll post a fix.
Updated by Venky Shankar about 2 years ago
Hitting this bug involves having hardlinks to inodes which are authoritative on another active mds. When a non-primary hardlink inode is unlinked, the mds for this (non-primary) inode sends a UNLINKPREP message to the authoritative mds of the primary inode. Unlike a local unlink (where the auth mds would not be a different mds), this path does not set the ->stray_prior_path field in the inode, leaving it empty.
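The asymmetry described here can be sketched as a toy model (illustrative names only, not the actual Ceph code paths): a local unlink records the pre-unlink path, while the replica-side handling driven by the cross-MDS message does not, leaving its copy of the field empty:

```cpp
#include <cassert>
#include <string>

// Minimal model of the two unlink code paths discussed above.
struct ModelInode {
    std::string path;
    std::string stray_prior_path;
};

// Local unlink on the auth MDS: remembers the old path before moving
// the inode under a stray directory.
static void local_unlink(ModelInode &in, const std::string &stray_path) {
    in.stray_prior_path = in.path;
    in.path = stray_path;
}

// Replica handling of a dentry_unlink message: only the path is
// rewritten; stray_prior_path is never populated (the bug).
static void replica_handle_dentry_unlink(ModelInode &in,
                                         const std::string &stray_path) {
    in.path = stray_path;
}
```

A later access check on the replica's copy then has only an empty stray_prior_path to work with.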
There is something else going on in the mds that is a bit puzzling and that I have no explanation for at the moment; I'm still figuring that out.
Will update when I get to the root of it.
Updated by Venky Shankar about 2 years ago
OK - super easy to reproduce with some directory pinning. Steps:
(requires a ceph user with path restricted caps)
- mkdir d0 d1
- touch d0/foo
- ln d0/foo d1/bar
- ln d0/foo d1/baz
- setfattr -n ceph.dir.pin -v 0 d0
- setfattr -n ceph.dir.pin -v 1 d1
First, remove the (first) remote link:
- rm -f d1/bar
at this point the state of the inode in rank1 is (`dump inode'):
"path": "/volumes/_nogroup/subzero/e3dad093-e653-4750-b6c0-4ea09a38093b/d1/bar" ... ... "stray_prior_path": "",
Now, remove the primary link and the (other) remote link:
rm -f /mnt/cephfs/d0/foo && rm -f /mnt/cephfs/d1/baz
rm: cannot remove '/mnt/cephfs/d1/baz': Permission denied
rm -f /mnt/cephfs/d1/baz
rm: cannot remove '/mnt/cephfs/d1/baz': Permission denied
stat /mnt/cephfs/d1/baz
stat: cannot stat '/mnt/cephfs/d1/baz': Permission denied
State of the inode in rank1:
"path": "~mds0/stray0/1000000001c", ... ... "stray_prior_path": "",
A stray inode with an empty `stray_prior_path' - SessionMap::check_access() does not like that.
Unlinking the remote link (first time) does not store `stray_prior_path' (that's fine). Then unlinking the primary link results in the mds (auth of the primary link) sending a `dentry_unlink' message to its replicas. This prompts the replica MDSs to update the inode path with the stray directory path (of the sender mds); however, `stray_prior_path' still remains empty.
As far as the fix is concerned, once an inode becomes a stray, `stray_prior_path' should point to its path before the unlink.
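The fix idea can be sketched as follows (a hedged model with illustrative names, not the actual patch): whenever an inode is linked under a stray directory, whether on the auth mds or on a replica handling the unlink message, capture the pre-unlink path before overwriting the current one:

```cpp
#include <cassert>
#include <string>

// Toy inode state for the sketch.
struct StrayModel {
    std::string path;
    std::string stray_prior_path;
};

// Called at the point the inode is linked under a stray dir, on either
// the auth MDS or a replica: record the old location exactly once.
static void link_into_stray(StrayModel &in, const std::string &stray_path) {
    if (in.stray_prior_path.empty())
        in.stray_prior_path = in.path;
    in.path = stray_path;
}
```

With this invariant in place, the stray inode always carries a non-empty pre-unlink path for the access check to match against the client's path-restricted caps.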
Updated by Venky Shankar about 2 years ago
The issue happens when the auth mds for the primary hardlink sends a `dentry_unlink' message to replica MDSs and the replica MDSs use this inode to resolve a lookup. Here is the sequence of events/messages: primary dentry 0x10000000018/f0000 is auth on mds.c and dentry 0x10000000019/ba2222 is auth on mds.b.
(mds.c) `_unlink_local' called for 0x10000000018/f0000:
2022-03-09T01:16:23.459-0500 7f6df3df2700 10 mds.0.server _unlink_local [dentry #0x1/volumes/_nogroup/subzero/e3dad093-e653-4750-b6c0-4ea09a38093b/d0/f0000 [2,head] auth{1=1} (dn xlock x=1 by 0x55626f4e4000) (dversion lock w=1 last_client=214131) v=147 ap=2 ino=0x10000000022 state=1610612736 | request=1 lock=2 inodepin=1 replicated=1 dirty=1 waiter=0 authpin=1 0x55626f340000]
(mds.c) `send_dentry_unlink' send `dentry_unlink' message to replica MDS (mds.b) and replies back to client:
2022-03-09T01:16:27.219-0500 7f6dedde6700 10 mds.0.cache send_dentry_unlink [dentry #0x1/volumes/_nogroup/subzero/e3dad093-e653-4750-b6c0-4ea09a38093b/d0/f0000 [2,head] auth{1=1} NULL (dn xlockdone l x=1) (dversion lock w=1 last_client=214131) v=149 ap=2 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 replicated=1 dirty=1 waiter=0 authpin=1 clientlease=1 0x55626f340000]
2022-03-09T01:16:27.219-0500 7f6dedde6700 20 mds.0.cache.ino(0x100) encode_snap_blob snaprealm(0x100 seq 1 lc 0 cr 0 cps 1 snaps={} past_parent_snaps= 0x55626a5a1400)
2022-03-09T01:16:27.219-0500 7f6dedde6700  1 -- [v2:192.168.0.5:6830/2802097374,v1:192.168.0.5:6831/2802097374] send_to--> mds [v2:192.168.0.5:6828/1587915794,v1:192.168.0.5:6829/1587915794] -- dentry_unlink(0x10000000018 f0000) v1 -- ?+0 0x55626f442e00
2022-03-09T01:16:27.219-0500 7f6dedde6700  1 -- [v2:192.168.0.5:6830/2802097374,v1:192.168.0.5:6831/2802097374] --> [v2:192.168.0.5:6828/1587915794,v1:192.168.0.5:6829/1587915794] -- dentry_unlink(0x10000000018 f0000) v1 -- 0x55626f442e00 con 0x55626f2e6800
2022-03-09T01:16:27.219-0500 7f6dedde6700 20 mds.0.bal hit_dir 1 pop is 1.0741, frag * size 0 [pop IRD:[C 5.93e-01] IWR:[C 1.07e+00] RDR:[C 0.00e+00] FET:[C 0.00e+00] STR:[C 0.00e+00] *LOAD:2.7]
2022-03-09T01:16:27.219-0500 7f6dedde6700  7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.214131:11 unlink #0x10000000018/f0000 2022-03-09T01:16:23.452163-0500 caller_uid=0, caller_gid=0{0,}) v4
(mds.b) lookup request for 0x10000000019/ba2222:
2022-03-09T01:16:23.471-0500 7fd915a11700  1 -- [v2:192.168.0.5:6828/1587915794,v1:192.168.0.5:6829/1587915794] <== client.214131 192.168.0.3:0/2094212419 19 ==== client_request(client.214131:12 lookup #0x10000000019/ba2222 2022-03-09T01:16:23.464163-0500 caller_uid=0, caller_gid=0{0,}) v4 ==== 150+0+0 (crc 0 0 0) 0x56210015d600 con 0x562100010400
(mds.b) MDCache::path_traverse() finds the inode in its cache, looked up from the first unlink operation. This inode is not a stray inode, so stray reintegration does not occur:
2022-03-09T01:16:23.471-0500 7fd915a11700 12 mds.1.cache traverse: path seg depth 0 'ba2222' snapid head
2022-03-09T01:16:23.471-0500 7fd915a11700 20 mds.1.cache.dir(0x10000000019) lookup (ba2222, 'head')
2022-03-09T01:16:23.471-0500 7fd915a11700 20 mds.1.cache.dir(0x10000000019) hit -> (ba2222,head)
2022-03-09T01:16:23.471-0500 7fd915a11700  7 mds.1.cache linking in remote in [inode 0x10000000022 [...2,head] /volumes/_nogroup/subzero/e3dad093-e653-4750-b6c0-4ea09a38093b/d0/f0000 rep@0.1 v145 s=0 nl=2 n(v0 rc2022-03-09T01:13:41.938796-0500 1=1+0) (ilink lock) (iversion lock) 0x562100014c00]
2022-03-09T01:16:23.471-0500 7fd915a11700 10 mds.1.cache.strays eval_remote [dentry #0x1/volumes/_nogroup/subzero/e3dad093-e653-4750-b6c0-4ea09a38093b/d1/ba2222 [2,head] auth REMOTE(reg) (dversion lock) v=37237 ino=0x10000000022 state=1610612736 | inodepin=0 dirty=1 0x5620ffff5680]
2022-03-09T01:16:23.471-0500 7fd915a11700 20 mds.1.cache.strays eval_remote: inode's primary dn not stray
2022-03-09T01:16:23.471-0500 7fd915a11700 10 mds.1.cache path_traverse finish on snapid head
mds.b sends various messages to mds.c, causing handle_client_getattr to get called multiple times.
(mds.b) `dentry_unlink' message is received which links the stray inode in the dentry:
2022-03-09T01:16:27.219-0500 7fd915a11700  7 mds.1.cache handle_dentry_unlink on [dentry #0x1/volumes/_nogroup/subzero/e3dad093-e653-4750-b6c0-4ea09a38093b/d0/f0000 [2,head] rep@0.1 (dn lock) (dversion lock) v=0 ino=0x10000000022 state=0 | inodepin=1 0x56210009ef00]
2022-03-09T01:16:27.219-0500 7fd915a11700 12 mds.1.cache.dir(0x10000000018) unlink_inode [dentry #0x1/volumes/_nogroup/subzero/e3dad093-e653-4750-b6c0-4ea09a38093b/d0/f0000 [2,head] rep@0.1 (dn lock) (dversion lock) v=0 ino=0x10000000022 state=0 | inodepin=1 0x56210009ef00] [inode 0x10000000022 [...2,head] /volumes/_nogroup/subzero/e3dad093-e653-4750-b6c0-4ea09a38093b/d0/f0000 rep@0.1 v145 s=0 nl=2 n(v0 rc2022-03-09T01:13:41.938796-0500 1=1+0) (ilink lock) (iversion lock) | request=1 lock=0 caps=0 remoteparent=1 waiter=1 0x562100014c00]
2022-03-09T01:16:27.219-0500 7fd915a11700 12 mds.1.cache.dir(0x600) link_primary_inode [dentry #0x100/stray0/10000000022 [2,head] rep@0.1 NULL (dn lock) (dversion lock) v=0 ino=(nil) state=64|bottomlru 0x56210009fb80] [inode 0x10000000022 [...2,head] #10000000022 rep@-2.1 v145 s=0 nl=2 n(v0 rc2022-03-09T01:13:41.938796-0500 1=1+0) (ilink lock) (iversion lock) | request=1 lock=0 caps=0 remoteparent=1 waiter=1 0x562100014c00]
The inode gets linked with an empty `stray_prior_path'. The auth mds should have encoded this (inode_t::encode), so there is a bug somewhere...
Updated by Venky Shankar about 2 years ago
- Status changed from New to Pending Backport
Updated by Backport Bot about 2 years ago
- Copied to Backport #55427: pacific: unaccessible dentries after fsstress run with namespace-restricted caps added
Updated by Backport Bot about 2 years ago
- Copied to Backport #55428: quincy: unaccessible dentries after fsstress run with namespace-restricted caps added
Updated by Konstantin Shalygin over 1 year ago
- Status changed from Pending Backport to Resolved
- % Done changed from 0 to 100
- Tags deleted (backport_processed)