Project

General

Profile

Bug #54046

unaccessible dentries after fsstress run with namespace-restricted caps

Added by Jeff Layton 10 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Security Model
Target version:
% Done:

100%

Source:
Tags:
Backport:
quincy, pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I did an xfstests run in a subvolume with namespace restricted caps. Test generic/070 does an fsstress run, and then tries to delete the tree that it creates. Some of those deletes failed with permission denied errors:

    +rm: cannot remove '/mnt/test/fsstress/p0/d64/d29e/d787/da0a/db79/db99/leae': Permission denied
    +rm: cannot remove '/mnt/test/fsstress/p0/d64/d29e/d787/da0a/db79/db99/c297': Permission denied
    +rm: cannot remove '/mnt/test/fsstress/p0/d64/d802/f976': Permission denied
    +rm: cannot remove '/mnt/test/fsstress/p0/d64/f7d8': Permission denied
    +rm: cannot remove '/mnt/test/fsstress/p0/d64/f3b5': Permission denied

...and I get the same error when I try to stat the dentries:

statx(AT_FDCWD, "c297", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, 0x7ffce516e6f0) = -1 EACCES (Permission denied)

Turning up debugging on the client shows the MDS rejecting the delete request with EACCES:

[11770.089390] ceph:  do_request 000000008d72dbb7 done, result -13

...remounting the subvolume doesn't help, but if I unmount, reboot the entire cluster (thus restarting all of the MDS's), then I can get to them again:

statx(AT_FDCWD, "/mnt/test/fsstress/p0/d64/d29e/d787/da0a/db79/db99/c297", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL, stx_attributes=0, stx_mode=S_IFCHR|0444, stx_size=0, ...}) = 0

Related issues

Copied to CephFS - Backport #55427: pacific: unaccessible dentries after fsstress run with namespace-restricted caps Resolved
Copied to CephFS - Backport #55428: quincy: unaccessible dentries after fsstress run with namespace-restricted caps Resolved

History

#1 Updated by Jeff Layton 10 months ago

  • Description updated (diff)

#2 Updated by Jeff Layton 10 months ago

The MDS I'm testing against has the latest fscrypt patchset based on top of this commit from ceph mainline:

9e8b2d8282e9 Merge pull request #44510 from rzarzynski/wip-cephadm-docfix

I doubt the fscrypt changes are the issue, but I can try testing a different cluster version if need be. I can collect debug logs, but there are 3 MDSs for this fs and fsstress generates a heavy workload. I might need some help targeting what you need.

#3 Updated by Greg Farnum 10 months ago

  • Category set to Security Model
  • Component(FS) MDS added

This issue should be pretty apparent from MDS logs where we can trace out the incoming client requests.

Jeff, I think you said in irc that this is run against a multimds setup? That may turn out to be important since I bet it's an issue with the path restriction and subtrees.

#4 Updated by Jeff Layton 10 months ago

Yes. I have max_mds set to 3, and I have 4 per fs (with 1 standby).

The good news is that this seems to be pretty easy to reproduce so far.

#5 Updated by Dan van der Ster 10 months ago

FWIW none of our users have noticed this -- all have path/namespace restricted caps. (octopus multi-mds, pacific single mds; and centos + fedora atomic clients)

#6 Updated by Greg Farnum 10 months ago

Jeff Layton wrote:

Yes. I have max_mds set to 3, and I have 4 per fs (with 1 standby).

The good news is that this seems to be pretty easy to reproduce so far.

Can you do so with "debug mds = 20" set and then post the logs, along with whatever failures you're getting on the client side so we know what files and requests to trace? :)

#7 Updated by Jeff Layton 10 months ago

$ cat /home/jlayton/git/xfstests-dev/results//generic/070.out.bad
QA output created by 070
rm: cannot remove '/mnt/test/fsstress/p0/dba/d35e/d7bc/d8fd/ff19': Permission denied
rm: cannot remove '/mnt/test/fsstress/p0/dba/d5ad/d65b/d7bb/d242/f98b': Permission denied
$ stat /mnt/test/fsstress/p0/dba/d35e/d7bc/d8fd/ff19
stat: cannot statx '/mnt/test/fsstress/p0/dba/d35e/d7bc/d8fd/ff19': Permission denied

Unfortunately, the logs are too large to attach here. Where would you like them?

#8 Updated by Jeff Layton 10 months ago

Logfiles are in a tarball at:

ceph-post-file: 20043459-1412-4880-8844-8f575228d1fb

#9 Updated by Venky Shankar 9 months ago

  • Assignee set to Venky Shankar
  • Target version set to v18.0.0
  • Backport set to quincy, pacific

#10 Updated by Venky Shankar 9 months ago

Jeff,

For this unlink request

rm: cannot remove '/mnt/test/fsstress/p0/dba/d35e/d7bc/d8fd/ff19': Permission denied

there is no corresponding mds unlink request or a path lookup failure in the MDS. "ff19" is a hardlink which has a corresponding mds request:

Jan 31 09:07:43 cephadm1 ceph-mds[4168]: mds.0.server handle_client_request client_request(client.234387:1752 link #0x100000402a0/ff19 #0x1000004028c/fef4 2022-01-31T14:07:43.911857+0000 caller_uid=0, caller_gid=0{0,}) v4

and its reply:

Jan 31 09:07:43 cephadm1 ceph-mds[4168]: mds.0.server reply_client_request 0 ((0) Success) client_request(client.234387:1752 link #0x100000402a0/ff19 #0x1000004028c/fef4 2022-01-31T14:07:43.911857+0000 caller_uid=0, caller_gid=0{0,}) v4

But, I cannot see the unlink request coming in or a failure with -EACCES for the path in question. Failures with -EACCES are these:

mds.test.cephadm1.tsiqgp:Jan 31 09:03:26 cephadm1 ceph-mds[4117]: mds.2.server reply_client_request -13 ((13) Permission denied) client_request(client.234333:5345 lookupino #0x30000008465 2022-01-31T14:03:26.757258+0000 caller_uid=0, caller_gid=0{0,}) v4
mds.test.cephadm1.tsiqgp:Jan 31 09:03:26 cephadm1 ceph-mds[4117]: mds.2.1168 send_message_client client.234333 192.168.1.209:0/3132864089 client_reply(???:5345 = -13 (13) Permission denied) v1
mds.test.cephadm1.tsiqgp:Jan 31 09:03:35 cephadm1 ceph-mds[4117]: mds.2.server reply_client_request -13 ((13) Permission denied) client_request(client.234354:21 lookupino #0x30000008465 2022-01-31T14:03:35.239479+0000 caller_uid=0, caller_gid=0{0,}) v4
mds.test.cephadm1.tsiqgp:Jan 31 09:07:24 cephadm1 ceph-mds[4117]: mds.2.server reply_client_request -13 ((13) Permission denied) client_request(client.234387:1497 lookupino #0x30000008465 2022-01-31T14:07:24.478378+0000 caller_uid=0, caller_gid=0{0,}) v4
mds.test.cephadm1.tsiqgp:Jan 31 09:07:24 cephadm1 ceph-mds[4117]: mds.2.1168 send_message_client client.234387 192.168.1.209:0/4024835920 client_reply(???:1497 = -13 (13) Permission denied) v1
mds.test.cephadm1.xrdhbp:Jan 31 09:06:15 cephadm1 ceph-mds[4168]: mds.0.server reply_client_request -13 ((13) Permission denied) client_request(client.234387:364 lookupino #0x30000008465 2022-01-31T14:06:15.864663+0000 caller_uid=0, caller_gid=0{0,}) v4
mds.test.cephadm1.xrdhbp:Jan 31 09:06:15 cephadm1 ceph-mds[4168]: mds.0.1168 send_message_client client.234387 192.168.1.209:0/4024835920 client_reply(???:364 = -13 (13) Permission denied) v1
mds.test.cephadm3.tglhaq:Jan 31 09:05:14 cephadm3 ceph-mds[2921]: mds.1.server reply_client_request -13 ((13) Permission denied) client_request(client.234387:2 lookupino #0x30000008465 2022-01-31T14:05:14.865090+0000 caller_uid=0, caller_gid=0{0,}) v4
mds.test.cephadm3.tglhaq:Jan 31 09:05:14 cephadm3 ceph-mds[2921]: mds.1.1168 send_message_client client.234387 192.168.1.209:0/4024835920 client_reply(???:2 = -13 (13) Permission denied) v1

But they are well before the link request and the inode number points to `/volumes/_nogroup/centos8/`

Also, the below failure:

[11770.089390] ceph:  do_request 000000008d72dbb7 done, result -13

is a synchronous dirop request and the one of the places where -EACCES is returned in kclient is (mds_client.c):

                if (session->s_state == CEPH_MDS_SESSION_REJECTED) {
                        if (ceph_test_mount_opt(mdsc->fsc, CLEANRECOVER))
                                list_add(&req->r_wait, &mdsc->waiting_for_map);
                        else
                                err = -EACCES;
                        goto out_session;
                }

But I don't think this client session was blocklisted since I cannot find the corresponding log entry in the mds log. So, I wonder where -EACCES is coming from.

Also, I could not reproduce it running generic/070. Would it be possible to share kernel buffer logs and the mds logs if you could reproduce it again? AFAICT, this requires a dir metadata to be replicated. I'll try reducing `mds_bal_replicate_threshold` and see if see if I run into it.

#11 Updated by Jeff Layton 9 months ago

Venky Shankar wrote:

But, I cannot see the unlink request coming in or a failure with -EACCES for the path in question. Failures with -EACCES are these:

[...]

But they are well before the link request and the inode number points to `/volumes/_nogroup/centos8/`

It's interesting that they are all "lookupino" requests. We only really do that in the client from NFS export handling code (and also maybe from quotarealm handling). I wonder if fsstress might be doing some name_to_handle_at/open_by_handle_at calls that tickle this bug?

In any case, do path-restricted caps prevent the client from doing a lookupino? The client isn't providing a path in that case, so to properly enforce the caps you'd need to do subtree checking, which basically means ensuring that you can reach one of the inode's dentries from a path to which you have access. That's extremely expensive in an NFS server, and I imagine it'd be similar here.

#12 Updated by Jeff Layton 9 months ago

Venky Shankar wrote:

Also, I could not reproduce it running generic/070. Would it be possible to share kernel buffer logs and the mds logs if you could reproduce it again? AFAICT, this requires a dir metadata to be replicated. I'll try reducing `mds_bal_replicate_threshold` and see if see if I run into it.

generic/013 reproduces it pretty much every time for me. I didn't do anything special to set up dir replication. I have max_mds set to 3 for both test and scratch filesystems. This is the script I use to blow away and recreate the subvols that I use for testing:

#!/bin/bash

umount -t ceph -a

ceph fs subvolume rm test ${HOSTNAME}
ceph fs subvolume rm scratch ${HOSTNAME}

ceph fs subvolume create test ${HOSTNAME} --namespace-isolated
ceph fs subvolume create scratch ${HOSTNAME} --namespace-isolated

ceph fs subvolume authorize test ${HOSTNAME} ${HOSTNAME} --allow-existing-id
ceph fs subvolume authorize scratch ${HOSTNAME} ${HOSTNAME} --allow-existing-id

ceph fs subvolume info test ${HOSTNAME} | jq -r '.path'
ceph fs subvolume info scratch ${HOSTNAME} | jq -r '.path'

#13 Updated by Venky Shankar 9 months ago

Thx! I'll try that.

#14 Updated by Jeff Layton 9 months ago

I reproduced this morning again and gathered the mds logs:

ceph-post-file: bf0318cc-3e34-4d61-8895-03dfdae86c25

#15 Updated by Jeff Layton 9 months ago

FWIW, I also turned up kernel debug logs and did this:

# stat /mnt/test/fsstress.1621.1/p0/d0XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d15XXXXXXX/f5dXXXXXXX
stat: cannot statx '/mnt/test/fsstress.1621.1/p0/d0XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/d15XXXXXXX/f5dXXXXXXX': Permission denied

...that produced:

[  651.823615] ceph:  prepare_send_request 000000003409d9cc tid 6 lookup (attempt 1)
[  651.823618] ceph:   dentry 000000004096f2c2 10000037709/f5dXXXXXXX
[  651.823622] ceph:   r_parent = 000000004b91e0cd
[  651.823627] ceph:  do_request waiting
[  651.824206] ceph:  handle_reply 000000003409d9cc
[  651.824212] ceph:  __unregister_request 000000003409d9cc tid 6
[  651.824214] ceph:  handle_reply tid 6 result -13
[  651.824217] ceph:  fill_trace 000000003409d9cc is_dentry 0 is_target 0
[  651.824220] ceph:  fill_trace reply is empty!
[  651.824221] ceph:  unreserve caps ctx=00000000f49b2520 count=2
[  651.824224] ceph:  __ceph_unreserve_caps: caps 7 = 5 used + 0 resv + 2 avail
[  651.824249] ceph:  do_request waited, got 0
[  651.824253] ceph:  do_request 000000003409d9cc done, result -13

...so we sent a LOOKUP to the MDS (I think for the last component since I can stat the parent) and got back -EACCES.

#16 Updated by Jeff Layton 9 months ago

I see this in the current batch of logs (mds.test.cephadm1.xlapqu.log):

Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker rdlock_start  on (isnap sync) on [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=0 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker  got rdlock on (isnap sync r=1) [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=1 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.cache path_traverse finish on snapid head
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.server ref is [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=1 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker acquire_locks request(client.284338:41 nref=3 cr=0x562a0b6d62c0)
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker  must rdlock (iauth sync) [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iversion lock) | ptrwaiter=0 request=1 lock=1 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker  must rdlock (ixattr sync) [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iversion lock) | ptrwaiter=0 request=1 lock=1 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker rdlock_start  on (iauth sync) on [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iversion lock) | ptrwaiter=0 request=1 lock=1 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker  got rdlock on (iauth sync r=1) [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iauth sync r=1) (iversion lock) | ptrwaiter=0 request=1 lock=2 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker rdlock_start  on (ixattr sync) on [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iauth sync r=1) (iversion lock) | ptrwaiter=0 request=1 lock=2 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.locker  got rdlock on (ixattr sync r=1) [inode 0x10000037707 [2,head] ~mds2/stray0/10000037707 rep@2.1 v465047 s=4194304 n(v0 rc2022-02-22T13:28:49.871445+0000 b4194304 1=1+0) (iauth sync r=1) (ixattr sync r=1) (iversion lock) | ptrwaiter=0 request=1 lock=3 caps=0 remoteparent=1 exportingcaps=0 dirtyparent=0 dirty=0 authpin=0 tempexporting=0 0x562a094fbb80]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: Session check_access stray_prior_path 
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: MDSAuthCap is_capable inode(path / owner 0:0 mode 0100666) by caller 4447:4447 mask 1 new 0:0 cap: MDSAuthCaps[allow rw path="/volumes/_nogroup/client1/ef0b8bdf-0e06-4c74-b523-e7e1cd6d61d8", allow rw path="/volumes/_nogroup/client1/8d9eebe8-c5bd-4d53-968b-25d1705f0d9a", allow rw path="/volumes/_nogroup/client1/81fd4273-03d1-43a9-9b9c-a3a1cd71626e", allow rw path="/volumes/_nogroup/client1/1a98f09f-c1fc-4b7d-bed9-c673b431bba1", allow rw path="/volumes/_nogroup/client1/f6589acf-b134-4b75-a222-bbc7197457f8", allow rw path="/volumes/_nogroup/client1/f4391bf8-865b-42c9-8541-103a6be949b0", allow rw path="/volumes/_nogroup/client1/45865a6d-7258-4184-9796-c922e787ebf2", allow rw path="/volumes/_nogroup/client1/5fe7c032-0b15-4b18-b634-2abab6b6ffc1", allow rw path="/volumes/_nogroup/client1/e994684e-a2bc-4270-8de1-7e461a77952a", allow rw path="/volumes/_nogroup/client1/7d04a246-bd3e-432d-b4ad-324fa31b4bfd", allow rw path="/volumes/_nogroup/client1/b071380c-d603-4008-b529-c132e53739ea", allow rw path="/volumes/_nogroup/client1/0650926a-4d4d-4332-9931-469a18c1b41b", allow rw path="/volumes/_nogroup/client1/670a813d-fcba-4fdf-b94f-ad154aba6833", allow rw path="/volumes/_nogroup/client1/38030a6e-b7d7-4613-9572-c919e7788641", allow rw path="/volumes/_nogroup/client1/79beb7ea-0713-42c4-909a-a55b7a4daf45", allow rw path="/volumes/_nogroup/client1/37e55b93-b98c-4ae6-a0f2-824b9c076a48", allow rw path="/volumes/_nogroup/client1/6658fab0-f31b-401e-9f8a-911a6e1f08af", allow rw path="/volumes/_nogroup/client1/6f1acc15-566e-4a0b-b494-c7e9b0d0b8a5", allow rw path="/volumes/_nogroup/client1/fe9abf2c-a7e2-4cf7-96e7-6ef207cc2f23", allow rw path="/volumes/_nogroup/client1/3b115ab7-4cc9-4bce-b3b9-9d98f9c3ca67", allow rw path="/volumes/_nogroup/client1/e5349eca-a30b-4345-a028-c2f8031b8dfe", allow rw path="/volumes/_nogroup/client1/44aaceca-faf7-4dd1-ad1c-c4caa874e0c0"]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.server respond_to_request batch head request(client.284338:41 nref=3 cr=0x562a0b6d62c0)
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: respond: responding to batch ops with result=-13: [batch front=request(client.284338:41 nref=3 cr=0x562a0b6d62c0)]
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.server reply_client_request -13 ((13) Permission denied) client_request(client.284338:41 lookup #0x10000037709/f1fX 2022-02-22T13:50:38.539147+0000 caller_uid=4447, caller_gid=4447{10,4447,}) v4
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.server apply_allocated_inos 0x0 / [] / 0x0
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.server lat 0.000357
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.774 send_message_client client.284338 192.168.1.77:0/2833116500 client_reply(???:41 = -13 (13) Permission denied) v1
Feb 22 08:50:38 cephadm1 ceph-mds[4043]: mds.0.cache request_finish request(client.284338:41 nref=3 cr=0x562a0b6d62c0)

...so one interesting thing is that when you remove a subvolume, it apparently doesn't clean up caps associated with that path. That's possibly ugly after a long uptime. I'll plan to open a new bug for that.

Beyond that though, I suspect the that MDSAuthCaps::is_capable is returning false and that makes Session::check_access return -EACCES. The path in this case is "/" though, so that return seems appropriate. Note too the "stray_prior_path" message in there, which comes from here. So the diri in this case is_stray:

  if (!in->is_base())
    diri = in->get_projected_parent_dn()->get_dir()->get_inode();
  if (diri && diri->is_stray()){
    path = in->get_projected_inode()->stray_prior_path;
    dout(20) << __func__ << " stray_prior_path " << path << dendl;
  } else {
    in->make_path_string(path, true);
    dout(20) << __func__ << " path " << path << dendl;
  }

One thing I've come to understand recently is that the client often has to access inodes that are in the stray dirs. Perhaps path-restricted caps are preventing that?

#17 Updated by Venky Shankar 9 months ago

Jeff Layton wrote:

I see this in the current batch of logs (mds.test.cephadm1.xlapqu.log):

[...]

...so one interesting thing is that when you remove a subvolume, it apparently doesn't clean up caps associated with that path. That's possibly ugly after a long uptime. I'll plan to open a new bug for that.

Beyond that though, I suspect the that MDSAuthCaps::is_capable is returning false and that makes Session::check_access return -EACCES. The path in this case is "/"

`stray_prior_path` looks empty which is causing this bug. It should hold the path prior to the unlink.

I'll post a fix.

though, so that return seems appropriate. Note too the "stray_prior_path" message in there, which comes from here. So the diri in this case is_stray:

[...]

One thing I've come to understand recently is that the client often has to access inodes that are in the stray dirs. Perhaps path-restricted caps are preventing that?

#18 Updated by Venky Shankar 9 months ago

Hitting this bug involves having hardlinks to inodes which are authoritative in another active mds. When a non-primary hardlink inode is unlinked, the mds for this (non-primary) inode sends a UNLINKPREP message to the authoritative mds of the primary inode. This unlike the local unlink (where the auth mds would not be a different mds) does not set ->stray_prior_path field in the inode, leaving it empty.

There is something else on-going too in the mds which is a bit puzzling, which I have no explanation at-the-moment - I'm figuring that out currently.

Will update when get to the root of it.

#19 Updated by Venky Shankar 9 months ago

OK - super easy to reproduce with some directory pinning. steps -
(requires a ceph user with path restricted caps)

- mkdir d0 d1
- touch d0/foo
- ln d0/foo d1/bar
- ln d0/foo d1/baz
- setfattr -n ceph.dir.pin -v 0 d0
- setfattr -n ceph.dir.pin -v 1 d1

Fisrt, remove the (first) remote link:

- rm -f d1/bar

at this point the state of the inode in rank1 is (`dump inode'):

"path": "/volumes/_nogroup/subzero/e3dad093-e653-4750-b6c0-4ea09a38093b/d1/bar" 
...
...
"stray_prior_path": "",

Now, remove the primary link and the (other) remote link:

rm -f /mnt/cephfs/d0/foo && rm -f /mnt/cephfs/d1/baz
rm: cannot remove '/mnt/cephfs/d1/baz': Permission denied
rm -f /mnt/cephfs/d1/baz
rm: cannot remove '/mnt/cephfs/d1/baz': Permission denied
stat /mnt/cephfs/d1/baz
stat: cannot stat '/mnt/cephfs/d1/baz': Permission denied

State of the inode in rank1:

"path": "~mds0/stray0/1000000001c",
...
...
"stray_prior_path": "",

A stray inode with an empty `stray_prior_path' - SessionMap::check_access() does not like that.

Unlinking the remote link (first time) does not store `stray_prior_path' (that's fine). Then unlinking the primary link results in the mds (auth of primary link) to send a `dentry_unlink' message to its replicas. This kicks the inode (it replica MDSs) to update the inode path with the stray directory path (of the sender mds), however, `stray_prior_path' still remain empty.

As far as the fix in concerned, once an inode becomes a stray `stray_prior_path' should point to its path before unlink.

#20 Updated by Venky Shankar 9 months ago

The issue happens the auth mds for the primary hardlink sends a `dentry_unlink' message to replica MDSs and the replica MDSs uses this inode to resolve a lookup. Here are the sequence of events/messages - primary dentry 0x10000000018/f0000 is auth on mds.c and dentry 0x10000000019/ba2222 is auth on mds.b.

(mds.c) `_unlink_local' called for 0x10000000018/f0000:

2022-03-09T01:16:23.459-0500 7f6df3df2700 10 mds.0.server _unlink_local [dentry #0x1/volumes/_nogroup/subzero/e3dad093-e653-4750-b6c0-4ea09a38093b/d0/f0000 [2,head] auth{1=1} (dn xlock x=1 by 0x55626f4e4000) (dversion lock w=1 last_client=214131) v=147 ap=2 ino=0x10000000022 state=1610612736 | request=1 lock=2 inodepin=1 replicated=1 dirty=1 waiter=0 authpin=1 0x55626f340000]

(mds.c) `send_dentry_unlink' send `dentry_unlink' message to replica MDS (mds.b) and replies back to client:

2022-03-09T01:16:27.219-0500 7f6dedde6700 10 mds.0.cache send_dentry_unlink [dentry #0x1/volumes/_nogroup/subzero/e3dad093-e653-4750-b6c0-4ea09a38093b/d0/f0000 [2,head] auth{1=1} NULL (dn xlockdone l x=1) (dversion lock w=1 last_client=214131) v=149 ap=2 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 rep\
licated=1 dirty=1 waiter=0 authpin=1 clientlease=1 0x55626f340000]
2022-03-09T01:16:27.219-0500 7f6dedde6700 20 mds.0.cache.ino(0x100) encode_snap_blob snaprealm(0x100 seq 1 lc 0 cr 0 cps 1 snaps={} past_parent_snaps= 0x55626a5a1400)
2022-03-09T01:16:27.219-0500 7f6dedde6700  1 -- [v2:192.168.0.5:6830/2802097374,v1:192.168.0.5:6831/2802097374] send_to--> mds [v2:192.168.0.5:6828/1587915794,v1:192.168.0.5:6829/1587915794] -- dentry_unlink(0x10000000018 f0000) v1 -- ?+0 0x55626f442e00
2022-03-09T01:16:27.219-0500 7f6dedde6700  1 -- [v2:192.168.0.5:6830/2802097374,v1:192.168.0.5:6831/2802097374] --> [v2:192.168.0.5:6828/1587915794,v1:192.168.0.5:6829/1587915794] -- dentry_unlink(0x10000000018 f0000) v1 -- 0x55626f442e00 con 0x55626f2e6800
2022-03-09T01:16:27.219-0500 7f6dedde6700 20 mds.0.bal hit_dir 1 pop is 1.0741, frag * size 0 [pop IRD:[C 5.93e-01] IWR:[C 1.07e+00] RDR:[C 0.00e+00] FET:[C 0.00e+00] STR:[C 0.00e+00] *LOAD:2.7]

2022-03-09T01:16:27.219-0500 7f6dedde6700  7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.214131:11 unlink #0x10000000018/f0000 2022-03-09T01:16:23.452163-0500 caller_uid=0, caller_gid=0{0,}) v4

(mds.b) lookup request for 0x10000000019/ba2222:

2022-03-09T01:16:23.471-0500 7fd915a11700  1 -- [v2:192.168.0.5:6828/1587915794,v1:192.168.0.5:6829/1587915794] <== client.214131 192.168.0.3:0/2094212419 19 ==== client_request(client.214131:12 lookup #0x10000000019/ba2222 2022-03-09T01:16:23.464163-0500 caller_uid=0, caller_gid=0{0,}) v4 ==== 150+0+0 (crc 0 0 0)\
 0x56210015d600 con 0x562100010400

(mds.b) MDCache::path_traverse() finds the inode in its cache looked-up from the first unlink operation. This inode is not a stray inode, thereby not resulting in stray reintegration.

2022-03-09T01:16:23.471-0500 7fd915a11700 12 mds.1.cache traverse: path seg depth 0 'ba2222' snapid head
2022-03-09T01:16:23.471-0500 7fd915a11700 20 mds.1.cache.dir(0x10000000019) lookup (ba2222, 'head')
2022-03-09T01:16:23.471-0500 7fd915a11700 20 mds.1.cache.dir(0x10000000019)   hit -> (ba2222,head)
2022-03-09T01:16:23.471-0500 7fd915a11700  7 mds.1.cache linking in remote in [inode 0x10000000022 [...2,head] /volumes/_nogroup/subzero/e3dad093-e653-4750-b6c0-4ea09a38093b/d0/f0000 rep@0.1 v145 s=0 nl=2 n(v0 rc2022-03-09T01:13:41.938796-0500 1=1+0) (ilink lock) (iversion lock) 0x562100014c00]
2022-03-09T01:16:23.471-0500 7fd915a11700 10 mds.1.cache.strays eval_remote [dentry #0x1/volumes/_nogroup/subzero/e3dad093-e653-4750-b6c0-4ea09a38093b/d1/ba2222 [2,head] auth REMOTE(reg) (dversion lock) v=37237 ino=0x10000000022 state=1610612736 | inodepin=0 dirty=1 0x5620ffff5680]
2022-03-09T01:16:23.471-0500 7fd915a11700 20 mds.1.cache.strays eval_remote: inode's primary dn not stray
2022-03-09T01:16:23.471-0500 7fd915a11700 10 mds.1.cache path_traverse finish on snapid head

mds.b sends various messages to mds.c, thereby handle_client_getattr getting called multiple times.

(mds.b) `dentry_unlink' message is received which links the stray inode in the dentry:

2022-03-09T01:16:27.219-0500 7fd915a11700  7 mds.1.cache handle_dentry_unlink on [dentry #0x1/volumes/_nogroup/subzero/e3dad093-e653-4750-b6c0-4ea09a38093b/d0/f0000 [2,head] rep@0.1 (dn lock) (dversion lock) v=0 ino=0x10000000022 state=0 | inodepin=1 0x56210009ef00]
2022-03-09T01:16:27.219-0500 7fd915a11700 12 mds.1.cache.dir(0x10000000018) unlink_inode [dentry #0x1/volumes/_nogroup/subzero/e3dad093-e653-4750-b6c0-4ea09a38093b/d0/f0000 [2,head] rep@0.1 (dn lock) (dversion lock) v=0 ino=0x10000000022 state=0 | inodepin=1 0x56210009ef00] [inode 0x10000000022 [...2,head] /volume\
s/_nogroup/subzero/e3dad093-e653-4750-b6c0-4ea09a38093b/d0/f0000 rep@0.1 v145 s=0 nl=2 n(v0 rc2022-03-09T01:13:41.938796-0500 1=1+0) (ilink lock) (iversion lock) | request=1 lock=0 caps=0 remoteparent=1 waiter=1 0x562100014c00]
2022-03-09T01:16:27.219-0500 7fd915a11700 12 mds.1.cache.dir(0x600) link_primary_inode [dentry #0x100/stray0/10000000022 [2,head] rep@0.1 NULL (dn lock) (dversion lock) v=0 ino=(nil) state=64|bottomlru 0x56210009fb80] [inode 0x10000000022 [...2,head] #10000000022 rep@-2.1 v145 s=0 nl=2 n(v0 rc2022-03-09T01:13:41.9\
38796-0500 1=1+0) (ilink lock) (iversion lock) | request=1 lock=0 caps=0 remoteparent=1 waiter=1 0x562100014c00]

The inode gets linked with an empty `stray_prior_path'. The auth mds should have encoded this (inode_t::encode), so there is a bug somewhere...

#21 Updated by Venky Shankar 9 months ago

  • Pull request ID set to 45407

#22 Updated by Venky Shankar 7 months ago

  • Status changed from New to Pending Backport

#23 Updated by Backport Bot 7 months ago

  • Copied to Backport #55427: pacific: unaccessible dentries after fsstress run with namespace-restricted caps added

#24 Updated by Backport Bot 7 months ago

  • Copied to Backport #55428: quincy: unaccessible dentries after fsstress run with namespace-restricted caps added

#25 Updated by Backport Bot 4 months ago

  • Tags set to backport_processed

#26 Updated by Konstantin Shalygin about 2 months ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100
  • Tags deleted (backport_processed)

Also available in: Atom PDF