Bug #64172


Test failure: test_multiple_path_r (tasks.cephfs.test_admin.TestFsAuthorize)

Added by Venky Shankar 3 months ago. Updated 2 months ago.

Status:
Fix Under Review
Priority:
Immediate
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
quincy,reef
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

/a/vshankar-2024-01-22_07:03:31-fs-wip-vshankar-testing-20240119.075157-1-testing-default-smithi/7525717

The test setup involves granting a "read" cap on a file system path (directory), remounting that directory as the file system root, and reading the files created there.
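
In outline, the failing sequence looks like this (a minimal sketch with a hypothetical run() helper and an illustrative mount point; not the actual tasks.cephfs test code):

import subprocess

def run(cmd):
    # raise if the command fails, mirroring how the test surfaces the failure
    subprocess.run(cmd, shell=True, check=True)

# 1. grant read-only MDS caps on two subdirectories
run("ceph fs authorize cephfs client.testuser /dir1/dir12 r /dir2/dir22 r")
# 2. mount one of those subdirectories as the client's filesystem root
run("sudo mount -t ceph :/dir1/dir12 /mnt/test -o name=testuser,mds_namespace=cephfs")
# 3. read a file created earlier under that subdirectory; this is the
#    step that unexpectedly fails with "Permission denied"
run("cat /mnt/test/testdir/testfile")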

MDS logs: ./remote/smithi157/log/ceph-mds.c.log.gz

2024-01-22T08:27:55.205+0000 7f81a7600640  1 -- [v2:172.21.15.157:6835/1231338113,v1:172.21.15.157:6837/1231338113] <== client.17628 v1:192.168.0.1:0/312855551 9 ==== client_request(client.17628:5 lookupino #0x1 2024-01-22T08:27:55.205939+0000 caller_uid=0, caller_gid=0{0,}) v6 ==== 176+0+0 (unknown 772642831 0 0) 0x55ab1d52bb00 con 0x55ab1d52e400
2024-01-22T08:27:55.205+0000 7f81a7600640  4 mds.0.server handle_client_request client_request(client.17628:5 lookupino #0x1 2024-01-22T08:27:55.205939+0000 caller_uid=0, caller_gid=0{0,}) v6
2024-01-22T08:27:55.205+0000 7f81a7600640 20 mds.0.356 get_session have 0x55ab1d202f00 client.17628 v1:192.168.0.1:0/312855551 state open
2024-01-22T08:27:55.205+0000 7f81a7600640 15 mds.0.server  oldest_client_tid=5
2024-01-22T08:27:55.205+0000 7f81a7600640  7 mds.0.cache request_start request(client.17628:5 nref=2 cr=0x55ab1d52bb00)
2024-01-22T08:27:55.205+0000 7f81a7600640  7 mds.0.server dispatch_client_request client_request(client.17628:5 lookupino #0x1 2024-01-22T08:27:55.205939+0000 caller_uid=0, caller_gid=0{0,}) v6
2024-01-22T08:27:55.205+0000 7f81a7600640 20 Session check_access path
2024-01-22T08:27:55.205+0000 7f81a7600640 10 MDSAuthCap is_capable inode(path / owner 0:0 mode 041777) by caller 0:0 mask 0 new 0:0 cap: MDSAuthCaps[allow r fsname=cephfs path="/dir1/dir12", allow r fsname=cephfs path="/dir2/dir22"]
2024-01-22T08:27:55.205+0000 7f81a7600640  7 mds.0.server reply_client_request -13 ((13) Permission denied) client_request(client.17628:5 lookupino #0x1 2024-01-22T08:27:55.205939+0000 caller_uid=0, caller_gid=0{0,}) v6
2024-01-22T08:27:55.205+0000 7f81a7600640 10 mds.0.server apply_allocated_inos 0x0 / [] / 0x0
2024-01-22T08:27:55.205+0000 7f81a7600640 20 mds.0.server lat 0.000095
2024-01-22T08:27:55.205+0000 7f81a7600640 10 mds.0.356 send_message_client client.17628 v1:192.168.0.1:0/312855551 client_reply(???:5 = -13 (13) Permission denied) v1
2024-01-22T08:27:55.205+0000 7f81a7600640  1 -- [v2:172.21.15.157:6835/1231338113,v1:172.21.15.157:6837/1231338113] --> v1:192.168.0.1:0/312855551 -- client_reply(???:5 = -13 (13) Permission denied) v1 -- 0x55ab1d5b4700 con 0x55ab1d52e400

Related issues 1 (1 open, 0 closed)

Related to CephFS - Bug #63764: Test failure: test_r_with_fsname_and_no_path_in_cap (tasks.cephfs.test_multifs_auth.TestMDSCaps) - In Progress - Rishabh Dave

Actions #1

Updated by Venky Shankar 3 months ago

  • Related to Bug #63764: Test failure: test_r_with_fsname_and_no_path_in_cap (tasks.cephfs.test_multifs_auth.TestMDSCaps) added
Actions #2

Updated by Venky Shankar 3 months ago

  • Assignee set to Rishabh Dave
  • Backport set to quincy,reef
  • Severity changed from 3 - minor to 2 - major
Actions #3

Updated by Venky Shankar 3 months ago

  • Priority changed from Normal to Immediate

Rishabh, the squid branch cutoff is in a week's time and it would be good to fix this before we cut the branch. Let's get this in soon!

Actions #4

Updated by Rishabh Dave 3 months ago

Venky Shankar wrote:

Rishabh, the squid branch cutoff is in a week's time and it would be good to fix this before we cut the branch. Let's get this in soon!

Cool, I'll make this the highest priority in that case.

Actions #5

Updated by Venky Shankar 3 months ago

Rishabh Dave wrote:

Venky Shankar wrote:

Rishabh, the squid branch cutoff is in a week's time and it would be good to fix this before we cut the branch. Let's get this in soon!

Cool, I'll make this the highest priority in that case.

Any updates on this, Rishabh?

Actions #6

Updated by Venky Shankar 3 months ago

Venky Shankar wrote:

Rishabh Dave wrote:

Venky Shankar wrote:

Rishabh, the squid branch cutoff is in a week's time and it would be good to fix this before we cut the branch. Let's get this in soon!

Cool, I'll make this the highest priority in that case.

Any updates on this, Rishabh?

ping?

Actions #7

Updated by Rishabh Dave 3 months ago

Venky Shankar wrote:

Rishabh Dave wrote:

Venky Shankar wrote:

Rishabh, the squid branch cutoff is in a week's time and it would be good to fix this before we cut the branch. Let's get this in soon!

Cool, I'll make this the highest priority in that case.

Any updates on this, Rishabh?

I ran this test several times locally with vstart_runner.py and manually. But the bug couldn't be reproduced a single time.

From teuthology.log -

2024-01-22T08:27:53.365 DEBUG:teuthology.orchestra.run.smithi157:> (cd /home/ubuntu/cephtest/mnt.0 && exec cat /home/ubuntu/cephtest/mnt.0/testdir/testfile)
2024-01-22T08:27:53.374 INFO:teuthology.orchestra.run.smithi157.stderr:cat: /home/ubuntu/cephtest/mnt.0/testdir/testfile: Permission denied
2024-01-22T08:27:53.375 DEBUG:teuthology.orchestra.run:got remote process result: 1

The test for whether or not the client can read using its caps failed. Just a second before that, Permission denied occurred one more time -

2024-01-22T08:27:52.364 DEBUG:teuthology.orchestra.run.smithi157:> sudo chmod 1777 /home/ubuntu/cephtest/mnt.0
2024-01-22T08:27:52.384 DEBUG:teuthology.orchestra.run:got remote process result: 1
2024-01-22T08:27:52.384 INFO:teuthology.orchestra.run.smithi157.stderr:chmod: changing permissions of '/home/ubuntu/cephtest/mnt.0': Permission denied

KernelMount.mount() calls CephFSMount._set_filemode_on_mntpt(), which runs the chmod command above. See https://github.com/ceph/ceph/blob/main/qa/tasks/cephfs/mount.py#L252. This is strange and, from the bug's perspective, useful, because the mount command was successful -

2024-01-22T08:27:52.246 DEBUG:teuthology.orchestra.run.smithi157:> sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage /bin/mount -t ceph :/dir1/dir12 /home/ubuntu/cephtest/mnt.0 -v -o norequire_active_mds,secret=AQAFJ65lkZwHBxAA8QRUMP77h1tVnUNLMjqL4g==,conf=/etc/ceph/ceph.conf,norbytes,name=testuser,mds_namespace=cephfs
...
2024-01-22T08:27:52.363 INFO:tasks.cephfs.kernel_mount:mount command passed
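
Roughly, the call chain being described above looks like the following (a simplified, hypothetical sketch; the real logic lives in qa/tasks/cephfs/kernel_mount.py and mount.py):

import subprocess

class KernelMountSketch:
    """Hypothetical stand-in for KernelMount; only models the ordering."""
    def __init__(self, hostfs_mntpt):
        self.hostfs_mntpt = hostfs_mntpt

    def mount(self):
        # the mount itself succeeded in this run ("mount command passed")
        self._run("sudo mount -t ceph :/dir1/dir12 " + self.hostfs_mntpt)
        # the base class then adjusts the mount point's mode; this chmod is
        # what hit "Permission denied" immediately after the mount
        self._set_filemode_on_mntpt()

    def _set_filemode_on_mntpt(self):
        self._run("sudo chmod 1777 " + self.hostfs_mntpt)

    def _run(self, cmd):
        subprocess.run(cmd, shell=True, check=True)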

The CephFS dir /dir1/dir12, for which the client has acquired caps, is mounted -

2024-01-22T08:27:48.727 DEBUG:teuthology.orchestra.run.smithi157:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs authorize cephfs client.testuser /dir1/dir12 r /dir2/dir22 r
...
2024-01-22T08:27:49.115 INFO:teuthology.orchestra.run.smithi157.stdout:[client.testuser]
2024-01-22T08:27:49.115 INFO:teuthology.orchestra.run.smithi157.stdout: key = AQAFJ65lkZwHBxAA8QRUMP77h1tVnUNLMjqL4g==
2024-01-22T08:27:49.115 INFO:teuthology.orchestra.run.smithi157.stdout: caps mds = "allow r fsname=cephfs path=/dir1/dir12, allow r fsname=cephfs path=/dir2/dir22" 
2024-01-22T08:27:49.115 INFO:teuthology.orchestra.run.smithi157.stdout: caps mon = "allow r fsname=cephfs" 
2024-01-22T08:27:49.115 INFO:teuthology.orchestra.run.smithi157.stdout: caps osd = "allow r tag cephfs data=cephfs" 

The fact that the mount was successful but the subsequent operations, chmod and read (using cat), were unsuccessful is pretty twisted. If the client didn't have permission even for read (which is the minimum for MDS caps), the mount should've failed. And if the mount passes, (at least) the read operation should've also passed.

I don't see how to proceed further to figure out the cause of this contradiction. At least there's no way to dig deeper using teuthology.log.

Now, analyzing MDS logs -

2024-01-22T08:27:55.205+0000 7f81a7600640  7 mds.0.server reply_client_request -13 ((13) Permission denied) client_request(client.17628:5 lookupino #0x1 2024-01-22T08:27:55.205939+0000 caller_uid=0, caller_gid=0{0,}) v6

I inspected MDSAuthCaps::is_capable() a few times to try and spot a logical error that might have caused this bug, but nothing turned up.

Revisiting the MDS logs, I spotted the following line -

2024-01-22T08:27:55.205+0000 7f81a7600640 10 MDSAuthCap is_capable inode(path / owner 0:0 mode 041777) by caller 0:0 mask 0 new 0:0 cap: MDSAuthCaps[allow r fsname=cephfs path="/dir1/dir12", allow r fsname=cephfs path="/dir2/dir22"]

This line lies just before the line that reports Permission denied. And, according to the details in it, it is fair for this bug to occur, since access is requested for the CephFS root but no caps are assigned for it. Caps are only assigned for /dir1/dir12 and /dir2/dir22. Therefore, it appears that MDSAuthCaps::is_capable() is doing its job correctly and the bug didn't occur due to it.
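
For reference, the path-scoped matching that is_capable() is expected to perform can be modelled roughly as below (an illustrative sketch, not the actual MDSAuthCaps code). It shows why a check against "/" is denied with these caps while a check against a path under /dir1/dir12 passes:

def cap_allows(cap_path, inode_path):
    # a cap scoped to a directory matches that directory and everything
    # below it; an empty cap path would mean "the whole filesystem"
    cap = cap_path.strip('/')
    ino = inode_path.strip('/')
    return cap == '' or cap == ino or ino.startswith(cap + '/')

caps = ["/dir1/dir12", "/dir2/dir22"]
print(any(cap_allows(c, "/") for c in caps))                             # False -> EACCES for lookupino #0x1
print(any(cap_allows(c, "/dir1/dir12/testdir/testfile") for c in caps))  # True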

 -1087> 2024-01-22T08:27:53.373+0000 7f81a7600640 10 MDSAuthCap is_capable inode(path /dir1/dir12/testdir owner 1000:1271 mode 040775) by caller 1000:1271 mask 1 new 0:0 cap: MDSAuthCaps[allow r fsname=cephfs path="/dir1/dir12", allow r fsname=cephfs path="/dir2/dir22"]

So, I've started going through the functions that call it. There is a log entry for Session::check_access() before this. inode_path in MDSAuthCaps::is_capable() is assigned its value in this function in this line - path = in->get_projected_inode()->stray_prior_path;. See: https://github.com/ceph/ceph/blob/main/src/mds/SessionMap.cc#L1047. The inode from which this path is obtained is assigned here - CInode *cur = try_get_auth_inode(mdr, req->get_filepath().get_ino()); see https://github.com/ceph/ceph/blob/main/src/mds/Server.cc#L2680.

I also spotted a few cases where access was requested for /dir1/dir12 and there are log messages registering Permission denied after them. Following is one such instance -

2024-01-22T08:27:53.373+0000 7f81a7600640 20 Session check_access path /dir1/dir12/testdir
2024-01-22T08:27:53.373+0000 7f81a7600640 10 MDSAuthCap is_capable inode(path /dir1/dir12/testdir owner 1000:1271 mode 040775) by caller 1000:1271 mask 1 new 0:0 cap: MDSAuthCaps[allow r fsname=cephfs path="/dir1/dir12", allow r fsname=cephfs path="/dir2/dir22"]
2024-01-22T08:27:53.373+0000 7f81a7600640 20 mds.0.bal hit_dir 0 pop is 1, frag * size 1 [pop IRD:[C 1.00e+00] IWR:[C 7.24e-01] RDR:[C 0.00e+00] FET:[C 0.00e+00] STR:[C 0.00e+00] *LOAD:2.4]
2024-01-22T08:27:53.373+0000 7f81a7600640 20 mds.0.bal hit_dir 0 pop 10 spread in [dir 0x10000000001 /dir1/dir12/ [2,head] auth v=9 cv=0/0 ap=0+2 state=1610612737|complete f(v0 m2024-01-22T08:27:48.566048+0000 1=0+1) n(v0 rc2024-01-22T08:27:48.622047+0000 b148 2=1+1) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=0 0x55ab1d5aad80]
2024-01-22T

There are 3 more such instances.

I intend to dig deeper through SessionMap.cc, Server.cc and any other code regions related to this code path. But since I am not very well-acquainted with the code besides MDSAuthCaps, I think I should get an ACK from you before digging deeper.

If I am missing something, please let me know.

Actions #8

Updated by Venky Shankar 2 months ago

Rishabh Dave wrote:

Venky Shankar wrote:

Rishabh Dave wrote:

Venky Shankar wrote:

Rishabh, the squid branch cutoff is in a week's time and it would be good to fix this before we cut the branch. Let's get this in soon!

Cool, I'll make this the highest priority in that case.

Any updates on this, Rishabh?

I ran this test several times locally with vstart_runner.py and manually. But the bug couldn't be reproduced a single time.

Are you using the user-space client for reproducing this? Have you tried using kernel clients?

Actions #9

Updated by Rishabh Dave 2 months ago

Venky Shankar wrote:

Rishabh Dave wrote:
I ran this test several times locally with vstart_runner.py and manually. But the bug couldn't be reproduced a single time.

Are you using the user-space client for reproducing this? Have you tried using kernel clients?

I tried both; neither could reproduce it. To be sure, I tried both again today and the test passed in both cases.

Actions #10

Updated by Rishabh Dave 2 months ago

I also ran this test with the FUSE and kernel clients on a very recent commit (137ea0f8739d992574f1dcd008176e0d8a7c34e5); it can't be reproduced. I've started going through recent changes made by Xiubo in the ceph-client repo. Let me know if you find some recent change by him to be relevant, or something else entirely.

Actions #11

Updated by Venky Shankar 2 months ago

Rishabh Dave wrote:

I also ran this test with the FUSE and kernel clients on a very recent commit (137ea0f8739d992574f1dcd008176e0d8a7c34e5); it can't be reproduced. I've started going through recent changes made by Xiubo in the ceph-client repo. Let me know if you find some recent change by him to be relevant, or something else entirely.

Do you have the links to the test runs?

Actions #12

Updated by Rishabh Dave 2 months ago

Venky Shankar wrote:

Rishabh Dave wrote:

I also ran this test with the FUSE and kernel clients on a very recent commit (137ea0f8739d992574f1dcd008176e0d8a7c34e5); it can't be reproduced. I've started going through recent changes made by Xiubo in the ceph-client repo. Let me know if you find some recent change by him to be relevant, or something else entirely.

Do you have the links to the test runs?

Those tests were run using vstart_runner.py.

Venky, you mentioned that you have some commits that you suspect might be the cause of this issue. Please mention them when you find some time. I'll try to look at recent commits from Xiubo in the meantime.

Actions #13

Updated by Venky Shankar 2 months ago

Rishabh Dave wrote:

Venky Shankar wrote:

Rishabh Dave wrote:

I also ran this test with the FUSE and kernel clients on a very recent commit (137ea0f8739d992574f1dcd008176e0d8a7c34e5); it can't be reproduced. I've started going through recent changes made by Xiubo in the ceph-client repo. Let me know if you find some recent change by him to be relevant, or something else entirely.

Do you have the links to the test runs?

Those tests were run using vstart_runner.py.

Sure, but were they using the testing kernel?

Venky, you mentioned that you have some commits that you suspect might be the cause of this issue. Please mention them when you find some time. I'll try to look at recent commits from Xiubo in the meantime.

About to reply to the patchset in ceph-devel (will cc you, Rishabh).

Actions #14

Updated by Xiubo Li 2 months ago

It seems we have hit this 3 times, right?

Actions #15

Updated by Xiubo Li 2 months ago

Rishabh,

The test failed at around 2024-01-22T08:27:53.365:

2024-01-22T08:27:53.364 INFO:tasks.cephfs.caps_helper:FS names are mentioned in moncap. moncap -
{moncap}
2024-01-22T08:27:53.364 INFO:tasks.cephfs.caps_helper:testing for presence of these FS names in output of "fs ls" command run by client.
2024-01-22T08:27:53.364 INFO:tasks.cephfs.caps_helper:test read perm: read file /home/ubuntu/cephtest/mnt.0/testdir/testfile and expect data "self.path = /home/ubuntu/cephtest/mnt.0/dir1/dir12/testdir/testfile
cephfs_name = cephfs
cephfs_mntpt = /
hostfs_mntpt = /home/ubuntu/cephtest/mnt.0" 
2024-01-22T08:27:53.365 INFO:teuthology.orchestra.run:Running command with timeout 300
2024-01-22T08:27:53.365 DEBUG:teuthology.orchestra.run.smithi157:> (cd /home/ubuntu/cephtest/mnt.0 && exec cat /home/ubuntu/cephtest/mnt.0/testdir/testfile)
2024-01-22T08:27:53.374 INFO:teuthology.orchestra.run.smithi157.stderr:cat: /home/ubuntu/cephtest/mnt.0/testdir/testfile: Permission denied
2024-01-22T08:27:53.375 DEBUG:teuthology.orchestra.run:got remote process result: 1

And I didn't find any Permission denied failure during this time in the mds logs. The first failure was around 2024-01-22T08:27:55.205, which was about 2 seconds later. And this time it was trying to lookupino #0x1, which should be expected, because the client couldn't access the cephfs root directory:

  -943> 2024-01-22T08:27:55.205+0000 7f81a7600640  7 mds.0.server reply_client_request -13 ((13) Permission denied) client_request(client.17628:5 lookupino #0x1 2024-01-22T08:27:55.205939+0000 caller_uid=0, caller_gid=0{0,}) v6
  -942> 2024-01-22T08:27:55.205+0000 7f81a7600640 10 mds.0.server apply_allocated_inos 0x0 / [] / 0x0
  -941> 2024-01-22T08:27:55.205+0000 7f81a7600640 20 mds.0.server lat 0.000095
  -940> 2024-01-22T08:27:55.205+0000 7f81a7600640 10 mds.0.356 send_message_client client.17628 v1:192.168.0.1:0/312855551 client_reply(???:5 = -13 (13) Permission denied) v1
  -939> 2024-01-22T08:27:55.205+0000 7f81a7600640  1 -- [v2:172.21.15.157:6835/1231338113,v1:172.21.15.157:6837/1231338113] --> v1:192.168.0.1:0/312855551 -- client_reply(???:5 = -13 (13) Permission denied) v1 -- 0x55ab1d5b4700 con 0x55ab1d52e400

In other words, the testfile ino# was 0x10000000005, so we should see a Permission denied for accessing 0x10000000005 instead.

So I just suspect it failed in the kclient directly, and never got a chance to send the request to the MDS, as I mentioned in the meeting:

 359 int ceph_open(struct inode *inode, struct file *file)
 360 { 
...
 399         if (!dentry) {
 400                 do_sync = true;
 401         } else {
 402                 path = ceph_mdsc_build_path(mdsc, dentry, &pathlen, &pathbase, 0);
 403                 if (IS_ERR(path)) {
 404                         do_sync = true;
 405                         err = 0;
 406                 } else {
 407                         err = ceph_mds_check_access(mdsc, path, mask);
 408                 }
 409                 ceph_mdsc_free_path(path, pathlen);
 410                 dput(dentry);
 411 
 412                 /* For none EACCES cases will let the MDS do the mds auth check */
 413                 if (err == -EACCES) {
 414                         return err;
 415                 } else if (err < 0) {
 416                         do_sync = true;
 417                         err = 0;
 418                 }
 419         }
...

It should return from Line#413. We should focus on what has happened in ceph_mds_check_access(). Also possibly the ceph_mdsc_build_path() just returned a wrong path.
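
To make the suspected path problem concrete: if the client-side check is handed a path built relative to the mount root instead of the full filesystem path, a cap scoped to /dir1/dir12 can never match, even though the MDS-side check on the full path would succeed. The helper below is a hypothetical model of the prefix comparison only, not kernel code:

def naive_prefix_check(cap_path, candidate_path):
    # "is candidate_path at or below the cap's path?"
    cap = cap_path.strip('/')
    cand = candidate_path.strip('/')
    return cap == cand or cand.startswith(cap + '/')

cap = "/dir1/dir12"                         # what "fs authorize" granted
full_path = "/dir1/dir12/testdir/testfile"  # the path the MDS would evaluate
mount_relative = "testdir/testfile"         # the path as seen from the mount root

print(naive_prefix_check(cap, full_path))       # True  -> the MDS-side check passes
print(naive_prefix_check(cap, mount_relative))  # False -> a client-side check would return EACCES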

Actions #16

Updated by Xiubo Li 2 months ago

I found the bug in the kclient patch series [PATCH v3 0/6] ceph: check the cephx mds auth access in client side: the ceph_mdsc_build_path() just returns an incorrect path.

For example, when doing $ sudo ./bin/mount.ceph :/mytest/ /mnt/kcephfs and then trying to $ cat /mnt/kcephfs/file, it will just return a file path of file, while it should be mytestdir/file instead. This should be the same issue as Greg mentioned in https://github.com/ceph/ceph/pull/48027#issuecomment-1741019086.

@Rishabh, the patch series hasn't been merged yet, so I will update it instead of sending a new patch on it.

Actions #17

Updated by Xiubo Li 2 months ago

Xiubo Li wrote:

I found the bug in the kclient patch series [PATCH v3 0/6] ceph: check the cephx mds auth access in client side: the ceph_mdsc_build_path() just returns an incorrect path.

For example, when doing $ sudo ./bin/mount.ceph :/mytest/ /mnt/kcephfs and then trying to $ cat /mnt/kcephfs/file, it will just return a file path of file, while it should be mytestdir/file instead. This should be the same issue as Greg mentioned in https://github.com/ceph/ceph/pull/48027#issuecomment-1741019086.

@Rishabh, the patch series hasn't been merged yet, so I will update it instead of sending a new patch on it.

Just ignore this. I have fixed this issue in the 3rd version of the patch series; for more detail, please see ceph_mds_auth_match(). It should be a bug in ceph_mds_check_access() then.

Actions #18

Updated by Venky Shankar 2 months ago

Xiubo Li wrote:

Xiubo Li wrote:

I found the bug in the kclient patch series [PATCH v3 0/6] ceph: check the cephx mds auth access in client side: the ceph_mdsc_build_path() just returns an incorrect path.

For example, when doing $ sudo ./bin/mount.ceph :/mytest/ /mnt/kcephfs and then trying to $ cat /mnt/kcephfs/file, it will just return a file path of file, while it should be mytestdir/file instead. This should be the same issue as Greg mentioned in https://github.com/ceph/ceph/pull/48027#issuecomment-1741019086.

@Rishabh, the patch series hasn't been merged yet, so I will update it instead of sending a new patch on it.

Just ignore this. I have fixed this issue in the 3rd version of the patch series; for more detail, please see ceph_mds_auth_match(). It should be a bug in ceph_mds_check_access() then.

I did doubt that implementation. Thx for confirming Xiubo.

Actions #19

Updated by Venky Shankar 2 months ago

  • Assignee changed from Rishabh Dave to Xiubo Li

Xiubo - assigning this tracker to you since you own the kclient patchset.

Actions #20

Updated by Venky Shankar 2 months ago

  • Component(FS) kceph added
  • Component(FS) deleted (MDS)
Actions #21

Updated by Xiubo Li 2 months ago

Venky Shankar wrote:

Xiubo - assigning this tracker to you since you own the kclient patchset.

Sure.

Actions #22

Updated by Xiubo Li 2 months ago

  • Status changed from New to In Progress
Actions #23

Updated by Xiubo Li 2 months ago

  • Status changed from In Progress to Fix Under Review
Actions #24

Updated by Xiubo Li 2 months ago

  • Project changed from CephFS to Linux kernel client
  • Category deleted (Correctness/Safety)