Project

General

Profile

Actions

Bug #64679

open

cephfs: removexattr should always return -ENODATA when xattr doesn't exist

Added by Xiubo Li about 2 months ago. Updated about 2 months ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This issue is from https://github.com/ceph/ceph/pull/55087.

The POSIX says we should return -ENODATA when the corresponding
attribute doesn't exist when removing it. While there is one
exception for the acl ones in the local filesystems, for exmaple
for xfs, which will treat it as success.

While in the MDS side there have two ways to remove the xattr:
sending a CEPH_MDS_OP_SETXATTR request by setting the 'flags' with
CEPH_XATTR_REMOVE and just issued a CEPH_MDS_OP_RMXATTR request
directly.

For the first one it will always return 0 when the corresponding
xattr doesn't exist, while for the later one it will return
-ENODATA instead, this should be fixed in MDS to make them to be
consistent.

Actions #1

Updated by Xiubo Li about 2 months ago

  • Description updated (diff)
Actions #2

Updated by Xiubo Li about 2 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 55945
Actions #3

Updated by Xiubo Li about 2 months ago

  • Description updated (diff)
Actions #4

Updated by Venky Shankar about 2 months ago

Xiubo,

I see the following commit in the testing kernel:

commit dd5c1cd171679b4e0ceeab3e8d62f25c2b4e6412 (HEAD -> testing, ceph-client/testing)
Author: Xiubo Li <xiubli@redhat.com>
Date:   Mon Mar 4 10:25:00 2024 +0800

    ceph: return -ENODATA when xattr doesn't exist for removexattr

    The POSIX says we should return -ENODATA when the corresponding
    attribute doesn't exist when removing it. While there is one
    exception for the acl ones in the local filesystems, for exmaple
    for xfs, which will treat it as success.

    While in the MDS side there have two ways to remove the xattr:
    sending a CEPH_MDS_OP_SETXATTR request by setting the 'flags' with
    CEPH_XATTR_REMOVE and just issued a CEPH_MDS_OP_RMXATTR request
    directly.

    For the first one it will always return 0 when the corresponding
    xattr doesn't exist, while for the later one it will return
    -ENODATA instead, this should be fixed in MDS to make them to be
    consistent.

    URL: https://tracker.ceph.com/issues/64679
    Signed-off-by: Xiubo Li <xiubli@redhat.com>

The interesting part from the change is:

diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index e066a556eccb..08b14eea66e9 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -613,11 +613,10 @@ static int __set_xattr(struct ceph_inode_info *ci,
                        return err;
                }
                if (update_xattr < 0) {
-                       if (xattr)
-                               __remove_xattr(ci, xattr);
+                       err = __remove_xattr(ci, xattr);
                        kfree(name);
                        kfree(*newxattr);
-                       return 0;
+                       return err;
                }
        }

Is this taking care of the case when a posix acl is being removed? I ask since I'm seeing a failure in fs suite: https://pulpito.ceph.com/vshankar-2024-03-11_01:40:49-fs-main-testing-default-smithi/7590290/

2024-03-11T03:03:56.220 INFO:tasks.workunit.client.0.smithi029.stderr:+ cp -av .snap/k .
2024-03-11T03:03:56.265 INFO:tasks.workunit.client.0.smithi029.stdout:'.snap/k' -> './k'
2024-03-11T03:03:56.265 INFO:tasks.workunit.client.0.smithi029.stdout:'.snap/k/coreutils_8.5.orig.tar.gz' -> './k/coreutils_8.5.orig.tar.gz'
2024-03-11T03:03:56.266 INFO:tasks.workunit.client.0.smithi029.stderr:cp: preserving permissions for './k/coreutils_8.5.orig.tar.gz': No data available
2024-03-11T03:03:56.272 INFO:tasks.workunit.client.0.smithi029.stdout:'.snap/k/coreutils-8.5' -> './k/coreutils-8.5'
2024-03-11T03:03:56.272 INFO:tasks.workunit.client.0.smithi029.stdout:'.snap/k/coreutils-8.5/THANKS' -> './k/coreutils-8.5/THANKS'
2024-03-11T03:03:56.272 INFO:tasks.workunit.client.0.smithi029.stderr:cp: preserving permissions for './k/coreutils-8.5/THANKS': No data available

which seems to be triggered from this (-ENODATA/No Data Available) - which this change is fiddling with, where previously __remove_xattr() is called conditionally, which isn't the case now and __remove_xattr() returns -ENODATA on a non-existent xattr:

static int __remove_xattr(struct ceph_inode_info *ci,
                          struct ceph_inode_xattr *xattr)
{
        if (!xattr)
                return -ENODATA;
...
...
...
}
Actions #5

Updated by Xiubo Li about 2 months ago

Venky Shankar wrote:

Xiubo,

I see the following commit in the testing kernel:

[...]

The interesting part from the change is:

[...]

Is this taking care of the case when a posix acl is being removed? I ask since I'm seeing a failure in fs suite: https://pulpito.ceph.com/vshankar-2024-03-11_01:40:49-fs-main-testing-default-smithi/7590290/

This change will fix the case when the Fx caps is issued.

[...]

which seems to be triggered from this (-ENODATA/No Data Available) - which this change is fiddling with, where previously __remove_xattr() is called conditionally, which isn't the case now and __remove_xattr() returns -ENODATA on a non-existent xattr:

Let me have a look.

BTW, could this be reproduceable ?

[...]

Actions #6

Updated by Venky Shankar about 2 months ago

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo,

I see the following commit in the testing kernel:

[...]

The interesting part from the change is:

[...]

Is this taking care of the case when a posix acl is being removed? I ask since I'm seeing a failure in fs suite: https://pulpito.ceph.com/vshankar-2024-03-11_01:40:49-fs-main-testing-default-smithi/7590290/

This change will fix the case when the Fx caps is issued.

Yeh, for xattrs expect posix acls, returning ENOENT is correct. For posix acls with Fx caps, this still needs to return 0, isn't it?

[...]

which seems to be triggered from this (-ENODATA/No Data Available) - which this change is fiddling with, where previously __remove_xattr() is called conditionally, which isn't the case now and __remove_xattr() returns -ENODATA on a non-existent xattr:

Let me have a look.

BTW, could this be reproduceable ?

[...]

Pretty much yes. See - https://pulpito.ceph.com/leonidus-2024-03-09_08:15:07-fs-wip-lusov-qdb-distro-default-smithi/ by Leonid when I saw this error in his branch last Friday and suggested to debug what's going on.

Actions #7

Updated by Xiubo Li about 2 months ago

Venky Shankar wrote:

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo,

I see the following commit in the testing kernel:

[...]

The interesting part from the change is:

[...]

Is this taking care of the case when a posix acl is being removed? I ask since I'm seeing a failure in fs suite: https://pulpito.ceph.com/vshankar-2024-03-11_01:40:49-fs-main-testing-default-smithi/7590290/

This change will fix the case when the Fx caps is issued.

Yeh, for xattrs expect posix acls, returning ENOENT is correct. For posix acls with Fx caps, this still needs to return 0, isn't it?

This should be handled in the caller, not here. Because here is a common helper also for non-ACLs.

[...]

which seems to be triggered from this (-ENODATA/No Data Available) - which this change is fiddling with, where previously __remove_xattr() is called conditionally, which isn't the case now and __remove_xattr() returns -ENODATA on a non-existent xattr:

Let me have a look.

BTW, could this be reproduceable ?

[...]

Pretty much yes. See - https://pulpito.ceph.com/leonidus-2024-03-09_08:15:07-fs-wip-lusov-qdb-distro-default-smithi/ by Leonid when I saw this error in his branch last Friday and suggested to debug what's going on.

So strange, I run the same test locally and I couldn't reproduce it.

Actions #8

Updated by Venky Shankar about 2 months ago

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo,

I see the following commit in the testing kernel:

[...]

The interesting part from the change is:

[...]

Is this taking care of the case when a posix acl is being removed? I ask since I'm seeing a failure in fs suite: https://pulpito.ceph.com/vshankar-2024-03-11_01:40:49-fs-main-testing-default-smithi/7590290/

This change will fix the case when the Fx caps is issued.

Yeh, for xattrs expect posix acls, returning ENOENT is correct. For posix acls with Fx caps, this still needs to return 0, isn't it?

This should be handled in the caller, not here. Because here is a common helper also for non-ACLs.

Fair enough. So, the caller seemed to be not catching this, resulting in the error bubbling up to the application.

Actions #9

Updated by Xiubo Li about 2 months ago

Venky Shankar wrote:

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo,

I see the following commit in the testing kernel:

[...]

The interesting part from the change is:

[...]

Is this taking care of the case when a posix acl is being removed? I ask since I'm seeing a failure in fs suite: https://pulpito.ceph.com/vshankar-2024-03-11_01:40:49-fs-main-testing-default-smithi/7590290/

This change will fix the case when the Fx caps is issued.

Yeh, for xattrs expect posix acls, returning ENOENT is correct. For posix acls with Fx caps, this still needs to return 0, isn't it?

This should be handled in the caller, not here. Because here is a common helper also for non-ACLs.

Fair enough. So, the caller seemed to be not catching this, resulting in the error bubbling up to the application.

Yeah. So strange, locally I still couldn't reproduce it by running all the snap related scripts.

Actions #10

Updated by Xiubo Li about 2 months ago

Locally I have tried all the possible cases with the upstream ceph code and couldn't reproduce it, and have partially revert the patch and at the same time added some debug logs:

https://github.com/ceph/ceph-client/commit/bf761b6efc7e0ff8090671a5bd7700e429806836

Actions #11

Updated by Venky Shankar about 2 months ago

Xiubo Li wrote:

Locally I have tried all the possible cases with the upstream ceph code and couldn't reproduce it, and have partially revert the patch and at the same time added some debug logs:

https://github.com/ceph/ceph-client/commit/bf761b6efc7e0ff8090671a5bd7700e429806836

I'll run the new kernel build with the failed tests and update.

Actions #12

Updated by Xiubo Li about 2 months ago

Venky Shankar wrote:

Xiubo Li wrote:

Locally I have tried all the possible cases with the upstream ceph code and couldn't reproduce it, and have partially revert the patch and at the same time added some debug logs:

https://github.com/ceph/ceph-client/commit/bf761b6efc7e0ff8090671a5bd7700e429806836

I'll run the new kernel build with the failed tests and update.

Sure, thanks. Please share the link about that. I need to do a check.

Actions #13

Updated by Xiubo Li about 2 months ago

I have finally reproduced by by pulling the latest ceph code. I believe there is one commit in MDS have improved the performance by issuing the Xx caps properly.
With the old code because it was not properly issued the Xx caps, so it will always send the sync setattr request to MDS, while my patch is just fixing the local cached changes.

I have updated the PR and the kclient patches to fix and improve it.

Actions #15

Updated by Venky Shankar about 2 months ago

Xiubo Li wrote:

I have finally reproduced by by pulling the latest ceph code. I believe there is one commit in MDS have improved the performance by issuing the Xx caps properly.
With the old code because it was not properly issued the Xx caps, so it will always send the sync setattr request to MDS, while my patch is just fixing the local cached changes.

I have updated the PR and the kclient patches to fix and improve it.

I'll have a look. Thx. BTW, the kclient patch

                if (update_xattr < 0) {
+                       static int __counter = 0;
                        err = __remove_xattr(ci, xattr);
+                       if (err) {
+                               pr_warn_ratelimited_client(cl,
+                                       "__remove_xattr return %d, flags %d\n",
+                                       err, flags);
+                               if (0 == __counter++)
+                                       dump_stack();
+                       }
                        kfree(name);
                        kfree(*newxattr);
-                       return err;
+                       return 0;

caused the tests to pass (return 0!). Do you still want to debug the dump_stack?

Actions #16

Updated by Xiubo Li about 2 months ago

Venky Shankar wrote:

Xiubo Li wrote:

I have finally reproduced by by pulling the latest ceph code. I believe there is one commit in MDS have improved the performance by issuing the Xx caps properly.
With the old code because it was not properly issued the Xx caps, so it will always send the sync setattr request to MDS, while my patch is just fixing the local cached changes.

I have updated the PR and the kclient patches to fix and improve it.

I'll have a look. Thx. BTW, the kclient patch

[...]

caused the tests to pass (return 0!). Do you still want to debug the dump_stack?

Thanks Venky, locally I have reproduced it and fixed it in the testing branch. More detail please see https://tracker.ceph.com/issues/64679#note-13. I will send out the patch soon.

Actions

Also available in: Atom PDF