
Feature #53730

ceph-fuse: support "entry_timeout" and "attr_timeout" options to improve performance

Added by Sheng Xie over 2 years ago. Updated over 1 year ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
Performance/Resource Usage
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS):
ceph-fuse
Labels (FS):
Pull request ID:

Description

I noticed that mounting a directory with ceph-fuse gives lower performance than mounting it with the kernel client, for example when executing a simple "ls -l" command.
Why? I think the user-mode client requires more switches between kernel mode and user mode when executing commands.
Therefore, I think increasing the values of the "entry_timeout" and "attr_timeout" parameters provided by fuse can reduce the switches from kernel mode to user mode in the fuse filesystem, and so reduce the time consumed by commands such as "ls -l".
I tried it: I increased "entry_timeout" and "attr_timeout" from 0 to 60 by modifying the ceph-fuse source code. The test showed that when executing the "ls -l" command in a directory containing many files, the time consumed dropped by more than half. By checking the ceph-fuse log, I found that the requests received by ceph-fuse were reduced by a factor of five (the number of 'll_lookup' operations on parent directories was reduced).
I also briefly tested data consistency after this modification: I deleted a file from one ceph-fuse mount point, and from another ceph-fuse mount point I could immediately see that the file had been deleted.
Therefore, I want to add "fuse_entry_timeout" and "fuse_attr_timeout" parameters to ceph-fuse.
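
For context, this is roughly how the fuse lowlevel API hands these timeouts to the kernel. The sketch below is illustrative only, not the actual ceph-fuse code; `g_entry_timeout` and `g_attr_timeout` are hypothetical stand-ins for the proposed options:

    // Illustrative sketch (not ceph-fuse source): a lowlevel lookup handler
    // passing the proposed timeouts to the kernel via struct fuse_entry_param.
    #define FUSE_USE_VERSION 30
    #include <fuse_lowlevel.h>
    #include <cstring>

    static double g_entry_timeout = 0.0; // hypothetical: proposed fuse_entry_timeout
    static double g_attr_timeout  = 0.0; // hypothetical: proposed fuse_attr_timeout

    static void example_ll_lookup(fuse_req_t req, fuse_ino_t parent, const char *name)
    {
      struct fuse_entry_param fe;
      memset(&fe, 0, sizeof(fe));         // ceph-fuse currently leaves both timeouts at 0
      // ... resolve `name` under `parent` and fill fe.ino and fe.attr ...
      fe.entry_timeout = g_entry_timeout; // kernel may answer repeat lookups itself
      fe.attr_timeout  = g_attr_timeout;  // kernel may answer stat() from its cache
      fuse_reply_entry(req, &fe);
    }

With non-zero timeouts the kernel serves repeat lookups and stats from its own cache, which is where the reduction in ll_lookup traffic comes from.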

YB~6KVSS}XC{LJ[KB5QUO8Q.png - test results (16.2 KB) Sheng Xie, 01/14/2022 01:41 AM

History

#1 Updated by Xiubo Li over 2 years ago

BTW, have you tested fuse vs kclient side by side with the `entry_timeout` and `attr_timeout` set?

Caching only works correctly if you implement a disk-based filesystem, one where only the fuse process can alter metadata and all access goes only through fuse.

I haven't read the fuse-related code, but from the related comments I think the above timeouts will make the fuse kernel cache the entries and attributes. But if there is more than one fuse client on different nodes, or other kclients have also changed the same files, what will happen? Won't there be data consistency issues?

#2 Updated by Sheng Xie over 2 years ago

Sorry, I didn't test kclient. I just compared the performance of a ceph-fuse mount point before and after setting the 'timeout', and briefly tested data consistency between the two clients several times.

I totally agree with your explanation of the 'timeout'. And I have the same concern about data consistency as you. But I don't know why my data consistency test passed, or whether I need to do more consistency tests. In addition, the default value of the fuse timeout is 1s. If there is a data consistency problem at 60s, won't 1s have the same problem?

In addition, in some scenarios only one client will use a cephfs directory during a given period. With that scenario in mind, would you consider adding 'timeout' settings for users to enable in such specific cases?

#3 Updated by Xiubo Li over 2 years ago

Sheng Xie wrote:

Sorry, I didn't test kclient. I just compared the performance of a ceph-fuse mount point before and after setting the 'timeout', and briefly tested data consistency between the two clients several times.

I totally agree with your explanation of the 'timeout'. And I have the same concern about data consistency as you. But I don't know why my data consistency test passed, or whether I need to do more consistency tests. In addition, the default value of the fuse timeout is 1s. If there is a data consistency problem at 60s, won't 1s have the same problem?

Why do you think the default value is 1s? It should be 0 for ceph-fuse. For more detail please see https://github.com/ceph/ceph/blob/master/src/client/fuse_ll.cc#L217: `fuse_ll_lookup()` sets the whole `struct fuse_entry_param` to 0 except for the `attr` and `ino` members.

Maybe you can test this with two nodes, both using ceph-fuse: first, on one node, set the timeout to 60s or larger and try to lookup or stat a directory; second, on another node, try to delete or change that directory; and lastly, try to lookup or stat that directory again on the first node to see what happens.

In addition, in some scenarios only one client will use a cephfs directory during a given period. With that scenario in mind, would you consider adding 'timeout' settings for users to enable in such specific cases?

IMO this will only make sense when there is only one ceph-fuse client in a fs cluster. Otherwise you must disable it.

For example, if one fuse client has already mounted /mydir/subdir/, and another kclient/fuse/libcephfs client mounts /mydir/ and then deletes subdir/, and the first fuse client still caches it, bad things will happen.

#4 Updated by Sheng Xie over 2 years ago

Xiubo Li wrote:

Why do you think the default value is 1s? It should be 0 for ceph-fuse. For more detail please see https://github.com/ceph/ceph/blob/master/src/client/fuse_ll.cc#L217: `fuse_ll_lookup()` sets the whole `struct fuse_entry_param` to 0 except for the `attr` and `ino` members.

Thanks for reminding me. I reviewed the `fuse_ll_lookup()` code and noticed that `fuse_entry_param` is initialized to 0.

Maybe you can test this with two nodes, both using ceph-fuse: first, on one node, set the timeout to 60s or larger and try to lookup or stat a directory; second, on another node, try to delete or change that directory; and lastly, try to lookup or stat that directory again on the first node to see what happens.

I retested the consistency problem manually; no consistency problem was found.
Here are my test steps.

Both nodes (A-node, B-node) execute the same 'ceph-fuse -r /test /cephfs --id=cephfs' command to mount the directory. The ceph-fuse RPM packages on both nodes have been modified to set entry_timeout=60 and attr_timeout=60. Of course, I verified that the settings took effect.

case A:
(Execute the following cmd in sequence as quickly as possible)
A-node: 'll /cephfs'
B-node: 'mkdir /cephfs/dir0'
A-node: 'll /cephfs'
result: A-node can see the /cephfs/dir0 directory immediately.

case B:
A-node: 'll /cephfs'
B-node: 'rm /cephfs/dir0 -rf'
A-node: 'll /cephfs'
result: A-node can't see the dir0 directory immediately.

Now I'm also confused about how fuse maintains data consistency when holding the cache. Maybe there are other mechanisms?

IMO this will only make sense when there is only one ceph-fuse client in a fs cluster. Otherwise you must disable it.
For example, if one fuse client has already mounted /mydir/subdir/, and another kclient/fuse/libcephfs client mounts /mydir/ and then deletes subdir/, and the first fuse client still caches it, bad things will happen.

In the scenarios you mentioned, this should be disabled, with the 'timeout' set to 0, but there must be some scenarios where a non-zero 'timeout' is more friendly. In fact, I have encountered such a scenario, in which a cephfs directory will only be mounted by a single ceph-fuse process (yes, we don't want to use RBD). So can we make the 'timeout' configurable, with a default value of 0? A hypothetical sketch of what that could look like follows.
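
For illustration only, here is a hypothetical sketch of how such options might be declared, following the style of the existing declarations in src/common/options.cc; the names, types, and defaults are assumptions, and the eventual patch may differ:

    // Hypothetical sketch only, a fragment in the style of src/common/options.cc;
    // option names, types, and defaults are assumptions, not the merged change.
    Option("fuse_entry_timeout", Option::TYPE_FLOAT, Option::LEVEL_ADVANCED)
    .set_default(0.0)
    .set_description("seconds the kernel may cache ceph-fuse dentry lookups"),

    Option("fuse_attr_timeout", Option::TYPE_FLOAT, Option::LEVEL_ADVANCED)
    .set_default(0.0)
    .set_description("seconds the kernel may cache ceph-fuse inode attributes"),

A default of 0 would preserve today's behavior; only deployments that know a directory has a single client would raise it.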

#5 Updated by Xiubo Li over 2 years ago

Could you try:

case C:
A-node: 'touch /cephfs/file0; ll /cephfs'
B-node: 'ln /cephfs/file0 /cephfs/file1; rm /cephfs/file0 -f'
A-node: 'll /cephfs/file0'

To see whether you can still see file0 on A-node.

From the ceph-fuse code: when you delete a directory or a file that has no hard links, ceph-fuse gets notified and then tries to drop the dentry cache in the fuse kernel. I guess this is the reason why you couldn't see 'dir0' in case B; I haven't tested this yet.
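
For reference, libfuse exposes a notify API for exactly this kind of invalidation. A minimal sketch, assuming libfuse v3 and a mounted session `se` (illustrative, not the actual ceph-fuse code path):

    // Minimal sketch, assuming libfuse v3: ask the kernel to drop a cached
    // dentry so the next lookup comes back to userspace even before
    // entry_timeout expires. `se` is the filesystem's mounted fuse session.
    #define FUSE_USE_VERSION 30
    #include <fuse_lowlevel.h>
    #include <cstring>

    static void drop_kernel_dentry(struct fuse_session *se,
                                   fuse_ino_t parent, const char *name)
    {
      fuse_lowlevel_notify_inval_entry(se, parent, name, strlen(name));
    }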

NOTE: there are some other places you may need to change besides `fuse_ll_lookup()`.

#6 Updated by Sheng Xie over 2 years ago

I tested case C. As you guessed, A-node can still see file0 after B-node deletes it, but when 'll /cephfs' is executed, file0 cannot be seen.
This should be related to the inode lookup count of directories and files.

This is fuse's definition of the unlink method. Maybe this behavior is related to it:

    /**
     * Remove a file
     *
     * If the file's inode's lookup count is non-zero, the file
     * system is expected to postpone any removal of the inode
     * until the lookup count reaches zero (see description of the
     * forget function).
     *
     * Valid replies:
     *   fuse_reply_err
     *
     * @param req request handle
     * @param parent inode number of the parent directory
     * @param name to remove
     */
    void (*unlink) (fuse_req_t req, fuse_ino_t parent, const char *name);

Yes, now I know that other places need to be modified to ensure data consistency, such as in case C.
It would be better if you could provide more precise directions. I haven't fully figured out how a cephfs client notifies other cephfs clients to clear the dentry cache in the kernel when a file needs to be deleted.

#7 Updated by Xiubo Li over 2 years ago

Sheng Xie wrote:

I tested case C. As you guessed, A-node can still see file0 after B-node deletes it, but when 'll /cephfs' is executed, file0 cannot be seen.

This is expected; this is why I asked you to do `ll /cephfs/file0` instead of `ll /cephfs/`.

On the MDS side, when you try to delete or create dentries, the MDS will try to wrlock the directory's inode, which will cause it to revoke the Fs caps from the clients, and the clients will then clear the COMPLETE flag. After that, when you readdir again, if the directory is not COMPLETE the clients will ignore the local caches and fetch all the dentries from the MDS.
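
In pseudocode, the mechanism described above amounts to something like the following. This is an illustrative sketch, not the actual Ceph client code; the `I_COMPLETE` name is borrowed from the client for readability and defined locally here:

    // Illustrative pseudocode of the mechanism described above: a readdir may
    // be served from the local dentry cache only while the directory is still
    // marked COMPLETE; the MDS revoking the shared caps (Fs), e.g. because
    // another client created or deleted a dentry, clears that flag.
    struct Inode { unsigned flags = 0; /* ... */ };
    static const unsigned I_COMPLETE = 1; // borrowed name, defined for the sketch

    static bool can_serve_readdir_from_cache(const Inode *dir)
    {
      return dir->flags & I_COMPLETE; // otherwise fetch all dentries from the MDS
    }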

This should be related to the inode lookup count of directories and files.

This is fuse's definition of the unlink method. Maybe this behavior is related to it:
[...]

Yes, now I know that other places need to be modified to ensure data consistency, such as in case C.
It would be better if you could provide more precise directions. I haven't fully figured out how a cephfs client notifies other cephfs clients to clear the dentry cache in the kernel when a file needs to be deleted.

The notification happens mostly in caps grant/update, etc. You need to check the MDS locker and capabilities related code and logic.

#8 Updated by Sheng Xie about 2 years ago

Thank you for the pointer.
I spent some time learning the logic of MDS caps; although it is complex and I haven't fully understood it yet, I found a possible solution: I just need to make the MDS recycle CEPH_CAP_LINK_SHARED in issue_caps() regardless of whether inode->nlink is equal to 0. I will test whether this method works.

#9 Updated by Xiubo Li about 2 years ago

Sheng Xie wrote:

Thank you for the pointer.
I spent some time learning the logic of MDS caps; although it is complex and I haven't fully understood it yet, I found a possible solution: I just need to make the MDS recycle CEPH_CAP_LINK_SHARED in issue_caps() regardless of whether inode->nlink is equal to 0. I will test whether this method works.

Wait, I don't think this will work.

If one inode has hard links and you delete only one of its dentries, inode->nlink won't be zero, but you still need to tell all the ceph-fuse clients to invalidate that dentry cache in the fuse kernel.

I think you may need to do this whenever inode->nlink decreases, and tell all the ceph-fuse clients to invalidate the related caches.

#10 Updated by Sheng Xie about 2 years ago

Xiubo Li wrote:

If one inode has hard links and you delete only one of its dentries, inode->nlink won't be zero, but you still need to tell all the ceph-fuse clients to invalidate that dentry cache in the fuse kernel.

Yes, I know this.

This is the issue_caps() code snippet before modification:

    // notify clients about deleted inode, to make sure they release caps ASAP.
    if (in->inode.nlink == 0)
      wanted |= CEPH_CAP_LINK_SHARED;

I mean, remove the "nlink == 0" check and recycle CEPH_CAP_LINK_SHARED in every case. But now I know it would cause other problems. For clarity, the change amounts to the sketch below.
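
The experiment, expressed against the quoted snippet (a fragment sketch in the same issue_caps() context, not a tested patch):

    // Sketch only, relative to the quoted snippet above. Instead of adding
    // 'Ls' to the wanted caps only for unlinked inodes:
    //   if (in->inode.nlink == 0)
    //     wanted |= CEPH_CAP_LINK_SHARED;
    // the experiment wants 'Ls' unconditionally:
    wanted |= CEPH_CAP_LINK_SHARED;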

Xiubo Li wrote:

I think you may need to do this whenever the inode->nlink decreases and tell all the ceph-fuse clients to invalidate the related caches.

I think there will still be some problems. For example, in a rename operation the inode's nlink count will not change, right?

Therefore, maybe we need to trigger the kernel to clear the corresponding inode cache according to the specific operation type (unlink, rename, and so on), rather than according to nlink?

#11 Updated by Xiubo Li about 2 years ago

Sheng Xie wrote:

Xiubo Li wrote:
[...]
Yes, I know this.

This is the issue_caps() code snippet before modification:
[...]
I mean, remove the "nlink == 0" check and recycle CEPH_CAP_LINK_SHARED in every case. But now I know it would cause other problems.

This is not recycling the 'Ls' caps; it is just trying to issue the 'Ls' caps to clients here. And it still won't guarantee that the MDS can successfully issue them to the clients; that will depend on the current lock state of in->linklock.

Xiubo Li wrote:
[...]
I think there will still be some problems. For example, in a rename operation the inode's nlink count will not change, right?

I think so.

Therefore, maybe we need to trigger the kernel to clear the corresponding inode cache according to the specific operation type (unlink, rename, and so on), rather than according to nlink?

I was thinking we may add a dedicated member or a dedicated notification message to do this, instead of relying on nlink.

#12 Updated by Sheng Xie about 2 years ago

I will make a proper modification as soon as possible after fully understanding this logic, but it may take some time.

#13 Updated by Xiubo Li about 2 years ago

Sheng Xie wrote:

I will make a proper modification as soon as possible after fully understanding this logic, but it may take some time.

Today I pushed a patch [1] to fix another issue with the dentry cache; I'm not sure whether it helps move this issue forward. You can try it, and if it works then you can go on to add the "entry_timeout" and "attr_timeout". You may also need to consider some corner cases which we haven't found yet.

[1]: https://github.com/ceph/ceph/pull/44432

#14 Updated by Xiubo Li about 2 years ago

Xiubo Li wrote:

Sheng Xie wrote:

I will make a proper modification as soon as possible after fully understanding this logic, but it may take some time.

Today I pushed a patch [1] to fix another issue with the dentry cache; I'm not sure whether it helps move this issue forward. You can try it, and if it works then you can go on to add the "entry_timeout" and "attr_timeout". You may also need to consider some corner cases which we haven't found yet.

[1]: https://github.com/ceph/ceph/pull/44432

I tested this today, and it works in fuse too. For more detail please see: https://github.com/ceph/ceph/pull/44432#discussion_r779235719

#15 Updated by Sheng Xie about 2 years ago

I just carried out the same test. Your modification solves this problem very well, so I will go on to add the "entry_timeout" and "attr_timeout".

#16 Updated by Sheng Xie about 2 years ago

I did a simple test after merging your PR: I timed "ls -l" on a directory containing 3000 4K files and counted the ops received by the ceph-fuse daemon.
test results (see the attached image)

The test results are exciting!
When timeout = 0 (that is, entries and attrs are not cached), the time consumed and the number of lookup ops increase linearly with path depth, since without kernel caching every path component triggers a lookup.
But when timeout = 1 or larger, the time consumed and the number of lookup ops stay steady as path depth increases.

#17 Updated by Xiubo Li about 2 years ago

Sheng Xie wrote:

I did a simple test after merging your PR: I timed "ls -l" on a directory containing 3000 4K files and counted the ops received by the ceph-fuse daemon.
test results (see the attached image)

The test results are exciting!
When timeout = 0 (that is, entries and attrs are not cached), the time consumed and the number of lookup ops increase linearly with path depth, since without kernel caching every path component triggers a lookup.
But when timeout = 1 or larger, the time consumed and the number of lookup ops stay steady as path depth increases.

Yeah, this looks really exciting and cool.

Have you tested with multiple fuse clients?

#18 Updated by Sheng Xie about 2 years ago

The test results are exciting! Can we change the default values of 'entry_timeout' and 'attr_timeout' from 0 to 1?

Xiubo Li wrote:

Have you tested with multiple fuse clients?

I didn't test the performance with multiple clients, but I did verify the consistency of the 'entry' and 'attr' with multiple clients.

BTW, the average time consumed by the kernel client is 0.04s! So we can further improve the performance of fuse clients if we find a way to reduce the getxattr ops. Do you have any more suggestions?

#19 Updated by Xiubo Li about 2 years ago

  • Tracker changed from Support to Feature
  • Status changed from New to Fix Under Review
  • Assignee set to Sheng Xie
  • Pull request ID set to 44593

#20 Updated by Patrick Donnelly over 1 year ago

  • Target version deleted (v17.0.0)
