Project

General

Profile

Actions

Bug #20938

open

CephFS: concurrent access to file from multiple nodes blocks for seconds

Added by Andras Pataki almost 7 years ago. Updated over 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
ceph-fuse
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When accessing the same file opened for read/write on multiple nodes via ceph-fuse, performance drops by about 3 orders of magnitude to 1-2 operations/second from thousands of operations/second. Tested both on Jewel 10.2.9 and the latest Luminous RC 12.1.2.

The core of the problem boils down to the following operation being run on the same file on multiple nodes (in a loop in the test program):

    int fd = open(filename, mode);
    read(fd, buffer, 100);
    close(fd);

Here are some results on our cluster:
One node, mode=read-only: 7000 opens/second
One node, mode=read-write: 7000 opens/second
Two nodes, mode=read-only: 7000 opens/second/node
Two nodes, mode=read-write: around 0.5 opens/second/node (!!!)
Two nodes, one read-only, one read-write: around 0.5 opens/second/node (!!!)
Two nodes, mode=read-write, but remove the 'read(fd, buffer,100)' line from the code: 500 opens/second/node

There seems to be some problems with opening the same file read/write and reading from the file on multiple nodes. That operation seems to be 3 orders of magnitude slower than other parallel access patterns to the same file. The 1 second time to open files almost seems like some timeout is happening somewhere. I have some suspicion that this has to do with capability management between the fuse client and the MDS, but I don't know enough about that protocol to make an educated assessment.

The attached small C program reproduces the issue. Run it on two different nodes with "timed_openrw_read <filename> rw" where <filename> is a file in cephfs with some data in it.


Files

timed_openrw_read.c (1.28 KB) timed_openrw_read.c C test program to reproduce issue Andras Pataki, 08/07/2017 03:26 PM
ceph-fuse.patch (3.14 KB) ceph-fuse.patch Zheng Yan, 08/09/2017 10:39 AM
Actions #1

Updated by Nathan Cutler almost 7 years ago

  • Project changed from Ceph to CephFS
Actions #2

Updated by Zheng Yan almost 7 years ago

I tried on latest Luminous RC + 4.12 kernel client. I got about 7000 opens/second in two nodes read-write case.
did you kernel client or fuse client? and which version?

Actions #3

Updated by Andras Pataki almost 7 years ago

I'm only running the fuse client. I see the problem both on Jewel (10.2.9 servers + fuse client) and on Luminous RC (12.1.2 servers + corresponding fuse client).

Actions #4

Updated by Zheng Yan almost 7 years ago

  • File ceph-fuse.patch added
Actions #5

Updated by Zheng Yan almost 7 years ago

  • File deleted (ceph-fuse.patch)
Actions #6

Updated by Zheng Yan almost 7 years ago

this slowness is due to limitation of fuse API. The attached patch is a workaround. (not 100% sure it doesn't break anything)

Actions #7

Updated by Jeff Layton almost 7 years ago

FUSE is the only caller of ->ll_lookup so a simpler fix might be to just change the mask field to 0 in the _lookup call in Client::ll_lookup. FUSE is the only caller of the ->ll_lookup operation, and it only uses st_ino, st_dev and st_rdev fields. We don't really need all of the caps there. e.g.:

diff --git a/src/client/Client.cc b/src/client/Client.cc
index b1aae051b0d1..11bc85ca1787 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -10092,7 +10092,8 @@ int Client::ll_lookup(Inode *parent, const char *name, struct stat *attr,
   string dname(name);
   InodeRef in;

-  r = _lookup(parent, dname, CEPH_STAT_CAP_INODE_ALL, &in, perms);
+  // FUSE only cares about st_dev, st_ino and st_rdev
+  r = _lookup(parent, dname, 0, &in, perms);
   if (r < 0) {
     attr->st_ino = 0;
     goto out;
Actions #8

Updated by Zheng Yan almost 7 years ago

what worry me is comment in fuse_lowlevel.h

struct fuse_entry_param {

...

        /** Inode attributes.
         *
         * Even if attr_timeout == 0, attr must be correct. For example,
         * for open(), FUSE uses attr.st_size from lookup() to determine
         * how many bytes to request. If this value is not correct,
         * incorrect data will be returned.
         */
        struct stat attr;

        /** Validity timeout (in seconds) for inode attributes. If
            attributes only change as a result of requests that come
            through the kernel, this should be set to a very large
            value. */
        double attr_timeout;

...

};
Actions #9

Updated by Jeff Layton almost 7 years ago

Ahh yeah, I remember seeing that in there a while back. I guess the danger is that we can end up instantiating an inode with almost completely bogus attributes.

Hmm...if that's the case then your original patch is also problematic, is it not? Could the inode could have gotten flushed out of the kernel core, while we still hold caps in the userland client? Kernel then issues another lookup later, and then we end up passing down an incomplete attr struct, and it instantiates the inode from that.

What we really ought to do is have fuse set some sort of flag in a new inode that the volatile attrs are bogus, and then use that flag to force a ->getattr on it before doing anything that requires up-to-date attrs.

That's a rather invasive change though if there's not already something like that in fuse that we're not aware of.

Actions #10

Updated by Zheng Yan almost 7 years ago

static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
        struct inode *inode = iocb->ki_filp->f_mapping->host;
        struct fuse_conn *fc = get_fuse_conn(inode);

        /*
         * In auto invalidate mode, always update attributes on read.
         * Otherwise, only update if we attempt to read past EOF (to ensure
         * i_size is up to date).
         */
        if (fc->auto_inval_data ||
            (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) {
                int err;
                err = fuse_update_attributes(inode, NULL, iocb->ki_filp, NULL);
                if (err)
                        return err;
        }

        return generic_file_read_iter(iocb, to);
}

The comment for st_size seems outdate. I agree that "zero mask" shoule be OK for ll_lookup

Actions #11

Updated by Jeff Layton over 6 years ago

  • Component(FS) ceph-fuse added

Ok, I think you're probably right there. Do we need a cmake test to ensure the fuse library defines FUSE_AUTO_INVAL_DATA, and check that the kernel sets it on the init request?

Actions #12

Updated by Zheng Yan over 6 years ago

kernel of RHEL7 supports FUSE_AUTO_INVAL_DATA. But FUSE_CAP_DONT_MASK was added in libfuse 3.0. Currently no major linux distribution includes libfuse 3.0.0

Actions #13

Updated by Andras Pataki over 6 years ago

Just checking if there is any viable patch that I could try for this issue in ceph-fuse. We are running into this problem in quite a few of our codes in a very painful way, so any help would be appreciated.

Actions #14

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New
Actions

Also available in: Atom PDF