Bug #20938
CephFS: concurrent access to file from multiple nodes blocks for seconds
0%
Description
When accessing the same file opened for read/write on multiple nodes via ceph-fuse, performance drops by about 3 orders of magnitude to 1-2 operations/second from thousands of operations/second. Tested both on Jewel 10.2.9 and the latest Luminous RC 12.1.2.
The core of the problem boils down to the following operation being run on the same file on multiple nodes (in a loop in the test program):
int fd = open(filename, mode); read(fd, buffer, 100); close(fd);
Here are some results on our cluster:
One node, mode=read-only: 7000 opens/second
One node, mode=read-write: 7000 opens/second
Two nodes, mode=read-only: 7000 opens/second/node
Two nodes, mode=read-write: around 0.5 opens/second/node (!!!)
Two nodes, one read-only, one read-write: around 0.5 opens/second/node (!!!)
Two nodes, mode=read-write, but remove the 'read(fd, buffer,100)' line from the code: 500 opens/second/node
There seems to be some problems with opening the same file read/write and reading from the file on multiple nodes. That operation seems to be 3 orders of magnitude slower than other parallel access patterns to the same file. The 1 second time to open files almost seems like some timeout is happening somewhere. I have some suspicion that this has to do with capability management between the fuse client and the MDS, but I don't know enough about that protocol to make an educated assessment.
The attached small C program reproduces the issue. Run it on two different nodes with "timed_openrw_read <filename> rw" where <filename> is a file in cephfs with some data in it.
History
#1 Updated by Nathan Cutler almost 6 years ago
- Project changed from Ceph to CephFS
#2 Updated by Zheng Yan almost 6 years ago
I tried on latest Luminous RC + 4.12 kernel client. I got about 7000 opens/second in two nodes read-write case.
did you kernel client or fuse client? and which version?
#3 Updated by Andras Pataki almost 6 years ago
I'm only running the fuse client. I see the problem both on Jewel (10.2.9 servers + fuse client) and on Luminous RC (12.1.2 servers + corresponding fuse client).
#4 Updated by Zheng Yan almost 6 years ago
- File ceph-fuse.patch added
#5 Updated by Zheng Yan almost 6 years ago
- File deleted (
ceph-fuse.patch)
#6 Updated by Zheng Yan almost 6 years ago
- File ceph-fuse.patch View added
- Status changed from New to 12
this slowness is due to limitation of fuse API. The attached patch is a workaround. (not 100% sure it doesn't break anything)
#7 Updated by Jeff Layton almost 6 years ago
FUSE is the only caller of ->ll_lookup so a simpler fix might be to just change the mask field to 0 in the _lookup call in Client::ll_lookup. FUSE is the only caller of the ->ll_lookup operation, and it only uses st_ino, st_dev and st_rdev fields. We don't really need all of the caps there. e.g.:
diff --git a/src/client/Client.cc b/src/client/Client.cc index b1aae051b0d1..11bc85ca1787 100644 --- a/src/client/Client.cc +++ b/src/client/Client.cc @@ -10092,7 +10092,8 @@ int Client::ll_lookup(Inode *parent, const char *name, struct stat *attr, string dname(name); InodeRef in; - r = _lookup(parent, dname, CEPH_STAT_CAP_INODE_ALL, &in, perms); + // FUSE only cares about st_dev, st_ino and st_rdev + r = _lookup(parent, dname, 0, &in, perms); if (r < 0) { attr->st_ino = 0; goto out;
#8 Updated by Zheng Yan almost 6 years ago
what worry me is comment in fuse_lowlevel.h
struct fuse_entry_param { ... /** Inode attributes. * * Even if attr_timeout == 0, attr must be correct. For example, * for open(), FUSE uses attr.st_size from lookup() to determine * how many bytes to request. If this value is not correct, * incorrect data will be returned. */ struct stat attr; /** Validity timeout (in seconds) for inode attributes. If attributes only change as a result of requests that come through the kernel, this should be set to a very large value. */ double attr_timeout; ... };
#9 Updated by Jeff Layton almost 6 years ago
Ahh yeah, I remember seeing that in there a while back. I guess the danger is that we can end up instantiating an inode with almost completely bogus attributes.
Hmm...if that's the case then your original patch is also problematic, is it not? Could the inode could have gotten flushed out of the kernel core, while we still hold caps in the userland client? Kernel then issues another lookup later, and then we end up passing down an incomplete attr struct, and it instantiates the inode from that.
What we really ought to do is have fuse set some sort of flag in a new inode that the volatile attrs are bogus, and then use that flag to force a ->getattr on it before doing anything that requires up-to-date attrs.
That's a rather invasive change though if there's not already something like that in fuse that we're not aware of.
#10 Updated by Zheng Yan almost 6 years ago
static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to) { struct inode *inode = iocb->ki_filp->f_mapping->host; struct fuse_conn *fc = get_fuse_conn(inode); /* * In auto invalidate mode, always update attributes on read. * Otherwise, only update if we attempt to read past EOF (to ensure * i_size is up to date). */ if (fc->auto_inval_data || (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) { int err; err = fuse_update_attributes(inode, NULL, iocb->ki_filp, NULL); if (err) return err; } return generic_file_read_iter(iocb, to); }
The comment for st_size seems outdate. I agree that "zero mask" shoule be OK for ll_lookup
#11 Updated by Jeff Layton over 5 years ago
- Component(FS) ceph-fuse added
Ok, I think you're probably right there. Do we need a cmake test to ensure the fuse library defines FUSE_AUTO_INVAL_DATA, and check that the kernel sets it on the init request?
#12 Updated by Zheng Yan over 5 years ago
kernel of RHEL7 supports FUSE_AUTO_INVAL_DATA. But FUSE_CAP_DONT_MASK was added in libfuse 3.0. Currently no major linux distribution includes libfuse 3.0.0
#13 Updated by Andras Pataki over 5 years ago
Just checking if there is any viable patch that I could try for this issue in ceph-fuse. We are running into this problem in quite a few of our codes in a very painful way, so any help would be appreciated.
#14 Updated by Patrick Donnelly over 3 years ago
- Status changed from 12 to New