Bug #37713
closed
Centos 7 kernel client overwriting files
Description
I found an odd bug when using CephFS to store logs (syslog-ng) while following the log files on another machine with tail -f: sometimes the log gets stuck, or writes start overwriting the file from the beginning. At first I thought it was a syslog-ng bug, but it is not. It is easy to reproduce even with echo.
Two machines with the CephFS kernel client (the leading number is the machine the command runs on):
1. for i in {1..10000}; do echo "1234" >>test; done
2. tail -f test or cat test | wc -l, less...
1. echo "xxx" >> test
1. for i in {1..10000}; do echo "456" >>test; done
and tail -f stops moving; the test file then contains:
xxx
… (9999x 456)
456
1234
… (9999x 1234)
Sometimes nothing happens at all: the line count from cat test | wc -l does not increase, even on machine 1.
This happens only when the file is accessed on another machine (with cat or tail) and another write to the file follows.
This happens only with the Ceph kernel client on CentOS 7.3~7.6, tested with Ceph versions 12.2.2 to 12.2.10 and 13.2.2 (cluster running Ceph 12.2.8 and 13.2.2). It works without any problem if the writing client is on Ubuntu 16.04/18.04, or on CentOS 7 with the Ceph FUSE client.
Thx for help
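The corruption described above can be detected mechanically rather than by eye. The following is a minimal sketch, assuming the test file was produced by exactly the three write phases in the reproduction steps (a block of "1234" lines, one "xxx" line, then a block of "456" lines). The function name check_append_order is hypothetical and not part of any Ceph tooling.

```shell
# Hypothetical helper: exit status 0 if FILE contains the appends in the
# order they were issued (1234 block, then xxx, then 456 block),
# non-zero if any line is out of order or unexpected.
check_append_order() {
    awk '
        BEGIN { phase = 0; bad = 0 }
        # phase 0: still inside the initial "1234" block
        $0 == "1234" { if (phase != 0) bad = 1; next }
        # "xxx" must appear while still in phase 0; it starts phase 1
        $0 == "xxx"  { if (phase != 0) bad = 1; phase = 1; next }
        # "456" lines are only valid after "xxx"
        $0 == "456"  { if (phase != 1) bad = 1; next }
        # anything else is unexpected content
        { bad = 1 }
        END { exit bad }
    ' "$1"
}
```

Running `check_append_order test` on either machine after the last loop finishes should succeed on an unaffected client. A file that starts with "xxx", as in the listing above, makes it fail.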
Updated by Jozef Kováč over 5 years ago
It also works fine on CentOS 7 with the elrepo kernel-ml kernel.
Updated by Zheng Yan over 5 years ago
I think your kernel is too old and does not handle append writes correctly.
Updated by Ilya Dryomov about 5 years ago
Hi Jozef,
Jozef Kováč wrote:
This happens only with the Ceph kernel client on CentOS 7.3~7.6 ...
Are you saying that you observed the bug on 7.6? Which kernel version? I thought it was fixed earlier than that...
Jozef Kováč wrote:
It also works fine on CentOS 7 with the elrepo kernel-ml kernel.
Do you have the kernel version?
Updated by Jozef Kováč about 5 years ago
Ilya Dryomov wrote:
Hi Jozef,
Jozef Kováč wrote:
This happens only with the Ceph kernel client on CentOS 7.3~7.6 ...
Are you saying that you observed the bug on 7.6? Which kernel version? I thought it was fixed earlier than that...
The problem is still present with the latest kernel, 3.10.0-957.5.1.
Jozef Kováč wrote:
It also works fine on CentOS 7 with the elrepo kernel-ml kernel.
Do you have the kernel version?
The elrepo ml line, 4.19~4.20, works without problems.
Updated by Zheng Yan about 5 years ago
It's an append write bug
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 84d17ac3bc8e4..6e71323755db1 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1487,6 +1487,7 @@ retry_snap:
 	    (ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) {
 		struct ceph_snap_context *snapc;
 		struct iov_iter i;
+		loff_t orig_ki_pos;
 		mutex_unlock(&inode->i_mutex);
 		spin_lock(&ci->i_ceph_lock);
@@ -1504,11 +1505,15 @@ retry_snap:
 		iov_iter_init(&i, iov, nr_segs, count, 0);
+		orig_ki_pos = iocb->ki_pos;
+		iocb->ki_pos = pos;
 		if (file->f_flags & O_DIRECT)
 			written = ceph_direct_read_write(iocb, &i, snapc, &prealloc_cf);
 		else
 			written = ceph_sync_write(iocb, &i, snapc);
+		if (iocb->ki_pos == pos)
+			iocb->ki_pos = orig_ki_pos;
 		ceph_put_snap_context(snapc);
 	} else {
Updated by Dan van der Ster about 5 years ago
We are also affected in our HPC environment and have opened a ticket with Red Hat support.
We also opened an issue with CentOS (https://bugs.centos.org/view.php?id=15953) and they have added this patch to the plus kernel. A build is available here:
https://people.centos.org/toracat/kernel/7/plus/bug15953/
I confirmed this fixes the issue for us.
Updated by Jozef Kováč about 5 years ago
Dan van der Ster wrote:
We are also affected in our HPC environment and have opened a ticket with Red Hat support.
We also opened an issue with CentOS (https://bugs.centos.org/view.php?id=15953) and they have added this patch to the plus kernel. A build is available here:
https://people.centos.org/toracat/kernel/7/plus/bug15953/
I confirmed this fixes the issue for us.
Fixed for me too.
Updated by Zheng Yan almost 5 years ago
- Status changed from New to Resolved
Fixed in 7.5 (kernel-3.10.0-862.33.1.el7) and 7.6 (kernel-3.10.0-957.16.1.el7).
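To check whether a given CentOS/RHEL 7 kernel is at or past the fixed builds named above, the release string can be compared per update stream. This is a rough sketch only: the function names are hypothetical, it relies on GNU `sort -V`, and it deliberately reports "unknown" (status 2) for any kernel outside the 862 (7.5) and 957 (7.6) streams, since this ticket does not state which other builds carry the fix.

```shell
# Returns 0 if version $1 is >= version $2 according to GNU sort -V.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper. Argument: a kernel release string such as
# "3.10.0-957.5.1.el7.x86_64" (the output of `uname -r`).
# Exit status: 0 = at or past a fixed build named in this ticket,
#              1 = older than the fixed build of its stream,
#              2 = unknown (not a 3.10.0-862.* or 3.10.0-957.* kernel).
kernel_has_append_fix() {
    khaf_release="$1"
    khaf_base="${khaf_release%%-*}"      # e.g. "3.10.0"
    khaf_build="${khaf_release#*-}"      # e.g. "957.5.1.el7.x86_64"
    khaf_build="${khaf_build%%.el7*}"    # e.g. "957.5.1"
    khaf_stream="${khaf_build%%.*}"      # e.g. "957"

    if [ "$khaf_base" != "3.10.0" ]; then
        return 2
    fi
    case "$khaf_stream" in
        862) version_ge "$khaf_build" "862.33.1" ;;   # 7.5 stream
        957) version_ge "$khaf_build" "957.16.1" ;;   # 7.6 stream
        *)   return 2 ;;
    esac
}
```

A typical call would be `kernel_has_append_fix "$(uname -r)"`. Comparing per stream matters because 3.10.0-957.5.1 sorts after 3.10.0-862.33.1 yet was reported in this thread as still affected.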