Bug #37713

Centos 7 kernel client overwriting files

Added by Jozef Kováč 9 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: fs/ceph
Target version: -
Start date: 12/19/2018
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:

Description

I found an odd bug when using CephFS to store logs (syslog-ng) and following the log files on another machine with tail -f: sometimes the log gets stuck, or new writes start overwriting the file from the beginning. At first I thought it was a syslog-ng bug, but it is not. It is easy to reproduce even with echo.

Two machines with the CephFS kernel client (the leading number is the machine on which the command runs):
1. for i in {1..10000}; do echo "1234" >>test; done
2. tail -f test (or cat test | wc -l, less, ...)
1. echo "xxx" >> test
1. for i in {1..10000}; do echo "456" >>test; done

After that, tail -f stops advancing, and the test file contains:
xxx
… (9999x 456)
456
1234
… (9999x 1234)

Sometimes nothing appears to happen at all: the line count from cat test | wc -l does not increase, even on machine 1.
This happens only when the file has been accessed from another machine (with cat, tail, ...) and another write to the file follows.
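
The expected result of the steps above is that every line is appended in order: all 1234 lines, then xxx, then all 456 lines. A minimal sketch of a checker for that, assuming POSIX sh and awk are available and that the file contains only the three strings used in the reproduction. The function name check_append_order is made up for this illustration and is not part of any Ceph tooling:

```shell
# check_append_order FILE
# Succeeds (exit 0) only if FILE consists of "1234" lines, then "xxx"
# lines, then "456" lines, in that order. Any other line content, or a
# line from an earlier phase appearing after a later one, fails the check.
check_append_order() {
    awk '
        BEGIN { rank["1234"] = 1; rank["xxx"] = 2; rank["456"] = 3; last = 0 }
        # A line that is none of the three expected strings (for example a
        # partially overwritten line) is reported immediately.
        !($0 in rank) {
            printf "line %d: unexpected content\n", NR
            bad = 1
            exit
        }
        # A line from an earlier phase after a later one means that an
        # append did not land at the end of the file.
        rank[$0] < last {
            printf "line %d: \"%s\" appears after a later phase\n", NR, $0
            bad = 1
            exit
        }
        { last = rank[$0]; count[$0]++ }
        END {
            if (bad) exit 1
            printf "ok: %d x 1234, %d x xxx, %d x 456\n",
                   count["1234"], count["xxx"], count["456"]
        }
    ' "$1"
}
```

Run it on either machine after the last loop on machine 1 has finished, for example check_append_order test. A file with the overwrite pattern described in this report fails the check, because 1234 lines appear after 456 lines.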

This happens only with the Ceph kernel client on CentOS 7.3~7.6, tested with ceph versions 12.2.2 to 12.2.10 and 13.2.2 (cluster running ceph 12.2.8 and 13.2.2). It works without any problem if the writing client is on Ubuntu 16.04/18.04, or on CentOS 7 with the ceph-fuse client.

Thx for help

History

#1 Updated by Jozef Kováč 9 months ago

It also works fine on CentOS 7 with the elrepo kernel-ml kernel.

#2 Updated by Zheng Yan 8 months ago

I think your kernel is too old and does not handle append writes correctly.

#3 Updated by Ilya Dryomov 6 months ago

Hi Jozef,

Jozef Kováč wrote:

This happen only with ceph kernel client on Centos 7.3~7.6 ...

Are you saying that you observed the bug on 7.6? Which kernel version? I thought it was fixed earlier than that...

Jozef Kováč wrote:

Works also fine on Centos 7 with elrepo kernel-ml kernel.

Do you have the kernel version?

#4 Updated by Ilya Dryomov 6 months ago

  • Category set to fs/ceph

#5 Updated by Jozef Kováč 6 months ago

Ilya Dryomov wrote:

Are you saying that you observed the bug on 7.6? Which kernel version? I thought it was fixed earlier than that...

The problem is still present with the latest kernel, 3.10.0-957.5.1.

Ilya Dryomov wrote:

Do you have the kernel version?

The elrepo kernel-ml line, 4.19~4.20, works without problems.

#6 Updated by Zheng Yan 6 months ago

It's an append write bug


diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 84d17ac3bc8e4..6e71323755db1 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1487,6 +1487,7 @@ retry_snap:
            (ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) {
                struct ceph_snap_context *snapc;
                struct iov_iter i;
+               loff_t orig_ki_pos;
                mutex_unlock(&inode->i_mutex);

                spin_lock(&ci->i_ceph_lock);
@@ -1504,11 +1505,15 @@ retry_snap:

                iov_iter_init(&i, iov, nr_segs, count, 0);

+               orig_ki_pos = iocb->ki_pos;
+               iocb->ki_pos = pos;
                if (file->f_flags & O_DIRECT)
                        written = ceph_direct_read_write(iocb, &i, snapc,
                                                         &prealloc_cf);
                else
                        written = ceph_sync_write(iocb, &i, snapc);
+               if (iocb->ki_pos == pos)
+                       iocb->ki_pos = orig_ki_pos;

                ceph_put_snap_context(snapc);
        } else {

#7 Updated by Dan van der Ster 6 months ago

We are also affected in our HPC environment and have opened a ticket with Red Hat support.

We also opened an issue with CentOS (https://bugs.centos.org/view.php?id=15953) and they have added this patch to the plus kernel. A build is available here:

https://people.centos.org/toracat/kernel/7/plus/bug15953/

I confirmed this fixes the issue for us.

#8 Updated by Jozef Kováč 6 months ago

Dan van der Ster wrote:

I confirmed this fixes the issue for us.

Fixed for me too.

#9 Updated by Zheng Yan 3 months ago

  • Status changed from New to Resolved

Fixed in 7.5 (kernel-3.10.0-862.33.1.el7) and 7.6 (kernel-3.10.0-957.16.1.el7).
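
To check whether a given CentOS 7 kernel is at or past the fixed builds named above, its release string can be compared against them. A rough sketch, assuming GNU sort with -V (version sort) and release strings shaped like the output of uname -r on CentOS 7. kernel_has_append_fix is a hypothetical helper name, and it deliberately gives no verdict for kernels outside the 862 and 957 series, since this report does not say which other builds carry the fix:

```shell
# kernel_has_append_fix RELEASE
# RELEASE is a string as printed by `uname -r`,
# e.g. 3.10.0-957.5.1.el7.x86_64.
# Return codes:
#   0  at or above the fixed build for its series
#   1  older than the fixed build for its series
#   2  not in the 862 (7.5) or 957 (7.6) series, so no verdict
kernel_has_append_fix() {
    # Drop the distribution/architecture suffix (".el7.x86_64" and similar).
    rel=${1%%.el7*}
    case "$rel" in
        3.10.0-862|3.10.0-862.*) min=3.10.0-862.33.1 ;;
        3.10.0-957|3.10.0-957.*) min=3.10.0-957.16.1 ;;
        *) return 2 ;;
    esac
    # With version sort, the smaller of the two strings comes first.
    # If that is the minimum fixed build, RELEASE is at or above it.
    lowest=$(printf '%s\n%s\n' "$min" "$rel" | sort -V | head -n 1)
    [ "$lowest" = "$min" ]
}
```

For example, kernel_has_append_fix "$(uname -r)"; echo $? prints the verdict for the running kernel. The 3.10.0-957.5.1 kernel mentioned in comment #5 yields 1, which matches the report that the problem was still present there.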
