Bug #37713

closed

Centos 7 kernel client overwriting files

Added by Jozef Kováč over 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: fs/ceph
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

I found an odd bug when using CephFS to store logs (syslog-ng) while following the log file on another machine with tail -f: sometimes the log gets stuck, or new writes start overwriting the file from the beginning. At first I thought it was a syslog-ng bug, but it is not. It is easy to reproduce even with echo.

Two machines with the CephFS kernel client (the prefix shows which machine runs each command):
machine 1: for i in {1..10000}; do echo "1234" >>test; done
machine 2: tail -f test (or cat test | wc -l, less, ...)
machine 1: echo "xxx" >> test
machine 1: for i in {1..10000}; do echo "456" >>test; done

After this, tail -f stops advancing, and the test file contains:
xxx
… (9999x 456)
456
1234
… (9999x 1234)

Sometimes nothing happens at all: the line count reported by cat test | wc -l does not increase, even on machine 1.
This happens only when the file is accessed from another machine (with cat, tail, ...) and another write to the file follows.
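
A minimal sketch for checking the outcome of the reproduction above, run on machine 1 once the three write steps have finished. The function name check_append_order is hypothetical (not from the report), and the check assumes the file held nothing else before the test: n lines of "1234", one "xxx", then n lines of "456", in that order.

```shell
# check_append_order FILE [N]
# Succeeds (exit 0) only if FILE holds N lines of "1234", then one "xxx",
# then N lines of "456". N defaults to 10000, as in the report.
# Hypothetical helper; it only inspects the file and does no writing.
check_append_order() {
    file=$1
    n=${2:-10000}
    awk -v n="$n" '
        NR <= n         { if ($0 != "1234") bad = 1; next }  # first append phase
        NR == n + 1     { if ($0 != "xxx")  bad = 1; next }  # single marker line
        NR <= 2 * n + 1 { if ($0 != "456")  bad = 1; next }  # second append phase
                        { bad = 1 }                          # unexpected extra lines
        END { if (bad || NR != 2 * n + 1) exit 1 }
    ' "$file"
}
```

Usage: check_append_order test && echo "append order OK" || echo "file was overwritten or truncated". The overwritten layout shown above (xxx first, 1234 last) makes the check fail on its first line.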

This happens only with the Ceph kernel client on CentOS 7.3~7.6, tested with Ceph versions 12.2.2 to 12.2.10 and 13.2.2 (cluster running Ceph 12.2.8 and 13.2.2). It works without any problem if the writing client is on Ubuntu 16.04/18.04, or on CentOS 7 with the Ceph FUSE client.

Thanks for the help.

Actions #1

Updated by Jozef Kováč over 5 years ago

Works also fine on Centos 7 with elrepo kernel-ml kernel.

Actions #2

Updated by Zheng Yan over 5 years ago

I think your kernel is too old; it does not handle append writes correctly.

Actions #3

Updated by Ilya Dryomov about 5 years ago

Hi Jozef,

Jozef Kováč wrote:

This happen only with ceph kernel client on Centos 7.3~7.6 ...

Are you saying that you observed the bug on 7.6? Which kernel version? I thought it was fixed earlier than that...

Jozef Kováč wrote:

Works also fine on Centos 7 with elrepo kernel-ml kernel.

Do you have the kernel version?

Actions #4

Updated by Ilya Dryomov about 5 years ago

  • Category set to fs/ceph

Actions #5

Updated by Jozef Kováč about 5 years ago

Ilya Dryomov wrote:

Hi Jozef,

Jozef Kováč wrote:

This happen only with ceph kernel client on Centos 7.3~7.6 ...

Are you saying that you observed the bug on 7.6? Which kernel version? I thought it was fixed earlier than that...

The problem is still present with the latest kernel, 3.10.0-957.5.1.

Jozef Kováč wrote:

Works also fine on Centos 7 with elrepo kernel-ml kernel.

Do you have the kernel version?

The ELRepo kernel-ml line, 4.19~4.20, works without problems.

Actions #6

Updated by Zheng Yan about 5 years ago

It's an append write bug


diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 84d17ac3bc8e4..6e71323755db1 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1487,6 +1487,7 @@ retry_snap:
            (ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) {
                struct ceph_snap_context *snapc;
                struct iov_iter i;
+               loff_t orig_ki_pos;
                mutex_unlock(&inode->i_mutex);

                spin_lock(&ci->i_ceph_lock);
@@ -1504,11 +1505,15 @@ retry_snap:

                iov_iter_init(&i, iov, nr_segs, count, 0);

+               orig_ki_pos = iocb->ki_pos;
+               iocb->ki_pos = pos;
                if (file->f_flags & O_DIRECT)
                        written = ceph_direct_read_write(iocb, &i, snapc,
                                                         &prealloc_cf);
                else
                        written = ceph_sync_write(iocb, &i, snapc);
+               if (iocb->ki_pos == pos)
+                       iocb->ki_pos = orig_ki_pos;

                ceph_put_snap_context(snapc);
        } else {

Actions #7

Updated by Dan van der Ster about 5 years ago

We are also affected in our HPC environment and have opened a ticket with Red Hat support.

We also opened an issue with CentOS (https://bugs.centos.org/view.php?id=15953) and they have added this patch to the plus kernel. A build is available here:

https://people.centos.org/toracat/kernel/7/plus/bug15953/

I confirmed this fixes the issue for us.

Actions #8

Updated by Jozef Kováč about 5 years ago

Dan van der Ster wrote:

We are also affected in our HPC environment and have opened a ticket with Red Hat support.

We also opened an issue with CentOS (https://bugs.centos.org/view.php?id=15953) and they have added this patch to the plus kernel. A build is available here:

https://people.centos.org/toracat/kernel/7/plus/bug15953/

I confirmed this fixes the issue for us.

Fixed for me too.

Actions #9

Updated by Zheng Yan almost 5 years ago

  • Status changed from New to Resolved

Fixed in 7.5 (kernel-3.10.0-862.33.1.el7) and 7.6 (kernel-3.10.0-957.16.1.el7).
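
A rough sketch for comparing a CentOS/RHEL 7 kernel release string against the two fixed builds named above. The function names are hypothetical, sort -V requires GNU coreutils, and treating build streams newer than 957 as fixed is an assumption that this thread does not confirm.

```shell
# version_ge A B: true when A >= B under GNU sort's version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# el7_kernel_has_fix RELEASE
#   0 = at or after a fixed build, 1 = older than the fix,
#   2 = not a 3.10.0 el7-style release (e.g. an ELRepo kernel-ml build).
el7_kernel_has_fix() {
    release=$1
    case "$release" in
        3.10.0-*) ;;
        *) return 2 ;;
    esac
    build=${release#3.10.0-}   # e.g. 957.5.1.el7.x86_64
    build=${build%%.el7*}      # e.g. 957.5.1
    major=${build%%.*}         # e.g. 957
    case "$major" in
        ''|*[!0-9]*) return 2 ;;
    esac
    if [ "$major" -eq 862 ]; then
        version_ge "$build" 862.33.1     # 7.5 stream
    elif [ "$major" -eq 957 ]; then
        version_ge "$build" 957.16.1     # 7.6 stream
    elif [ "$major" -gt 957 ]; then
        return 0                         # assumption: later streams carry the fix
    else
        return 1
    fi
}
```

Usage: el7_kernel_has_fix "$(uname -r)" && echo "fix present". The 3.10.0-957.5.1 kernel reported as affected earlier in this thread returns 1.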
