Bug #51948

Cephfs EL8.4 kernel client append mode corruption

Added by Jason Borden 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee:
Category: fs/ceph
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

A file corruption issue occurs when using the CephFS kernel client on EL8.4 kernels (4.18.0-305.3.1 through 4.18.0-305.10.2), although it may have existed prior to EL8.4. The corruption affects files opened for appending: portions of a file's contents are sometimes overwritten with zero bytes. The overwritten portions always seem to start at a 4 KiB boundary. To reproduce the bug I run:

$ for y in {1..1000} ; do echo "$(date -Isec) [testing] Running test number $y" >> test.log ; sleep 1 ; done

This often, though not always, yields a corrupted file. For example, one test run produced the following results:

$ file test.log
test.log: data

$ hexdump -C test.log | head -8
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000fd0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 32 30  |..............20|
00000fe0  32 31 2d 30 37 2d 31 39  54 31 33 3a 33 37 3a 34  |21-07-19T13:37:4|
00000ff0  34 2d 30 36 3a 30 30 20  5b 74 65 73 74 69 6e 67  |4-06:00 [testing|
00001000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001d50  00 00 00 00 00 00 00 00  32 30 32 31 2d 30 37 2d  |........2021-07-|

As you can see, in this instance corruption occurred at offset 0 and at offset 0x1000. The issue does not occur with the ceph-fuse client or with kernel 5.4.135, but it affects at least the EL8.4 kernels. It may be a coincidence, but the corruption seems to occur more often when another client is also reading the file. I have seen this issue with my cluster running Octopus or Pacific, with 1 or 2 active MDS daemons. I have attached a sample corrupted file.
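The 4 KiB alignment observation can be checked mechanically instead of reading hexdump output by eye. The following is only a minimal sketch, assuming a POSIX `od` and `awk`; the function name `zero_runs` is hypothetical and not part of any Ceph tooling. It prints one line per run of zero bytes: the start offset, the run length, and whether the start offset is a multiple of 4096.

```shell
#!/bin/bash
# zero_runs FILE
# Hypothetical helper: list every run of NUL bytes in FILE as
#   <start offset> <length> <aligned|unaligned>
# where "aligned" means the run starts on a 4096-byte boundary.
zero_runs() {
    # od prints each byte as an unsigned decimal; -v disables the "*"
    # folding of repeated lines so that every byte is counted.
    od -An -v -tu1 "$1" | awk '
        BEGIN { off = 0; len = 0; start = 0 }
        function report() {
            printf "%d %d %s\n", start, len,
                   (start % 4096 == 0 ? "aligned" : "unaligned")
        }
        {
            for (i = 1; i <= NF; i++) {
                if ($i == 0) {
                    if (len == 0) start = off   # a new run begins here
                    len++
                } else if (len > 0) {
                    report()                    # run ended at previous byte
                    len = 0
                }
                off++
            }
        }
        END { if (len > 0) report() }           # run reaching end of file
    '
}
```

Running `zero_runs test.log` on a healthy log should print nothing, since the log lines contain no NUL bytes. Any output means zero bytes are present, and the third column shows whether each run begins on a 4 KiB boundary.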

test.log (58.5 KB) Jason Borden, 07/28/2021 07:59 PM

History

#1 Updated by Ilya Dryomov 3 months ago

  • Category set to fs/ceph
  • Assignee set to Jeff Layton
  • Priority changed from Normal to High

#2 Updated by Jeff Layton 3 months ago

  • Status changed from New to Resolved

This is a known bug now fixed in mainline. 8.4.z should get this patch too, but I'm not yet clear on the ETA. See:

https://bugzilla.redhat.com/show_bug.cgi?id=1971101

Since this doesn't concern upstream kernels, I'm going to close this as Resolved. Hopefully RHEL 8.4.z will get the fix soon.
