Bug #17620


Data Integrity Issue with kernel client vs fuse client

Added by Aaron Bassett over 7 years ago. Updated over 5 years ago.



3 - minor


ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
kernel: 4.4.0-42-generic #62~14.04.1-Ubuntu SMP
I have a cluster with 10 osd nodes with 5 platters on each and 1 mds. I am mounting cephfs from 21 compute nodes and using mesos to schedule jobs across them. One of my jobs uses an s3 client that supports multipart downloads to speed up transfers. This code can be seen here:

This tool also includes a `verify` command which computes an ETag from a file on disk and compares it to the object's ETag in the object store.
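For readers who want to reproduce the verify step without the tool, here is a minimal sketch. It assumes the object store uses the common S3 multipart ETag scheme (MD5 of the concatenated per-part MD5 digests, followed by `-` and the part count) and that the object was uploaded in fixed-size parts. The function names and the 10 MB default part size are illustrative assumptions, not taken from the tool itself.

```python
import hashlib


def multipart_etag(path, part_size):
    """S3-style multipart ETag: md5(md5(part1) + md5(part2) + ...)-N.

    Assumes fixed-size parts of `part_size` bytes (last part may be shorter).
    """
    digests = []
    with open(path, "rb") as f:
        while True:
            part = f.read(part_size)
            if not part:
                break
            digests.append(hashlib.md5(part).digest())
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


def plain_md5(path, block_size=1 << 20):
    """Plain MD5 hex digest, as used for objects uploaded in a single PUT."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def verify(path, etag, part_size=10 * 1024 * 1024):
    """Compare a local file against an object's ETag (hypothetical helper)."""
    etag = etag.strip('"')  # ETags are often returned wrapped in quotes
    if "-" in etag:
        return multipart_etag(path, part_size) == etag
    return plain_md5(path) == etag
```

The part size has to match the one used at upload time, otherwise the multipart ETag will differ even for an intact file.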

When mounting cephfs with the kernel client and running many downloads at once, with 3 threads each, the verify step occasionally fails. These files' md5sums also do not match a file downloaded to a more traditional (local) filesystem. The file size matches the good file. When diffed with `cmp`, the output looks like:

50498910209 174 |    0 ^@
50498910210 202 M-^B 0 ^@
50498910211  22 ^R   0 ^@
50498910212 154 l    0 ^@
50498910213 262 M-2  0 ^@
50498910214 374 M-|  0 ^@
50498910215 105 E    0 ^@

In this example, I have 59594209 zeroed-out bytes in a 96G file.
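For anyone wanting to locate the zeroed regions rather than just count them, a rough sketch along these lines may help. It compares a known-good copy against the suspect copy and reports byte ranges where the suspect file is all zeros but the good file has data. The function name, the two-copy setup, and the 1 MiB block size are assumptions for illustration only.

```python
import re

ZERO_RUN = re.compile(rb"\x00+")


def zeroed_ranges(good_path, bad_path, block_size=1 << 20):
    """Return half-open (start, end) byte ranges that are zero in the bad
    file while the good file contains at least one non-zero byte there."""
    ranges = []
    offset = 0
    with open(good_path, "rb") as good, open(bad_path, "rb") as bad:
        while True:
            g = good.read(block_size)
            b = bad.read(block_size)
            if not g or not b:
                break
            if g != b:  # identical blocks are skipped cheaply
                for m in ZERO_RUN.finditer(b):
                    s, e = m.span()
                    if any(g[s:e]):  # good copy has real data here
                        start, end = offset + s, offset + e
                        if ranges and ranges[-1][1] == start:
                            # run continues across a block boundary: merge
                            ranges[-1] = (ranges[-1][0], end)
                        else:
                            ranges.append((start, end))
            offset += len(b)
    return ranges
```

Summing `end - start` over the result gives the number of bytes inside those runs, and the range boundaries can be checked against page or chunk sizes.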

When the compute nodes mount cephfs with the fuse client, I do not have any data integrity issues. However, my maximum throughput is more than 50% lower, so I'd really like to sort out the issue with the kernel client.

I've been watching the mds daemon logs and haven't seen it complain about anything other than blocked requests during heavy writes. I haven't seen them go over 32s, so they seem to be transient. I'm not sure if that's known/expected with cephfs or if it may be indicative of a problem. I see the blocked requests with both the kernel and fuse clients. I'll note that all my osds and clients have dual 10G nics, so it's very easy for a heavily loaded disk to become a bottleneck. I also occasionally get "client failing to respond to cache pressure" during these heavy write periods, but

As a bit of a side note, in testing, a direct single-threaded download is much faster when writing to cephfs, so I will probably eventually move most of the jobs to that technique for this environment. However, any data integrity issue with the kernel client prevents me from using it at all, regardless of whether I change the jobs to be easier on cephfs.

Actions #1

Updated by Zheng Yan over 7 years ago

Were there any ceph-related kernel messages on the client hosts? (such as "ceph: mds0 caps stale")

Actions #2

Updated by Aaron Bassett over 7 years ago

No, I only see:

Oct 19 15:34:52 phx-r2-r3-comp5 kernel: [683656.055247] libceph: client259809 fsid <redacted>
Oct 19 15:34:52 phx-r2-r3-comp5 kernel: [683656.056533] libceph: mon0 <redacted>:6789 session established

Actions #3

Updated by Aaron Bassett over 7 years ago

On further testing, it seems I can only make this happen when doing multi-threaded downloads from multiple hosts. I haven't been able to recreate it from a single host.

Actions #4

Updated by Zheng Yan over 7 years ago

I suspect the zeros are from stale page cache data. If you encounter the issue again, please drop the kernel page cache (echo 3 > /proc/sys/vm/drop_caches), and check the file again.
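The check suggested above could be scripted roughly as follows: checksum the file, drop the page cache, then checksum it again. If the two digests differ, the first read was likely served from stale cached pages. This is only a sketch; the helper names are made up for illustration, writing to `/proc/sys/vm/drop_caches` requires root, and the `os.sync()` call is an added precaution because dropping caches only discards clean pages.

```python
import hashlib
import os


def md5_of(path, block_size=1 << 20):
    """Stream a file through MD5 and return the hex digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def recheck_after_drop(path, drop_caches_path="/proc/sys/vm/drop_caches"):
    """Return (digest_before, digest_after) around a page cache drop.

    `drop_caches_path` is a parameter only so the sketch can be exercised
    without root; on a real client it should stay at its default.
    """
    before = md5_of(path)
    os.sync()  # flush dirty pages first; only clean pages get dropped
    with open(drop_caches_path, "w") as f:
        f.write("3\n")  # 3 = drop page cache plus dentries and inodes
    after = md5_of(path)
    return before, after
```

Run it on the client that produced the bad file: `before != after` would point at the cache, while two identical bad digests would suggest the zeros are actually stored in the cluster.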

Actions #5

Updated by John Spray over 7 years ago

  • Assignee set to Zheng Yan

Actions #6

Updated by Zheng Yan over 7 years ago

I have fixed a bug that may cause this issue. Could you give it a try?

Actions #7

Updated by Aaron Bassett over 7 years ago

Thanks for that. I'm working on getting to the point where I can test that.

In the meantime, further testing has indicated that the integration of docker and cephfs may be the culprit. It looks like all the jobs that failed were running in docker containers with the cephfs mount attached as a volume and the download writing there. Does that potentially shed any new light on the situation?

Actions #8

Updated by Zheng Yan over 7 years ago

I'm not familiar with docker. How did the jobs fail? (What's the symptom?)

Actions #9

Updated by Brett Niver over 7 years ago

  • Status changed from New to Need More Info

Actions #10

Updated by Aaron Bassett over 7 years ago

The jobs failed by incorrectly writing the object to disk. I can recreate this pretty easily by having two clients download the same object in 10 MB chunks with 100 threads. If I run 4 of those per client (so writing to 4 files, with 100 threads opening, seeking, and writing 10 MB chunks, on each of two clients), I will usually get at least one that fails to verify, with a bunch of it zeroed out. Strangely, the zeroed region doesn't seem to map to the chunk size I'm writing. I'm still doing some forensics to figure out whether it aligns with the start or end of a chunk.
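The write pattern described above (many threads each opening the file, seeking to a chunk offset, and writing) can be approximated with a sketch like the one below. It is not the actual download tool: `fetch` is a hypothetical callable standing in for a ranged GET from the object store, and pre-sizing the destination file with `truncate` is an assumption about how such a tool might prepare it.

```python
from concurrent.futures import ThreadPoolExecutor


def write_chunk(dest_path, offset, data):
    """Open a private handle, seek to the chunk offset, and write."""
    with open(dest_path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def chunked_download(fetch, total_size, dest_path,
                     chunk_size=10 * 1024 * 1024, threads=100):
    """Write `total_size` bytes to `dest_path` in parallel chunks.

    `fetch(offset, length)` must return the bytes for that range
    (hypothetical stand-in for the s3 client's ranged read).
    """
    # Assumption: the file is created at full size up front, so regions
    # not yet written read back as zeros.
    with open(dest_path, "wb") as f:
        f.truncate(total_size)

    def job(offset):
        length = min(chunk_size, total_size - offset)
        write_chunk(dest_path, offset, fetch(offset, length))

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() forces completion and re-raises any worker exception
        list(pool.map(job, range(0, total_size, chunk_size)))
```

To mirror the failing setup, this would be run against the same cephfs directory from two clients at once, four destination files per client, followed by a checksum comparison against a known-good copy.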

Actions #11

Updated by Zheng Yan over 5 years ago

  • Status changed from Need More Info to Resolved

Splice read issue. Should be fixed by kernel commit 7ce469a53e7106acdaca2e25027941d0f7c12a8e.

