Bug #40985

closed

xfstest generic/451 intermittently fails

Added by Jeff Layton over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

xfstest generic/451 runs a program that issues AIO+DIO writes to a file and then reads the data back (via synchronous, buffered pread) to verify it. On its own this program generally works fine, but the test also spawns several other processes that repeatedly read the file using buffered reads. With the concurrent readers running, the verification usually (though not always) fails with stale data from the buffered read:

$ src/aio-dio-regress/aio-dio-cycle-write -c 999999 -b 655360 /mnt/cephfs/tst-aio-dio-cycle-write.451
get stale data from buffer read
00000000  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
*
0009d000  55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55  UUUUUUUUUUUUUUUU
*
0009f000  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
000a0000
Actions #1

Updated by Jeff Layton over 4 years ago

I've been working on this today. ceph_direct_read_write invalidates the inode's pagecache (for the given range) at the beginning of the operation.

My hypothesis at this point is that an occasional buffered read races in after the invalidation but before the write occurs, and repopulates the invalidated pages with stale data. I suspect we'll need to re-invalidate the pages in the written range after the write reply comes in.

I have crafted a few quick-and-dirty patches to try to fix this (mainly to confirm that this is the problem), but I still see the failure, so either I don't understand the problem yet or the race window is larger than I think.
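
The hypothesized race can be modeled with a small deterministic sketch. This is a toy model in Python, not the kernel code: the `Inode` class, `direct_write` function, and the `racing_reader` and `reinvalidate` parameters are all hypothetical names made up for illustration. It only shows why a buffered read landing between the invalidation and the write would leave stale pages behind, and why a second invalidation after the write would clear them in this simplified model. It says nothing about whether that is sufficient in the real client.

```python
PAGE = 4096  # toy page size; file sizes below are multiples of it


class Inode:
    """Toy inode: 'backing' stands in for the OSDs, 'pagecache' for cached pages."""

    def __init__(self, size):
        self.backing = bytearray(b"\xaa" * size)
        self.pagecache = {}  # page index -> cached bytes

    def _pages(self, off, length):
        return range(off // PAGE, (off + length + PAGE - 1) // PAGE)

    def buffered_read(self, off, length):
        out = bytearray()
        for idx in self._pages(off, length):
            if idx not in self.pagecache:
                # Cache miss: populate the page from the backing store.
                start = idx * PAGE
                self.pagecache[idx] = bytes(self.backing[start:start + PAGE])
            out += self.pagecache[idx]
        skip = off % PAGE
        return bytes(out[skip:skip + length])

    def invalidate(self, off, length):
        for idx in self._pages(off, length):
            self.pagecache.pop(idx, None)


def direct_write(inode, off, data, racing_reader=None, reinvalidate=False):
    """Hypothetical model of a direct write that bypasses the page cache."""
    inode.invalidate(off, len(data))         # invalidate at the start
    if racing_reader is not None:
        racing_reader()                      # buffered read sneaks in here
    inode.backing[off:off + len(data)] = data
    if reinvalidate:
        inode.invalidate(off, len(data))     # proposed second invalidation


new = b"\x55" * PAGE

racy = Inode(4 * PAGE)
direct_write(racy, PAGE, new, racing_reader=lambda: racy.buffered_read(PAGE, PAGE))
stale = racy.buffered_read(PAGE, PAGE)       # still the old 0xaa pattern

fixed = Inode(4 * PAGE)
direct_write(fixed, PAGE, new,
             racing_reader=lambda: fixed.buffered_read(PAGE, PAGE),
             reinvalidate=True)
fresh = fixed.buffered_read(PAGE, PAGE)      # the newly written 0x55 pattern
```

In this model the window is exactly one step wide. The real window may be wider, which would be consistent with the test patches not making the failure go away.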

Actions #2

Updated by Jeff Layton over 4 years ago

I've been going over the NFS client code to determine why we don't see this problem there. The NFS client makes buffered and direct I/O operations mutually exclusive. While a buffered I/O operation is going on, direct I/O is blocked, and vice versa. This allows it to invalidate caches at the appropriate times.

We'll probably want to implement something similar. This is one of the iterations of the patch series that shifted all of that around:

https://www.spinics.net/lists/linux-nfs/msg58756.html
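
The mutual exclusion described above can be illustrated with a two-mode lock. This is a minimal userspace sketch in Python, assuming only what is stated here (buffered and direct I/O block each other). The `IoModeLock` class and its methods are hypothetical and do not mirror the NFS client's actual functions or data structures.

```python
import threading


class IoModeLock:
    """Two-mode lock: many holders of one mode at a time, never both modes."""

    def __init__(self):
        self._cond = threading.Condition()
        self._mode = None    # "buffered", "direct", or None when idle
        self._holders = 0

    def acquire(self, mode, blocking=True):
        with self._cond:
            # Wait while the other mode currently holds the lock.
            while self._holders and self._mode != mode:
                if not blocking:
                    return False
                self._cond.wait()
            self._mode = mode
            self._holders += 1
            return True

    def release(self):
        with self._cond:
            self._holders -= 1
            if self._holders == 0:
                # Last holder out: let the other mode in.
                self._mode = None
                self._cond.notify_all()
```

A direct writer would take the lock in "direct" mode around the invalidate-and-write sequence, and buffered readers would take it in "buffered" mode, so no buffered read could repopulate the cache in the middle of a direct write. This sketch makes no attempt at fairness, so a steady stream of one mode could starve the other.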

Actions #4

Updated by Jeff Layton over 4 years ago

  • Status changed from New to In Progress
Actions #5

Updated by Jeff Layton over 4 years ago

  • Status changed from In Progress to Resolved

Merged for v5.4-rc1
