Bug #40985
Closed
xfstest generic/451 intermittently fails
Description
xfstest generic/451 runs a program that does AIO+DIO writes to a file and then reads the data back (via synchronous, buffered pread) to verify it. On its own, this program generally works fine, but the test also spawns a number of other processes that repeatedly read the same file using buffered reads. With those concurrent readers, verification usually (though not always) fails because the buffered read returns stale data:
$ src/aio-dio-regress/aio-dio-cycle-write -c 999999 -b 655360 /mnt/cephfs/tst-aio-dio-cycle-write.451
get stale data from buffer read
00000000  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
*
0009d000  55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55  UUUUUUUUUUUUUUUU
*
0009f000  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
000a0000
Updated by Jeff Layton over 4 years ago
I've been working on this today. ceph_direct_read_write invalidates the inode's pagecache (for the given range) at the beginning of the operation.
My hypothesis at this point is that we have the occasional buffered read that races in after that point but before the write occurs. That repopulates the invalidated pages with stale data. I suspect we'll need to re-invalidate the pages in the written range after the write reply comes in.
I have crafted a few quick-and-dirty patches to try to fix this (mainly to confirm that this is the problem), but the problem still occurs, so either I don't understand it yet or the race window is larger than I think.
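The hypothesized interleaving can be sketched as a toy model. This is only an illustration of the hypothesis, not the Ceph client code: `ToyFile`, `backing`, `cache`, and the `racing_read` / `reinvalidate` switches are all made-up names, and the model covers just the single interleaving described above. It says nothing about why the quick patches did not make the real failure go away.

```python
# Toy model only: "backing" stands in for the data held by the cluster and
# "cache" for the client's pagecache. None of these names come from Ceph.

class ToyFile:
    def __init__(self, size, fill):
        self.backing = bytearray([fill]) * size   # authoritative data
        self.cache = {}                           # offset -> cached byte

    def buffered_read(self, off, length):
        # Fill cache misses from the backing store, then serve from the cache.
        for i in range(off, off + length):
            if i not in self.cache:
                self.cache[i] = self.backing[i]
        return bytes(self.cache[i] for i in range(off, off + length))

    def invalidate(self, off, length):
        for i in range(off, off + length):
            self.cache.pop(i, None)

    def direct_write(self, off, data, racing_read=False, reinvalidate=False):
        self.invalidate(off, len(data))           # invalidate before the write
        if racing_read:
            # A concurrent buffered reader slips in here and repopulates
            # the cache with the old contents.
            self.buffered_read(off, len(data))
        self.backing[off:off + len(data)] = data  # the write reaches storage
        if reinvalidate:
            self.invalidate(off, len(data))       # drop pages refilled in the window


def verify(racing_read, reinvalidate):
    f = ToyFile(16, 0xAA)
    f.direct_write(0, b"\x55" * 16, racing_read, reinvalidate)
    return f.buffered_read(0, 16) == b"\x55" * 16


print(verify(racing_read=False, reinvalidate=False))  # True
print(verify(racing_read=True, reinvalidate=False))   # False: stale 0xaa bytes
print(verify(racing_read=True, reinvalidate=True))    # True
```

In this model the second invalidation is enough, but only because the racing read is pinned to one point in the sequence. A reader that arrives at some other point in a real system is outside what the sketch covers.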
Updated by Jeff Layton over 4 years ago
I've been going over the NFS client code to determine why we don't see this problem there. The NFS client makes buffered and direct I/O operations mutually exclusive. While a buffered I/O operation is going on, direct I/O is blocked, and vice versa. This allows it to invalidate caches at the appropriate times. This is one of the iterations of the patch series that shifted all of that around.
We'll probably want to implement something similar:
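As a rough userspace picture of that kind of exclusion, here is a minimal sketch of a lock that admits any number of operations of one kind (buffered or direct) but never both kinds at once. It is an assumption-laden illustration, not the NFS or Ceph implementation: `IoModeLock`, its methods, and the mode strings are hypothetical names, and kernel code would use inode locking primitives rather than a Python condition variable.

```python
import threading


class IoModeLock:
    """Hypothetical sketch: many holders of one mode, never both modes at once."""

    def __init__(self):
        self._cond = threading.Condition()
        self._mode = None     # "buffered", "direct", or None when idle
        self._holders = 0     # number of operations currently in flight

    def acquire(self, mode, blocking=True):
        with self._cond:
            # Wait while operations of the *other* mode are still in flight.
            while self._holders and self._mode != mode:
                if not blocking:
                    return False
                self._cond.wait()
            # A switch into "direct" mode would be a natural point to flush
            # and invalidate cached pages (an assumption of this sketch).
            self._mode = mode
            self._holders += 1
            return True

    def release(self):
        with self._cond:
            self._holders -= 1
            if self._holders == 0:
                self._mode = None
                self._cond.notify_all()   # let the other mode proceed


lock = IoModeLock()
lock.acquire("buffered")
lock.acquire("buffered")                        # same mode: shared
print(lock.acquire("direct", blocking=False))   # False: buffered I/O in flight
lock.release()
lock.release()
print(lock.acquire("direct", blocking=False))   # True: no buffered I/O left
lock.release()
```

This sketch makes no attempt at fairness, so a steady stream of one kind of I/O could starve the other; a real design would have to consider that.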
Updated by Jeff Layton over 4 years ago
Patch submitted:
Updated by Jeff Layton over 4 years ago
- Status changed from In Progress to Resolved
Merged for v5.4-rc1