Bug #40985
closed
xfstest generic/451 intermittently fails
Added by Jeff Layton almost 5 years ago.
Updated over 4 years ago.
Description
xfstest generic/451 runs a program that does AIO+DIO writes to a file and then reads the data back (via synchronous, buffered pread) to verify it. This program generally works fine on its own, but the test also spawns a bunch of other processes that read the file over and over again using buffered reads. With the concurrent readers running, the verification usually (though not always) fails with stale data from the buffered read:
$ src/aio-dio-regress/aio-dio-cycle-write -c 999999 -b 655360 /mnt/cephfs/tst-aio-dio-cycle-write.451
get stale data from buffer read
00000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ................
*
0009d000 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 UUUUUUUUUUUUUUUU
*
0009f000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ................
000a0000
I've been working on this today. ceph_direct_read_write invalidates the inode's pagecache (for the given range) at the beginning of the operation.
My hypothesis at this point is that we have the occasional buffered read that races in after that point but before the write occurs. That repopulates the invalidated pages with stale data. I suspect we'll need to re-invalidate the pages in the written range after the write reply comes in.
I have crafted a few quick-and-dirty patches to try to fix this (mainly to confirm that this is the problem), but I still see the failure occur, so either I don't understand the problem yet or the race window is larger than I think.
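To make the hypothesized ordering concrete, here is a minimal user-space sketch in C. It is a sequential toy model, not ceph code: `toy_inode`, `toy_buffered_read`, `toy_direct_write` and the `racing_reader` / `reinvalidate` switches are all hypothetical names invented for illustration. It only shows why a buffered read that lands between the up-front invalidation and the write would leave stale data cached, and why a second invalidation after the write would drop it. As noted above, patches along these lines did not make the failure go away, so the real window may well be larger than this model suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define FILE_SIZE 16

/* Hypothetical stand-in for an inode: backing store plus one cached "page". */
struct toy_inode {
    unsigned char backing[FILE_SIZE]; /* what the server holds */
    unsigned char cache[FILE_SIZE];   /* what the pagecache holds */
    bool cache_valid;
};

void toy_init(struct toy_inode *ino, unsigned char fill)
{
    memset(ino->backing, fill, FILE_SIZE);
    ino->cache_valid = false;
}

void toy_invalidate(struct toy_inode *ino)
{
    ino->cache_valid = false;
}

/* Buffered read: on a cache miss, fill from the backing store, then serve
 * the byte from the cache. */
unsigned char toy_buffered_read(struct toy_inode *ino, size_t off)
{
    assert(off < FILE_SIZE);
    if (!ino->cache_valid) {
        memcpy(ino->cache, ino->backing, FILE_SIZE);
        ino->cache_valid = true;
    }
    return ino->cache[off];
}

/* Direct write: invalidate up front, then write past the cache.
 * racing_reader simulates a buffered read landing inside the window;
 * reinvalidate models the proposed second invalidation after the write. */
void toy_direct_write(struct toy_inode *ino, unsigned char fill,
                      bool racing_reader, bool reinvalidate)
{
    toy_invalidate(ino);
    if (racing_reader)
        (void)toy_buffered_read(ino, 0); /* repopulates cache with old data */
    memset(ino->backing, fill, FILE_SIZE);
    if (reinvalidate)
        toy_invalidate(ino);
}
```

In this model, starting from a file of 0xaa and calling `toy_direct_write(&ino, 0x55, true, false)` leaves a later `toy_buffered_read` returning 0xaa, while the same call with `reinvalidate` set to `true` returns 0x55.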
I've been going over the NFS client code to determine why we don't see this problem there. The NFS client makes buffered and direct I/O operations mutually exclusive: while a buffered I/O operation is going on, direct I/O is blocked, and vice versa. This allows it to invalidate caches at the appropriate times.
We'll probably want to implement something similar. This is one of the iterations of the NFS patch series that shifted all of that around:
https://www.spinics.net/lists/linux-nfs/msg58756.html
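For illustration only, the mutual-exclusion idea can be sketched in user space with a mutex and a condition variable. This is not the NFS client's code and not a proposed ceph patch; `io_gate`, `io_gate_start`, `io_gate_try_start` and `io_gate_end` are hypothetical names. Any number of operations of the same kind may run together, but a buffered operation and a direct operation never overlap.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

enum io_mode { IO_BUFFERED, IO_DIRECT };

/* Hypothetical per-inode gate: tracks which kind of I/O is in flight. */
struct io_gate {
    pthread_mutex_t lock;
    pthread_cond_t idle;   /* broadcast when the last holder leaves */
    enum io_mode mode;     /* meaningful only while holders > 0 */
    unsigned int holders;  /* operations currently in flight */
};

void io_gate_init(struct io_gate *g)
{
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->idle, NULL);
    g->mode = IO_BUFFERED;
    g->holders = 0;
}

/* An operation may enter if nothing is in flight or if everything in
 * flight is of the same kind. Caller must hold g->lock. */
static bool io_gate_admissible(const struct io_gate *g, enum io_mode m)
{
    return g->holders == 0 || g->mode == m;
}

/* Non-blocking variant: returns false instead of waiting. */
bool io_gate_try_start(struct io_gate *g, enum io_mode m)
{
    bool ok;

    pthread_mutex_lock(&g->lock);
    ok = io_gate_admissible(g, m);
    if (ok) {
        g->mode = m;
        g->holders++;
    }
    pthread_mutex_unlock(&g->lock);
    return ok;
}

/* Blocking variant: waits until the other kind of I/O has drained. */
void io_gate_start(struct io_gate *g, enum io_mode m)
{
    pthread_mutex_lock(&g->lock);
    while (!io_gate_admissible(g, m))
        pthread_cond_wait(&g->idle, &g->lock);
    g->mode = m;
    g->holders++;
    pthread_mutex_unlock(&g->lock);
}

void io_gate_end(struct io_gate *g)
{
    pthread_mutex_lock(&g->lock);
    assert(g->holders > 0);
    if (--g->holders == 0)
        pthread_cond_broadcast(&g->idle);
    pthread_mutex_unlock(&g->lock);
}
```

With a gate like this, a direct write could invalidate the cache after `io_gate_start(&g, IO_DIRECT)` and know that no buffered read repopulates it until `io_gate_end`. The sketch makes no fairness guarantee: a steady stream of one kind of I/O can starve the other, which a real implementation would need to address.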
- Status changed from New to In Progress
- Status changed from In Progress to Resolved