client: simultaneous readdirs are very racy
Imagine we have a ceph-fuse user doing readdirs a and b on a very large directory (which requires multiple MDS round-trips, and multiple local readdir syscalls for every MDS round trip).
readdir a finishes first. Because the directory wasn't changed, it marks the directory COMPLETE|ORDERED.
readdir b has last received an MDS readdir for offsets x to y and is serving those results.
readdir c starts from offset 0.
b finishes up to y, and sends off an MDS request to readdir starting at y+1
readdir c reaches location y+1 from cache
b's response comes in. It pushes the range y+1 to z to the back of the directory's dentry xlist!
readdir c continues up to z before readdir b manages to get z+1 read back from the MDS.
readdir c ends prematurely because xlist::iterator::end() returns true.
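To make the sequence above concrete, here is a minimal single-threaded sketch of it. It is not the client code: `DentryList`, `push_range_to_back` and `simulate_reader_c` are hypothetical names, a `std::list<int>` of offsets stands in for the dentry xlist, and it assumes that "pushing to the back" moves the already-cached dentries (so an iterator sitting on one of them moves with it).

```cpp
#include <list>
#include <vector>

// Hypothetical stand-in for the directory's dentry xlist: one int per offset.
using DentryList = std::list<int>;

// Assumed model of b's MDS reply: every dentry in [first, last] is moved to
// the back of the list. std::list::splice keeps iterators to moved elements
// valid, so a reader parked on one of them travels to the back with it.
void push_range_to_back(DentryList& l, int first, int last) {
    for (int off = first; off <= last; ++off) {
        for (auto it = l.begin(); it != l.end(); ++it) {
            if (*it == off) {
                l.splice(l.end(), l, it);
                break;
            }
        }
    }
}

// Replays the interleaving for readdir c and returns the offsets c saw.
std::vector<int> simulate_reader_c(int total, int y, int z) {
    DentryList cache;
    for (int off = 0; off < total; ++off) cache.push_back(off);

    std::vector<int> seen;
    auto c = cache.begin();

    // c serves offsets 0..y from cache and stops on y+1.
    while (c != cache.end() && *c <= y) {
        seen.push_back(*c);
        ++c;
    }

    // b's reply for y+1..z arrives and re-pushes that range to the back.
    push_range_to_back(cache, y + 1, z);

    // c keeps walking; it now hits end() right after z.
    while (c != cache.end()) {
        seen.push_back(*c);
        ++c;
    }
    return seen;
}
```

With `simulate_reader_c(10, 3, 6)`, c returns only offsets 0 through 6 and never sees 7, 8 or 9, even though the directory was marked COMPLETE|ORDERED the whole time.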
#1 Updated by Greg Farnum almost 4 years ago
- Priority changed from Normal to High
Some obvious solutions are disqualified, both because we can't really track which directory listings are in progress (via dirps), and in particular because the client might just drop a readdir set or crash before finishing. So the solution needs to depend only on internal state tracking. I'm working on it. So far the winning approach is:
- keep track of the shared_gen when starting an MDS listing from offset 0 (well, 2, I guess)
- when we get a response, if the shared_gen hasn't changed, set an "ordered_thru" to the latest offset
- when satisfying a readdir, reference that ordered_thru instead of the simple COMPLETE and ORDERED flags :/
There are plenty of missing parts to that, but I think the basic scheme should be sound. (It sounds just a little bit like PG backfilling...)
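A rough sketch of that bookkeeping, just to pin down the three steps. Everything here is hypothetical (`DirState`, `MdsListing`, `begin_listing`, `on_reply`, `can_serve_from_cache`, `on_dir_changed` are made-up names, not existing client code), and resetting ordered_thru when the directory changes is my own assumption rather than part of the scheme as written.

```cpp
#include <cstdint>

// Hypothetical per-directory state.
struct DirState {
    uint64_t shared_gen = 0;    // bumped whenever the directory changes
    uint64_t ordered_thru = 0;  // cache is trusted, in order, up to this offset
};

// Hypothetical per-listing state, captured when an MDS listing starts
// from the beginning of the directory.
struct MdsListing {
    uint64_t start_gen;
};

// Step 1: remember shared_gen at the start of the listing.
MdsListing begin_listing(const DirState& dir) {
    return MdsListing{dir.shared_gen};
}

// Step 2: on each MDS reply, advance ordered_thru only if the directory
// has not changed since the listing began.
void on_reply(DirState& dir, const MdsListing& listing, uint64_t last_offset) {
    if (dir.shared_gen == listing.start_gen) {
        dir.ordered_thru = last_offset;
    }
}

// Step 3: a cached readdir consults ordered_thru instead of the
// COMPLETE and ORDERED flags.
bool can_serve_from_cache(const DirState& dir, uint64_t offset) {
    return offset <= dir.ordered_thru;
}

// Assumption, not from the scheme above: a change invalidates the
// ordered prefix as well as bumping the generation.
void on_dir_changed(DirState& dir) {
    ++dir.shared_gen;
    dir.ordered_thru = 0;
}
```

One of the missing parts shows up right away: with two listings in flight, the slower one's reply can set ordered_thru backwards, so the real thing needs more care than a plain assignment.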