Documentation #44503
Document CephFS's behaviour on O_APPEND
Status: open
Description
I have noticed that on my CephFS (13.2.2) file system, mounted via FUSE, many bytes get lost when multiple writers append to the same file simultaneously, each through a file descriptor opened with `O_APPEND` and kept open (the typical logging use case).
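For illustration, here is a minimal sketch of that scenario. It is not the original workload: the function names, the 64-byte record size, and the writer count are made up for the example, and it assumes a POSIX host where `os.fork` is available. Forked writers on one host all go through the same client mount, so reproducing the loss may well require running `append_records` from several clients instead.

```python
import os


def append_records(path, writer_id, count, record_len=64):
    """Open `path` once with O_APPEND, keep the FD open, write `count` fixed-size lines."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for seq in range(count):
            # 13-byte header, padded with 'x' so that every record is exactly record_len bytes.
            header = f"{writer_id:03d}:{seq:08d}:".encode()
            os.write(fd, header.ljust(record_len - 1, b"x") + b"\n")
    finally:
        os.close(fd)


def run_writers(path, writers=4, count=1000, record_len=64):
    """Fork `writers` processes that append to `path` concurrently, then wait for all of them."""
    pids = []
    for writer_id in range(writers):
        pid = os.fork()
        if pid == 0:
            # Child process: write, then exit without returning into the caller's code.
            status = 1
            try:
                append_records(path, writer_id, count, record_len)
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)


def count_lost_bytes(path, writers, count, record_len=64):
    """Return how many bytes are missing compared with the total that was written."""
    return writers * count * record_len - os.path.getsize(path)
```

With strict append semantics, `count_lost_bytes` should return 0 after `run_writers` finishes; a positive value means some appended data was overwritten.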
I have been trying to figure out what the intended, or current, semantics are, but it seems the documentation is insufficient and should be improved.
On https://docs.ceph.com/docs/master/cephfs/posix/ `O_APPEND` is not mentioned. The only thing that sounds tangentially relevant is:
> In shared simultaneous writer situations, a write that crosses object boundaries is not necessarily atomic. This means that you could have writer A write “aa|aa” and writer B write “bb|bb” simultaneously (where | is the object boundary), and end up with “aa|bb” rather than the proper “aa|aa” or “bb|bb”.
However, even that does not quite cover it, because with `O_APPEND` I would expect "aaaabbbb", "aabbaabb", or some other interleaving of these 8 characters, with none of them lost.
Beyond that, I could only find:
- A mailing list post from 2015 (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-September/004280.html) containing the quote "This fix is still racy for multiple writer case. If you want strict append behaviour, please wrap each write with file lock". I do not know whether that is still the case.
- #17564 (maybe related?)
- #2825 (maybe related?)
None of that qualifies as proper documentation.
Regarding the "please wrap each write with file lock" hint, the same documentation page also leaves it unclear how well CephFS supports file locking (see also http://0pointer.de/blog/projects/locking.html for the general problem and the various choices).
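As I read it, the "wrap each write with file lock" advice would look roughly like the sketch below. The helper name `locked_append` is hypothetical, and the choice of `flock` over `fcntl`-style record locks is arbitrary. Whether CephFS honours this lock across clients, and whether the append then reliably lands at the true end of file, is exactly what I would like the documentation to state.

```python
import fcntl
import os


def locked_append(fd, data):
    """Append `data` to `fd` (opened with O_APPEND) under an exclusive advisory lock."""
    fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other holder has the lock
    try:
        view = memoryview(data)
        while len(view) > 0:
            # os.write may write fewer bytes than requested, so loop until done.
            written = os.write(fd, view)
            view = view[written:]
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
```

Every writer would have to go through the same helper, because advisory locks only protect against other processes that also take the lock.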
I think https://docs.ceph.com/docs/master/cephfs/posix/ should be extended to document how users should expect `O_APPEND` to behave. It would be extremely useful.
Thanks!