fallocate implementation on the kernel cephfs client
I remember seeing a comment somewhere (mailing list?) about this but couldn't find any reference to the issue so I decided to open a bug.
The problem: fallocate doesn't seem to be doing what it's supposed to do. I haven't been able to spend time looking at the code to understand the details, but here's a summary of the issue, on a very small test cluster:
node5:~ # df -Th /mnt
Filesystem         Type  Size  Used Avail Use% Mounted on
192.168.122.101:/  ceph   14G  228M   14G   2% /mnt
So, I have ~14G available and fallocate a big file:
node5:/mnt # xfs_io -f -c "falloc 0 1T" hugefile
node5:/mnt # ls -lh
total 1.0T
-rw------- 1 root root 1.0T Oct 4 14:17 hugefile
drwxr-xr-x 2 root root    6 Oct 4 14:17 mydir
I would expect this call to fail. And, as the df output below shows, the available space hasn't changed either:
node5:/mnt # df -Th /mnt
Filesystem         Type  Size  Used Avail Use% Mounted on
192.168.122.101:/  ceph   14G  228M   14G   2% /mnt
Anyway, a successful call to fallocate(2) should mean that "subsequent writes into the range specified by offset and len are guaranteed not to fail because of lack of disk space". Which isn't going to be the case in the above example.
I guess that a fix for this would require a CEPH_MSG_STATFS to the monitors to get the actual free space. But as I said, I haven't spent too much time looking at the problem.
#4 Updated by Luis Henriques about 1 year ago
Patrick Donnelly wrote:
I think the way forward is to only support punch hole in fallocate for kcephfs. Same for ceph-fuse.
This would mean that a patch (for the kernel client) would be pretty easy to put together. I'll send something out to the mailing list soon. Oh, and I guess this patch should be tagged for stable kernels as well.
Update: I've just sent out a patch to drop support for all fallocate(2) operations but FALLOC_FL_PUNCH_HOLE.