Bug #53809 (closed)
cephfs: fsync on small directory takes multiple seconds
Description
Our application needs to make written files durable before acknowledging the writes, so that an acknowledged file cannot be lost.
Thus, it performs `fsync()` on the written file (to make the file contents durable), and `fsync()` on the directory containing the file (to make the dirent durable). (The need for this on local file systems is explained e.g. in this Ceph talk, and I believe it is necessary for Ceph as well.)
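The pattern looks roughly like this (a minimal Python sketch of the file-then-directory fsync sequence; the function name and arguments are illustrative, not our actual code):

```python
import os

def write_durably(dir_path: str, name: str, data: bytes) -> None:
    """Write a file and make both its contents and its dirent durable."""
    path = os.path.join(dir_path, name)
    # 1. Write the file contents and fsync the file itself.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # file data + metadata durable
    finally:
        os.close(fd)
    # 2. fsync the containing directory so the new dirent is durable too.
    dfd = os.open(dir_path, os.O_RDONLY)
    try:
        os.fsync(dfd)  # this is the call that is slow on CephFS
    finally:
        os.close(dfd)
```

Without step 2, a crash after the file fsync can still lose the file, because the directory entry pointing at it may not have reached stable storage.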
Writing a small benchmark, I noticed that on CephFS, `fsync` on a directory is extremely, unreasonably slow for this pattern:
```
# mkdir /mycephfs/niklas-test
# cd /mycephfs/niklas-test
# strace -fy -e fsync -T sh -c 'for i in {1..10}; do touch new-"$i"; sync .; done'
fsync(3</mycephfs/niklas-test>) = 0 <0.635923>
fsync(3</mycephfs/niklas-test>) = 0 <3.557392>
fsync(3</mycephfs/niklas-test>) = 0 <0.001497>
fsync(3</mycephfs/niklas-test>) = 0 <3.085821>
fsync(3</mycephfs/niklas-test>) = 0 <1.879268>
fsync(3</mycephfs/niklas-test>) = 0 <4.998007>
fsync(3</mycephfs/niklas-test>) = 0 <0.004683>
fsync(3</mycephfs/niklas-test>) = 0 <4.975168>
fsync(3</mycephfs/niklas-test>) = 0 <5.029351>
fsync(3</mycephfs/niklas-test>) = 0 <0.001417>
```
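The same measurement can be reproduced without strace; here is a small Python equivalent of the shell loop above (the function name is made up for this sketch; point it at a directory on the CephFS mount to see the latencies, on a local file system the timings are typically sub-millisecond):

```python
import os
import time

def bench_dir_fsync(dir_path: str, n: int = 10) -> list[float]:
    """Create n empty files in dir_path, timing a directory fsync after each."""
    times = []
    for i in range(1, n + 1):
        # equivalent of: touch new-$i
        fd = os.open(os.path.join(dir_path, f"new-{i}"),
                     os.O_WRONLY | os.O_CREAT, 0o644)
        os.close(fd)
        # equivalent of: sync .  (open the directory and fsync it)
        dfd = os.open(dir_path, os.O_RDONLY)
        t0 = time.monotonic()
        os.fsync(dfd)
        times.append(time.monotonic() - t0)
        os.close(dfd)
    return times
```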
The directory fsyncs take up to 5 seconds!
This seems unreasonable: I cannot think of any operation on this directory of fewer than 10 files that could plausibly take 5 seconds.
This CephFS is backed by 3 idle nodes with 10G networking, 0.3ms ping, metadata pool backed by enterprise NVMe SSDs, and data pool backed by spinning disks.
Ceph version is 16.2.7, with a Linux 5.10.81 kernel mount.
I've tried with and without `client cache size = 0` (same results), and the same issue occurs on a single-node CephFS deployment, which should exclude the network as a factor.
While the above loop is running, Ceph reports that almost no IO is going on:
```
# ceph osd pool stats
pool device_health_metrics id 1
  nothing is going on

pool mycephfs_data id 2
  nothing is going on

pool mycephfs_metadata id 3
  client io 2.3 KiB/s wr, 0 op/s rd, 1 op/s wr
```