CephFS pool usage 3x above expected value and sparse journal dumps
I am in the process of copying about 230 million small and medium-sized files to a CephFS and have three active MDSs to keep up with the constant create workload induced by the copy process. Previously, I was struggling heavily with runaway MDS cache growth, which I was able to fix by increasing the cache trim size (see issues #41140 and #41141).
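For reference, the fix amounted to raising the MDS trim threshold along these lines (a sketch; the option name assumes a Nautilus-era MDS, and the value shown is illustrative rather than the exact one from those issues):

$ # let the MDS trim many more cache items per tick under a constant create workload
$ ceph config set mds mds_cache_trim_threshold 524288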
Unfortunately, another problem has emerged after a few days of copying data.
ceph df detail reports:
POOL                 ID   STORED   OBJECTS  USED     %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY    USED COMPR  UNDER COMPR
<...>
cephfs.storage.data  108  44 TiB   176.13M  149 TiB  1.63   5.9 PiB    N/A            N/A          176.13M  0 B         0 B
cephfs.storage.meta  109  174 GiB  16.44M   178 GiB  0      2.9 PiB    N/A            N/A          16.44M   0 B         0 B
44 TiB of stored data looks about right, but 149 TiB of actual pool usage is far beyond anything I would expect. The data pool is an EC pool with k=6, m=3 (i.e., without overhead I would expect 66 TiB of overall allocation; see the quick check below). The metadata pool is also huge at 178 GiB (the raw uncompressed file list in plaintext format is only 23 GiB).
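The expected figure is straightforward arithmetic, assuming no per-object or allocation-unit overhead:

$ # raw usage for 44 TiB stored on a k=6, m=3 EC pool: STORED * (k+m)/k
$ echo '44 * (6 + 3) / 6' | bc
66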
A CephFS journal dump prints the following warnings:
2019-08-12 14:27:56.881 7fd9b5587700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol
journal is 4444529514668~391241069
wrote 391241069 bytes at offset 4444529514668 to /var/lib/ceph/journal.bin.0
NOTE: this is a _sparse_ file; you can
  $ tar cSzf /var/lib/ceph/journal.bin.0.tgz /var/lib/ceph/journal.bin.0
to efficiently compress it while preserving sparseness.

2019-08-12 14:28:09.709 7fd9b4d86700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol
journal is 2887216998866~485120245
wrote 485120245 bytes at offset 2887216998866 to /var/lib/ceph/journal.bin.1
NOTE: this is a _sparse_ file; you can
  $ tar cSzf /var/lib/ceph/journal.bin.1.tgz /var/lib/ceph/journal.bin.1
to efficiently compress it while preserving sparseness.

2019-08-12 14:28:43.241 7fd9b5d88700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol
journal is 2839942271068~2529161124
wrote 2529161124 bytes at offset 2839942271068 to /var/lib/ceph/journal.bin.2
NOTE: this is a _sparse_ file; you can
  $ tar cSzf /var/lib/ceph/journal.bin.2.tgz /var/lib/ceph/journal.bin.2
to efficiently compress it while preserving sparseness.
and then writes three sparse files with apparent sizes of 4.1T, 2.7T, and 2.6T, respectively (actual on-disk sizes: 374M, 463M, and 2.4G).
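For context, the dumps were produced roughly as follows (a sketch; the file system name "storage" is inferred from the pool names, and one export per active rank is assumed):

$ # one journal export per active MDS rank
$ cephfs-journal-tool --rank=storage:0 journal export /var/lib/ceph/journal.bin.0
$ cephfs-journal-tool --rank=storage:1 journal export /var/lib/ceph/journal.bin.1
$ cephfs-journal-tool --rank=storage:2 journal export /var/lib/ceph/journal.bin.2
$ # compare apparent size (ls) with actually allocated size (du)
$ ls -lh /var/lib/ceph/journal.bin.*
$ du -h /var/lib/ceph/journal.bin.*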
A discussion on IRC revealed that at least one other user has been struggling with this issue, which in their case resulted in the total loss of their FS and required a full recovery from the data pool.
#3 Updated by Patrick Donnelly 12 days ago
- Target version set to v15.0.0
- Start date deleted
- Source set to Community (user)
- ceph-qa-suite deleted
- Component(FS) deleted
Each file will have an object in the default data pool (the data pool used at file system creation time) with an extended attribute (xattr) storing the backtrace information. This xattr is not erasure coded (it's replicated) and may be the cause of your significant pool usage if you're dealing with so many small files. The overhead of replicated xattrs does not really correspond (I think) to what you're seeing though.
I'm sorry you're stumbling across this. I'll get back to you with what the Ceph team thinks.
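For anyone who wants to see this for themselves, the backtrace lives in the "parent" xattr of a file's first object in the default data pool. One way to inspect it (a sketch; the object name is the file's inode number in hex plus a ".00000000" chunk suffix, and the inode shown here is purely hypothetical):

$ # list xattrs on a file's first object (inode 0x10000000001 is made up)
$ rados -p cephfs.storage.data listxattr 10000000001.00000000
$ # dump the replicated backtrace stored in the "parent" xattr
$ rados -p cephfs.storage.data getxattr 10000000001.00000000 parent > parent.bin
$ ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json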
#4 Updated by Janek Bevendorff 11 days ago
I tried again, this time with a replicated pool and just one MDS. I think it's too early to draw definitive conclusions, but I noticed that as soon as I added additional MDS ranks, the metadata pool size exploded from a couple of hundred MB to 5 GB (with large fluctuations in both directions). When I went back to a single MDS, the size dropped back to 250-300 MB.
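For completeness, the rank changes were plain max_mds toggles (file system name "storage" assumed, as above):

$ ceph fs set storage max_mds 3   # metadata pool balloons to ~5 GB
$ ceph fs set storage max_mds 1   # pool settles back to 250-300 MB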