Bug #41204

CephFS pool usage 3x above expected value and sparse journal dumps

Added by Janek Bevendorff 12 days ago. Updated 11 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Performance/Resource Usage
Target version:
Start date:
Due date:
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:

Description

I am in the process of copying about 230 million small and medium-sized files to a CephFS, and I have three active MDSs to keep up with the constant create workload induced by the copy process. Previously, I was struggling heavily with runaway MDS cache growth, which I was able to fix by increasing the cache trim size (see issues #41140 and #41141).
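
For context, the tuning referred to above is presumably the MDS cache trim threshold; a minimal sketch of that kind of adjustment, with a purely illustrative value rather than the one actually used:

    # Illustrative only: raise the trim threshold so the MDS trims its cache
    # more aggressively under a sustained create workload.
    ceph config set mds mds_cache_trim_threshold 524288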

Unfortunately, another problem has emerged after a few days of copying data. ceph df detail reports:

    POOL                             ID      STORED      OBJECTS     USED        %USED     MAX AVAIL     QUOTA OBJECTS     QUOTA BYTES     DIRTY       USED COMPR     UNDER COMPR
    <...>
    cephfs.storage.data              108      44 TiB     176.13M     149 TiB      1.63       5.9 PiB     N/A               N/A             176.13M            0 B             0 B 
    cephfs.storage.meta              109     174 GiB      16.44M     178 GiB         0       2.9 PiB     N/A               N/A              16.44M            0 B             0 B

44 TiB of stored data looks about right, but 149 TiB of actual pool usage is far beyond anything I would expect. The data pool is an EC pool with k=6, m=3, so from EC overhead alone I would expect about 66 TiB of overall allocation. The metadata pool is also huge at 178 GiB (for comparison, the raw uncompressed file list in plaintext is 23 GiB).
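
For reference, the expected raw usage can be cross-checked against the pool's EC profile; a quick sketch (substitute the profile name the first command returns):

    # Confirm k and m for the data pool
    ceph osd pool get cephfs.storage.data erasure_code_profile
    ceph osd erasure-code-profile get <profile-name-from-above>
    # Expected raw allocation from EC alone: stored * (k+m)/k = 44 TiB * 9/6 = 66 TiB,
    # i.e. well below the 149 TiB reported as USED.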

A CephFS journal dump prints the following warnings:

2019-08-12 14:27:56.881 7fd9b5587700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol
journal is 4444529514668~391241069
wrote 391241069 bytes at offset 4444529514668 to /var/lib/ceph/journal.bin.0
NOTE: this is a _sparse_ file; you can
        $ tar cSzf /var/lib/ceph/journal.bin.0.tgz /var/lib/ceph/journal.bin.0
      to efficiently compress it while preserving sparseness.
2019-08-12 14:28:09.709 7fd9b4d86700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol
journal is 2887216998866~485120245
wrote 485120245 bytes at offset 2887216998866 to /var/lib/ceph/journal.bin.1
NOTE: this is a _sparse_ file; you can
        $ tar cSzf /var/lib/ceph/journal.bin.1.tgz /var/lib/ceph/journal.bin.1
      to efficiently compress it while preserving sparseness.
2019-08-12 14:28:43.241 7fd9b5d88700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol
journal is 2839942271068~2529161124
wrote 2529161124 bytes at offset 2839942271068 to /var/lib/ceph/journal.bin.2
NOTE: this is a _sparse_ file; you can
        $ tar cSzf /var/lib/ceph/journal.bin.2.tgz /var/lib/ceph/journal.bin.2
      to efficiently compress it while preserving sparseness.

and then creates three sparse files with apparent sizes of 4.1T, 2.7T, and 2.6T, respectively (actual on-disk sizes: 374M, 463M, and 2.4G).
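
For reference, dumps like these are typically produced with cephfs-journal-tool (the file system name "storage" below is only inferred from the pool names), and the sizes line up with the offset~length pairs in the output: each sparse file's apparent size corresponds to the journal's start offset, while the allocated size matches the bytes actually written. A quick check with coreutils numfmt:

    # Presumed export command, one per active MDS rank; the fs name is an assumption
    cephfs-journal-tool --rank=storage:0 journal export /var/lib/ceph/journal.bin.0

    # offset -> apparent sparse size, length -> data actually dumped (rank 0 shown)
    numfmt --to=iec 4444529514668    # ~4.1T
    numfmt --to=iec 391241069        # ~374M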

A discussion on IRC revealed that at least one other user has been struggling with this issue, which in their case resulted in a total loss of their file system and required a full recovery from the data pool.

History

#2 Updated by super xor 12 days ago

I had the same issue before our MDSs and mons died.
The journal dump produced two files a few TB in size, and the metadata pool was about 140 GiB.

#3 Updated by Patrick Donnelly 12 days ago

  • Target version set to v15.0.0
  • Start date deleted (08/12/2019)
  • Source set to Community (user)
  • ceph-qa-suite deleted (fs)
  • Component(FS) deleted (MDS, libcephfs)

Each file will have an object in the default data pool (the data pool used at file system creation time) with an extended attribute (xattr) storing the backtrace information. This xattr is not erasure coded (it's replicated) and may be the cause of your significant pool usage if you're dealing with that many small files. That said, I don't think the overhead of replicated xattrs fully accounts for what you're seeing.

See also: https://docs.ceph.com/docs/master/cephfs/createfs/#creating-pools
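
If it helps to verify this, a rough sketch of how one could peek at the backtrace xattr on a data-pool object (the object name below is only a placeholder; the pool name is taken from the ceph df output above):

    # List a few object names from the default data pool
    rados -p cephfs.storage.data ls | head -n 3
    # CephFS stores the backtrace in the "parent" xattr of each file's first object
    rados -p cephfs.storage.data listxattr 10000000000.00000000
    rados -p cephfs.storage.data getxattr 10000000000.00000000 parent > /tmp/parent.bin
    ceph-dencoder type inode_backtrace_t import /tmp/parent.bin decode dump_json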

I'm sorry you're stumbling across this. I'll get back to you with what the Ceph team thinks.

#4 Updated by Janek Bevendorff 11 days ago

I tried again, this time with a replicated pool and just one MDS. I think it's too early to draw definitive conclusions, but I noticed that as soon as I tried adding additional MDS ranks, the metadata pool size exploded from a couple of hundred MB to 5 GB (with large fluctuations in both directions). When I went back to a single MDS, the size dropped to 250-300 MB.
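
For anyone trying to reproduce this, a sketch of the steps involved (the file system name "storage" is again only assumed from the pool names):

    ceph fs status                   # note the current number of active ranks
    ceph fs set storage max_mds 3    # add MDS ranks, then watch the metadata pool grow
    watch -n 10 'ceph df detail'     # observe cephfs.storage.meta usage
    ceph fs set storage max_mds 1    # drop back to a single active rank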
