Bug #41204

CephFS pool usage 3x above expected value and sparse journal dumps

Added by Janek Bevendorff 3 months ago. Updated 3 months ago.

Status: New
Priority: Normal
Assignee: -
Category: Performance/Resource Usage
Target version:
Start date:
Due date:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature:

Description

I am in the process of copying about 230 million small and medium-sized files to a CephFS, and I have three active MDSs to keep up with the constant create workload induced by the copy process. Previously, I was struggling heavily with runaway MDS cache growth, which was fixable by increasing the cache trim size (see issues #41140 and #41141).

Unfortunately, another problem has emerged after a few days of copying data. ceph df detail reports:

    POOL                             ID      STORED      OBJECTS     USED        %USED     MAX AVAIL     QUOTA OBJECTS     QUOTA BYTES     DIRTY       USED COMPR     UNDER COMPR
    <...>
    cephfs.storage.data              108      44 TiB     176.13M     149 TiB      1.63       5.9 PiB     N/A               N/A             176.13M            0 B             0 B 
    cephfs.storage.meta              109     174 GiB      16.44M     178 GiB         0       2.9 PiB     N/A               N/A              16.44M            0 B             0 B

44 TiB of stored data looks about right, but 149 TiB actual pool usage is way beyond anything I would expect. The data pool is an EC pool with k=6, m=3 (i.e. without overhead, I would expect 66 TiB overall allocation). The metadata pool is also huge with 178 GiB (the raw uncompressed file list in plaintext format is 23 GiB).
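To make the expectation above explicit, here is a minimal sketch of the arithmetic. It assumes that raw usage of an erasure-coded pool is simply the stored size times (k + m) / k, with no allocation or metadata overhead. The function name is only illustrative.

```python
def expected_raw_usage(stored_tib: float, k: int, m: int) -> float:
    """Raw usage of an erasure-coded pool, ignoring any allocation overhead."""
    return stored_tib * (k + m) / k


# Values in TiB, taken from the `ceph df detail` output above.
stored, used = 44.0, 149.0
expected = expected_raw_usage(stored, k=6, m=3)

print(f"expected raw usage: {expected:.0f} TiB")
print(f"observed / stored:   {used / stored:.2f}x")
print(f"observed / expected: {used / expected:.2f}x")
```

With these numbers the pool uses roughly 3.4 times the stored size, or about 2.3 times what the EC profile alone would explain.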

A CephFS journal dump prints the following warnings:

    2019-08-12 14:27:56.881 7fd9b5587700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol
    journal is 4444529514668~391241069
    wrote 391241069 bytes at offset 4444529514668 to /var/lib/ceph/journal.bin.0
    NOTE: this is a _sparse_ file; you can
            $ tar cSzf /var/lib/ceph/journal.bin.0.tgz /var/lib/ceph/journal.bin.0
          to efficiently compress it while preserving sparseness.
    2019-08-12 14:28:09.709 7fd9b4d86700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol
    journal is 2887216998866~485120245
    wrote 485120245 bytes at offset 2887216998866 to /var/lib/ceph/journal.bin.1
    NOTE: this is a _sparse_ file; you can
            $ tar cSzf /var/lib/ceph/journal.bin.1.tgz /var/lib/ceph/journal.bin.1
          to efficiently compress it while preserving sparseness.
    2019-08-12 14:28:43.241 7fd9b5d88700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol
    journal is 2839942271068~2529161124
    wrote 2529161124 bytes at offset 2839942271068 to /var/lib/ceph/journal.bin.2
    NOTE: this is a _sparse_ file; you can
            $ tar cSzf /var/lib/ceph/journal.bin.2.tgz /var/lib/ceph/journal.bin.2
          to efficiently compress it while preserving sparseness.

and then creates three sparse files of 4.1T, 2.7T, and 2.6T, respectively (actual sizes: 374M, 463M, 2.4G).
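The apparent and actual sizes follow directly from the `journal is <offset>~<length>` lines, since the dump writes `length` bytes at `offset` in the output file. The sketch below parses those lines and derives both numbers. The helper name is made up for illustration, and it assumes the line format is exactly as shown in the output above.

```python
import re


def journal_extent(line: str) -> tuple[int, int]:
    """Parse 'journal is <offset>~<length>' into (offset, length) in bytes."""
    match = re.fullmatch(r"journal is (\d+)~(\d+)", line.strip())
    if match is None:
        raise ValueError(f"unexpected line: {line!r}")
    return int(match.group(1)), int(match.group(2))


# The three extents reported in the dump output above (ranks 0, 1, 2).
lines = [
    "journal is 4444529514668~391241069",
    "journal is 2887216998866~485120245",
    "journal is 2839942271068~2529161124",
]

for rank, line in enumerate(lines):
    offset, length = journal_extent(line)
    apparent = offset + length  # file size as listed, including the hole
    print(f"rank {rank}: apparent {apparent / 2**40:.2f} TiB, "
          f"data written {length / 2**20:.0f} MiB")
```

The apparent sizes are dominated by the journal offsets, so the files are large only because the dump preserves the journal's position; the data actually written is a few hundred MiB to a few GiB per rank.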

A discussion on IRC revealed that at least one other user has been struggling with this issue, which in their case resulted in a total loss of their FS requiring a full recovery from the data pool.

History

#2 Updated by super xor 3 months ago

I had the same issue before our MDS and Mons died.
The journal was producing two files of a few TB each, and the metadata pool was about 140 GiB.

#3 Updated by Patrick Donnelly 3 months ago

  • Target version set to v15.0.0
  • Start date deleted (08/12/2019)
  • Source set to Community (user)
  • ceph-qa-suite deleted (fs)
  • Component(FS) deleted (MDS, libcephfs)

Each file will have an object in the default data pool (the data pool used at file system creation time) with an extended attribute (xattr) storing the backtrace information. This xattr is not erasure coded (it's replicated) and may be the cause of your significant pool usage if you're dealing with so many small files. The overhead of replicated xattrs does not really correspond (I think) to what you're seeing though.

See also: https://docs.ceph.com/docs/master/cephfs/createfs/#creating-pools

I'm sorry you're stumbling across this. I'll get back to you with what the Ceph team thinks.

#4 Updated by Janek Bevendorff 3 months ago

I tried again, this time with a replicated pool and just one MDS. I think it's too early to draw definitive conclusions, but I noticed that as soon as I tried adding additional MDS ranks, the metadata pool size exploded from a couple of hundred MB to 5GB (with large fluctuation in both directions). When I reset to a single MDS, the size reduced back to 250-300MB.

#5 Updated by Janek Bevendorff 3 months ago

Little status update: our data pool now uses up 186TiB while only storing 53TiB of actual data with a replication factor of 3. That's quite a significant overhead of 27TiB. The metadata pool is 404GiB, which also appears massive to me. Meanwhile, the MDS caps at around 100 ops/s, most likely as a result of the large metadata pool size (it was several thousand in the beginning).

#6 Updated by Igor Fedotov 3 months ago

Janek Bevendorff wrote:

Little status update: our data pool now uses up 186TiB while only storing 53TiB of actual data with a replication factor of 3. That's quite a significant overhead of 27TiB. The metadata pool is 404GiB, which also appears massive to me. Meanwhile, the MDS caps at around 100 ops/s, most likely as a result of the large metadata pool size (it was several thousand in the beginning).

Given the mention of "small-sized files", I suspect the wasted space is caused by the bluestore allocation granularity.
For spinning drives the default allocation size is 64K, which means that each file/object takes up at least 64K of space.
So having tons of 4K files might cause massive space waste.
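As a rough illustration of this effect, the sketch below rounds an object's size up to a whole number of allocation units. The 64K default is the value mentioned above. The function is a simplification for illustration, not bluestore's actual allocator, and it looks at a single copy only, before replication or erasure coding.

```python
def allocated_bytes(object_size: int, alloc_unit: int = 64 * 1024) -> int:
    """Round an object's size up to a whole number of allocation units."""
    if object_size <= 0:
        return 0
    units = -(-object_size // alloc_unit)  # ceiling division
    return units * alloc_unit


for size in (4 * 1024, 100 * 1024, 10 * 1024 * 1024):
    on_disk = allocated_bytes(size)
    print(f"{size:>10} B stored -> {on_disk:>10} B allocated "
          f"({on_disk / size:.2f}x)")
```

A 4K object occupies a full 64K unit, while large objects see almost no relative overhead. On an erasure-coded pool the rounding would presumably apply to each chunk separately, which is an assumption here and not something confirmed in this thread.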

So, some questions to clarify whether this is the case:
1) Is this bluestore?
2) What are the main disk drives behind it, SSDs or spinners?
3) Do you have any idea of the size distribution of these "small-sized" files? E.g. something like this:
10% - less than 1K
20% - less than 4K
etc...
4) Can you share performance counter dumps for 2-3 OSDs backing the cephfs.storage.data pool?
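For question 3, a size distribution could be gathered with something like the following sketch. The bucket boundaries and function names are arbitrary choices for illustration. It counts regular files per size range rather than cumulative percentages, and it walks the source tree with plain `os.walk`, which may be slow on a tree with hundreds of millions of files.

```python
import os
import stat
from collections import Counter

# Hypothetical bucket boundaries: (exclusive upper bound in bytes, label).
BUCKETS = [
    (1 << 10, "< 1K"),
    (4 << 10, "1K-4K"),
    (64 << 10, "4K-64K"),
    (1 << 20, "64K-1M"),
    (10 << 20, "1M-10M"),
]


def bucket_for(size: int) -> str:
    """Return the label of the first bucket whose upper bound exceeds size."""
    for limit, label in BUCKETS:
        if size < limit:
            return label
    return ">= 10M"


def size_distribution(root: str) -> Counter:
    """Count regular files under root per size bucket."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                info = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode):
                counts[bucket_for(info.st_size)] += 1
    return counts
```

Running `size_distribution()` on a representative subtree and dividing each count by the total would give percentages in the form asked for above.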

#7 Updated by Janek Bevendorff 3 months ago

It's Bluestore on spinning disks. I don't really have an overview of the data distribution, it's very uneven. Perhaps a third of the total size comes from files of a few hundred MB up to a few GB. And then we have millions of smaller files, but I doubt that we have too many of 10k or below. I would assume that most are between 100k and 10M. That's all just a ballpark guess, though. I might be totally wrong about this.

I created two dumps for you:

schema.697: https://pastebin.com/FjBzLG4A
dump.697: https://pastebin.com/HBgX62tP

schema.1061: https://pastebin.com/TnuT3QrK
dump.1061: https://pastebin.com/FaHS3NMr
