Feature #56949

Feature request: add ceph fs vattrib for (recursive) accounting of bytes_allocated

Added by Frank Schilder over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Add new virtual extended attributes for storage usage accounting:

- ceph.file.abytes # bytes allocated in pool
- ceph.dir.abytes # bytes allocated in pool
- ceph.dir.rabytes # recursive allocation in pool

The abytes attribute should report the allocated (rather than stored) bytes for a file/directory, and rabytes the recursive allocation of all files and directories below a directory. The implementation could proceed step by step, starting with simple approximations and improving in accuracy over time.
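For illustration, a minimal sketch of how such an attribute would be read, using the same getxattr interface that existing CephFS vxattrs like ceph.dir.rbytes use today; ceph.dir.rabytes below is only the name proposed here and does not exist yet:

#!/usr/bin/env python3
# Sketch: read the proposed allocation vxattr the same way existing CephFS
# vxattrs (e.g. ceph.dir.rbytes) are read. ceph.dir.rabytes is merely the
# name proposed in this ticket and is not implemented yet.
import os
import sys

def read_vxattr(path, name):
    try:
        return int(os.getxattr(path, name).decode())
    except OSError:
        return None  # attribute not present / not implemented

path = sys.argv[1]
for attr in ("ceph.dir.rbytes",    # recursive stored bytes, available today
             "ceph.dir.rabytes"):  # proposed: recursive bytes allocated in pool
    val = read_vxattr(path, attr)
    print(f"{attr}: {val if val is not None else 'not available'}")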

Note that allocation in the pool excludes the replication factor. To obtain the bytes allocated on disk (raw), one would multiply the bytes allocated in the pool by the replication factor of the pool. This seems to be the only meaningful and least confusing way to handle files/subdirectories placed on pools with different replication factors.
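A small worked sketch of that convention, assuming an 8+3 EC pool (redundancy overhead 11/8) and a 3-way replicated pool; the helper function is made up for illustration:

# Sketch of the convention described above: abytes/rabytes would report the
# allocation in the pool (before redundancy); multiplying by the pool's
# redundancy overhead gives the raw on-disk allocation. Hypothetical helper.
def raw_bytes(allocated_in_pool, k=None, m=None, replicas=None):
    """Replicated pool: pass replicas. EC pool: pass k (data) and m (coding) chunks."""
    factor = replicas if replicas is not None else (k + m) / k
    return allocated_in_pool * factor

print(raw_bytes(1 << 30, k=8, m=3))    # 8+3 EC: 1 GiB in pool -> 1.375 GiB raw
print(raw_bytes(1 << 30, replicas=3))  # 3x replication: 1 GiB in pool -> 3 GiB raw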

Why is this interesting?

We report storage usage per user, and this report should be based on the allocation of users' data in a pool, not on the bytes stored. It would also make meaningful invoicing based on actual allocation possible in commercial settings.

For example, on our 8+3 EC data pool with bluestore_min_alloc_size_hdd=64K, space is allocated in chunks of 512K (usable capacity) per file. This is quite a large allocation unit, and if many small files exist, it can lead to extreme allocation amplification.
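The arithmetic behind the 512K figure, as a rough sketch: it rounds each of the k=8 data shards up to bluestore_min_alloc_size_hdd and ignores striping details and the coding chunks.

import math

MIN_ALLOC = 64 * 1024   # bluestore_min_alloc_size_hdd = 64K
K = 8                   # data chunks of the 8+3 EC profile

def allocated_usable(file_bytes):
    # Each of the K data shards rounds up to MIN_ALLOC, so usable capacity
    # is allocated in steps of K * MIN_ALLOC = 512K per file (coarse model).
    per_shard = math.ceil(file_bytes / K / MIN_ALLOC) * MIN_ALLOC
    return per_shard * K

for size in (1 * 1024, 100 * 1024, 4 * 1024 * 1024):
    alloc = allocated_usable(size)
    print(f"{size:>8} B stored -> {alloc:>8} B allocated ({alloc / size:.0f}x)")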

A use case for this is HPC frameworks like OpenFOAM, which create hundreds of millions of small files. It would really help to account for bytes_allocated, either to motivate users to tar these directories after the compute run has completed, or to charge them for the extra hardware they require instead of letting them eat up everyone else's capacity. To show the discrepancy with a specific example, on an Octopus test system with a 4+2 EC pool and bluestore_min_alloc_size_hdd=64K, df reports:

10.41.24.13,10.41.24.14,10.41.24.15:/      2.5T  173G  2.3T   7% /mnt/adm/cephfs
10.41.24.13,10.41.24.14,10.41.24.15:/data  2.0T   37G  2.0T   2% /mnt/cephfs

The folder "data" contains all of the data. The difference is that a quota is set on "data", while there is none on "/". On "/", df simply reports the same as "ceph df", while on "/data" it reports "ceph.dir.rbytes". We observe here a storage allocation amplification by a factor of about 5. The contents of "data" were produced by creating sub-directories, each populated by copying a large ISO and untarring an anaconda2 installation into it:

# ls -lh /mnt/ram/
total 4.9G
-rw-r--r--. 1 root root 1.5G Jul 12 13:16 anaconda2.tgz
-rw-r--r--. 1 root root 3.5G Jul 26 10:58 ubuntu-22.04-desktop-amd64.iso

The anaconda2 package creates a large number of small files and hard links. In this test, we have 6 copies of the ISO and 5 installations of anaconda2. Subtracting the 6*3.5G of ISO copies from the 173G allocated, we find that each anaconda2 install allocates about 30G while containing only about 3.1G of actual data:

# getfattr -n ceph.dir.rbytes /mnt/cephfs/blobs/2/anaconda2/
getfattr: Removing leading '/' from absolute path names
# file: mnt/cephfs/blobs/2/anaconda2/
ceph.dir.rbytes="3388381665" 

This is an allocation amplification by a factor of 10! It would be great to have a virtual extended attribute that one could use to extract the 30G actually allocated. This would also be the correct attribute for df to use when reporting used and %used.
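For completeness, the estimate above as a short sketch, using the numbers reported by df and getfattr (df -h reports GiB):

# Reproduce the estimate above from the df and getfattr numbers.
GiB = 1024 ** 3
total_allocated = 173 * GiB          # "Used" on / as reported by df (matches ceph df)
iso_copies      = 6 * 3.5 * GiB      # six copies of the ~3.5G ISO
installs        = 5                  # anaconda2 installations

per_install_allocated = (total_allocated - iso_copies) / installs
per_install_stored    = 3388381665   # ceph.dir.rbytes of one anaconda2 directory

print(f"allocated per install: ~{per_install_allocated / GiB:.0f} GiB")               # ~30
print(f"stored per install:    ~{per_install_stored / GiB:.2f} GiB")                  # ~3.16
print(f"amplification:         ~{per_install_allocated / per_install_stored:.0f}x")   # ~10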

Alternatively, an implementation of tail merging would solve the issue more elegantly by avoiding this excessive over-allocation in the first place.
