Bug #65199: autoscaler: Scale PGs based on number of objects

Added by Niklas Hambuechen about 1 month ago. Updated about 1 month ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ceph's autoscaler scales PGs based on bytes stored. It seemingly ignores the number of objects. This creates problems for pools with many small files.

It creates even more problems for pools with an apparent byte size of 0, but millions of objects; such pools get created when following CephFS-on-EC best practices in the docs.

The Red Hat docs describe BIAS as follows:

https://access.redhat.com/documentation/de-de/red_hat_ceph_storage/4/html/storage_strategies_guide/placement_groups_pgs#viewing-placement-group-scaling-recommendations

BIAS, is a pool property that is used by the PG autoscaler to scale some pools faster than others, in terms of number of PGs. It is essentially a multiplier used to give more PG to a pool than the default number of PGs. This property is particularly used for metadata pools which might be small in size but have large number of objects, so scaling them faster is important for better performance.

(Note these docs are better than the upstream Ceph docs on BIAS, which are much shorter: https://docs.ceph.com/en/reef/rados/operations/placement-groups/)

So this confirms that BIAS (pg_autoscale_bias) can be used to partially address the "many small objects" problem, using a constant factor.
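
BIAS is set per pool with `ceph osd pool set <pool> pg_autoscale_bias <value>`. Roughly sketched, the bias is just a multiplier on a byte-based share of the cluster's PG budget (this is a simplification for illustration, not the actual pg_autoscaler code, and all numbers below are made up):

# Simplified sketch of a byte-based PG target with a bias multiplier.
# NOT the real pg_autoscaler code; numbers are purely illustrative.

def byte_based_pg_target(pool_bytes, total_bytes, cluster_pg_target, bias=1.0):
    """Pool's share of the cluster PG budget, scaled by its byte usage and bias."""
    capacity_ratio = pool_bytes / total_bytes
    return capacity_ratio * bias * cluster_pg_target

cluster_pg_target = 100 * 48       # e.g. 100 PGs per OSD on 48 OSDs (made up)
total_bytes = 124 * 2**40          # ~124 TiB stored cluster-wide (made up)

print(byte_based_pg_target(63 * 2**30, total_bytes, cluster_pg_target, bias=4.0))
# small metadata-like pool: the bias pushes its target up by a factor of 4

print(byte_based_pg_target(0, total_bytes, cluster_pg_target, bias=4.0))
# 0-byte pool: 0 times any bias is still 0, so the multiplier has no effect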

But the constant factor stops working when the objects are 0-sized.

This happens when following CephFS best practices: https://docs.ceph.com/en/reef/cephfs/createfs/#creating-pools

The data pool used to create the file system is the “default” data pool and the location for storing all inode backtrace information, which is used for hard link management and disaster recovery. For this reason, all CephFS inodes have at least one object in the default data pool.
If erasure-coded pools are planned for file system data, it is best to configure the default as a replicated pool to improve small-object write and read performance when updating backtraces. Separately, another erasure-coded data pool can be added (see also Erasure code) that can be used on an entire hierarchy of directories and files (see also File layouts).

If you do what is described here ("default" pool on replicated, directory File Layout on EC), you end up with pools like this in `ceph df`:

POOL      ID  PGS   STORED  OBJECTS     USED   %USED  MAX AVAIL
.mgr       1    1  203 MiB       26  609 MiB   90.00      5 GiB
data       2   32      0 B  112.23M      0 B       0     61 TiB
data_ec    3  168  124 TiB  115.30M  186 TiB   50.53    121 TiB
metadata   4  128   63 GiB   32.87k  189 GiB   90.00      5 GiB

Note how the `data` pool that stores the inodes has 112 M objects but 0 Bytes stored. Apparently the inode backtrace objects carry no data payload (the backtraces live in object metadata), so they do not count towards STORED.

Because the pool's stored byte size is 0, the autoscaler assigns no more than 32 PGs.

This means that there are ~3.5 M objects per PG. If the objects are on an HDD that can do ~100 seeks per second, scrubbing, recovery, or balancing (each of which needs to seek to every object in a PG) will take roughly 10 hours per PG. And this does not even take EC overhead factors into account.

If there were 1 B objects, handling a single PG would take close to 90 hours.
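
For reference, the back-of-the-envelope arithmetic behind these numbers (one seek per object at ~100 seeks/s is an assumption, not a measurement):

# Back-of-the-envelope seek-time estimate for one PG, assuming one disk seek
# per object and ~100 random seeks per second on an HDD.

def hours_per_pg(total_objects, pg_num, seeks_per_second=100):
    objects_per_pg = total_objects / pg_num
    seconds = objects_per_pg / seeks_per_second
    return seconds / 3600

print(f"{hours_per_pg(112_230_000, 32):.1f} h")    # ~9.7 h for the data pool above
print(f"{hours_per_pg(1_000_000_000, 32):.1f} h")  # ~86.8 h with 1 B objects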

There seems to be nothing in Ceph that scales PGs based on the number of objects. This issue requests that such scaling be added.

This would:

  • Ensure that the CephFS EC recommendations actually make sense and do not produce operational problems.
  • Improve Ceph's default behaviour for many small files/objects, without the user having to set BIAS manually.
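
One possible shape for such a heuristic, purely as an illustrative sketch (nothing below exists in Ceph today; the function name, the objects-per-PG target, and the max() combination rule are all invented for illustration):

# Illustrative sketch only: combine the existing byte-based target with an
# object-based target, so that a 0-byte pool with many objects still gets PGs.
# None of these names or thresholds exist in Ceph.

def object_aware_pg_target(pool_bytes, pool_objects,
                           total_bytes, cluster_pg_target,
                           target_objects_per_pg=200_000):
    byte_target = (pool_bytes / total_bytes) * cluster_pg_target if total_bytes else 0
    object_target = pool_objects / target_objects_per_pg
    # Take whichever dimension demands more PGs.
    return max(byte_target, object_target)

# The 0-byte "data" pool from the ceph df output above:
print(object_aware_pg_target(0, 112_230_000, 124 * 2**40, 4800))
# -> 561.15: the object count alone now drives the target, instead of the byte-based 0.

A real implementation would presumably round to a power of two and respect the per-OSD PG limits, but the point is that the object count would enter the target at all.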
#1 Updated by Niklas Hambuechen about 1 month ago

Aside:

It also feels wrong that the inode information shows up as "0 B". It looks as if storage has gone missing inexplicably.

I can only estimate the actual size of the "default" `data` pool with `ceph df` by computing `(AVAIL of all SSDs - metadata pool STORED - metadata pool MAX AVAIL) / (replication size)`.
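
Spelled out with placeholder numbers (all values below are hypothetical; only the formula itself matters):

# The estimate above, written out. All values are hypothetical placeholders
# except the metadata pool figures, which come from the ceph df table above.

ssd_avail_total    = 2 * 2**40   # AVAIL summed over all SSD OSDs (hypothetical)
metadata_stored    = 63 * 2**30  # metadata pool STORED
metadata_max_avail = 5 * 2**30   # metadata pool MAX AVAIL
replication_size   = 3           # hypothetical replication factor

estimated_data_pool_bytes = (
    ssd_avail_total - metadata_stored - metadata_max_avail
) / replication_size

print(f"{estimated_data_pool_bytes / 2**40:.2f} TiB (rough estimate)")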

Is there a way Ceph can directly show the space used by the inode backtrace information?
