
Bug #54136

pg_autoscaler counting pools uncompressed bytes as total_bytes triggering false POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warnings

Added by Christian Rohmann about 2 years ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Category: pg_autoscaler module
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: reef, quincy, pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I am debugging a mgr pg_autoscaler WARN which states that a target_size_bytes set on a pool would overcommit the available storage.
There is only one pool with a value for target_size_bytes (=5T) defined, and that apparently would consume more than the available storage:

# ceph health detail
HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes
[WRN] POOL_TARGET_SIZE_BYTES_OVERCOMMITTED: 1 subtrees have overcommitted pool target_size_bytes
    Pools ['backups', 'images', 'device_health_metrics', '.rgw.root', 'redacted.rgw.control', 'redacted.rgw.meta', 'redacted.rgw.log', 'redacted.rgw.otp', 'redacted.rgw.buckets.index', 'redacted.rgw.buckets.data', 'redacted.rgw.buckets.non-ec'] overcommit available storage by 1.011x due to target_size_bytes 15.0T on pools ['redacted.rgw.buckets.data'].

But looking at the actual usage, it seems strange that 15T (5T * 3 replicas) should not fit into the remaining 122 TiB AVAIL:

# ceph df detail
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    293 TiB  122 TiB  171 TiB   171 TiB      58.44
TOTAL  293 TiB  122 TiB  171 TiB   171 TiB      58.44

--- POOLS ---
POOL                             ID  PGS   STORED   (DATA)   (OMAP)   OBJECTS  USED     (DATA)   (OMAP)   %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
backups                           1  1024   92 TiB   92 TiB  3.8 MiB   28.11M  156 TiB  156 TiB   11 MiB  64.77     28 TiB  N/A            N/A            N/A      39 TiB      123 TiB
images                            2    64  1.7 TiB  1.7 TiB  249 KiB  471.72k  5.2 TiB  5.2 TiB  748 KiB   5.81     28 TiB  N/A            N/A            N/A         0 B          0 B
device_health_metrics            19     1   82 MiB      0 B   82 MiB       43  245 MiB      0 B  245 MiB      0     28 TiB  N/A            N/A            N/A         0 B          0 B
.rgw.root                        21    32   23 KiB   23 KiB      0 B       25  4.1 MiB  4.1 MiB      0 B      0     28 TiB  N/A            N/A            N/A         0 B          0 B
redacted.rgw.control             22    32      0 B      0 B      0 B        8      0 B      0 B      0 B      0     28 TiB  N/A            N/A            N/A         0 B          0 B
redacted.rgw.meta                23    32  1.7 MiB  394 KiB  1.3 MiB    1.38k  237 MiB  233 MiB  3.9 MiB      0     28 TiB  N/A            N/A            N/A         0 B          0 B
redacted.rgw.log                 24    32   53 MiB  500 KiB   53 MiB    7.60k  204 MiB   47 MiB  158 MiB      0     28 TiB  N/A            N/A            N/A         0 B          0 B
redacted.rgw.otp                 25    32  5.2 KiB      0 B  5.2 KiB        0   16 KiB      0 B   16 KiB      0     28 TiB  N/A            N/A            N/A         0 B          0 B
redacted.rgw.buckets.index       26    32  1.2 GiB      0 B  1.2 GiB    7.46k  3.5 GiB      0 B  3.5 GiB      0     28 TiB  N/A            N/A            N/A         0 B          0 B
redacted.rgw.buckets.data        27   128  3.1 TiB  3.1 TiB      0 B    3.53M  9.5 TiB  9.5 TiB      0 B  10.11     28 TiB  N/A            N/A            N/A         0 B          0 B
redacted.rgw.buckets.non-ec      28    32      0 B      0 B      0 B        0      0 B      0 B      0 B      0     28 TiB  N/A            N/A            N/A         0 B          0 B

I then looked at how those values are determined at https://github.com/ceph/ceph/blob/9f723519257eca039126a20aa6a2a7d2dbfb5dba/src/pybind/mgr/pg_autoscaler/module.py#L509.
Apparently "total_bytes" is compared with the capacity of the root_map. I added a debug line and found that the total in my cluster was already at:

total=325511007759696

so in excess of 300 TiB. Looking at "ceph df" again, this usage seems strange.

Looking at how this total is calculated at https://github.com/ceph/ceph/blob/9f723519257eca039126a20aa6a2a7d2dbfb5dba/src/pybind/mgr/pg_autoscaler/module.py#L441,
you can see that for each pool the larger value (max) of "actual_raw_used" and "target_bytes * raw_used_rate" is taken and then summed up.
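
In other words, the check boils down to something like the following minimal sketch (not the actual module code; the helper names and the plain-dict pool representation are mine, for illustration only):

# Hypothetical sketch of the overcommit check described above: per pool, take the
# larger of the observed raw usage and the raw size implied by target_size_bytes,
# sum those values, and compare the sum against the capacity of the CRUSH root.
def subtree_total_bytes(pools):
    """pools: iterable of dicts with 'actual_raw_used', 'target_bytes', 'raw_used_rate'."""
    return sum(
        max(p['actual_raw_used'], p['target_bytes'] * p['raw_used_rate'])
        for p in pools
    )

def is_overcommitted(pools, root_capacity_bytes):
    # A ratio > 1.0 is what triggers POOL_TARGET_SIZE_BYTES_OVERCOMMITTED.
    ratio = subtree_total_bytes(pools) / root_capacity_bytes
    return ratio > 1.0, ratio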

I dumped the values for all pools in my cluster with yet another line of debug code:


pool_id 1 - actual_raw_used=303160109187420.0, target_bytes=0 raw_used_rate=3.0
pool_id 2 - actual_raw_used=5714098884702.0, target_bytes=0 raw_used_rate=3.0
pool_id 19 - actual_raw_used=256550760.0, target_bytes=0 raw_used_rate=3.0
pool_id 21 - actual_raw_used=71433.0, target_bytes=0 raw_used_rate=3.0
pool_id 22 - actual_raw_used=0.0, target_bytes=0 raw_used_rate=3.0
pool_id 23 - actual_raw_used=5262798.0, target_bytes=0 raw_used_rate=3.0
pool_id 24 - actual_raw_used=162299940.0, target_bytes=0 raw_used_rate=3.0
pool_id 25 - actual_raw_used=16083.0, target_bytes=0 raw_used_rate=3.0
pool_id 26 - actual_raw_used=3728679936.0, target_bytes=0 raw_used_rate=3.0
pool_id 27 - actual_raw_used=10035209699328.0, target_bytes=5497558138880 raw_used_rate=3.0
pool_id 28 - actual_raw_used=0.0, target_bytes=0 raw_used_rate=3.0

All values but those of pool_id 1 (backups) make sense. For backups it's just reporting a MUCH larger actual_raw_used value than what is shown via ceph df.
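
As a rough sanity check (my own back-of-the-envelope arithmetic, not taken from the code), the dumped number lines up with the logical STORED bytes multiplied by the replication factor rather than with the raw USED bytes from ceph df:

# 'backups' pool, numbers taken from the `ceph df detail` output above
stored_tib = 92                       # STORED (logical, uncompressed)
raw_used_rate = 3.0                   # replicated pool, size=3
print(stored_tib * raw_used_rate)     # 276.0 TiB
print(303160109187420 / 2**40)        # ~275.7 TiB -- the dumped actual_raw_used
# ...whereas ceph df reports only 156 TiB USED on disk for this pool.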

The only difference between that pool and the others is that compression is enabled:

# ceph osd pool get backups compression_mode
compression_mode: aggressive

Apparently there already was a similar issue (https://tracker.ceph.com/issues/41567) with a resulting commit (https://github.com/ceph/ceph/commit/dd6e752826bc762095be4d276e3c1b8d31293eb0)
which changed the source of "pool_logical_used" from the "bytes_used" to the "stored" field.

But how does that take compressed (away) data into account? Does "stored", unlike "bytes_used", simply count all logical bytes, i.e. sum up the uncompressed bytes for pools with compression?
This surely must be a bug then, as those bytes are not really "actual_raw_used".


Related issues

Copied to mgr - Backport #62888: quincy: pg_autoscaler counting pools uncompressed bytes as total_bytes triggering false POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warnings Resolved
Copied to mgr - Backport #62889: reef: pg_autoscaler counting pools uncompressed bytes as total_bytes triggering false POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warnings Resolved
Copied to mgr - Backport #62890: pacific: pg_autoscaler counting pools uncompressed bytes as total_bytes triggering false POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warnings Rejected

History

#1 Updated by Neha Ojha over 1 year ago

  • Category set to pg_autoscaler module
  • Assignee set to Kamoltat (Junior) Sirivadhna

#2 Updated by Christian Rohmann over 1 year ago

Thanks Kamoltat for looking into this issue. Please let me know if there are any more details I could provide.

#3 Updated by Christian Rohmann about 1 year ago

May I ask if you have had a chance to verify this?

#4 Updated by Kamoltat (Junior) Sirivadhna 10 months ago

Thank you so much for filing this issue and including all the details, I'm currently looking into this.

#5 Updated by Christian Rohmann 10 months ago

Kamoltat (Junior) Sirivadhna wrote:

Thank you so much for filing this issue and including all the details, I'm currently looking into this.

Awesome. Please let me know if there are any more details I could provide from our cluster to help you narrow this down.

#7 Updated by Kamoltat (Junior) Sirivadhna 10 months ago

  • Status changed from New to Fix Under Review

#8 Updated by Christian Rohmann 9 months ago

Thanks again for the proposed fix.

I am wondering about yet another value in the POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warning: the "target_size_bytes".
Here are the details from my cluster:

# ceph health detail
[WRN] POOL_TARGET_SIZE_BYTES_OVERCOMMITTED: 1 subtrees have overcommitted pool target_size_bytes
    Pools ['backups', 'images', '.mgr', '.rgw.root', 'redacted.rgw.control', 'redacted.rgw.meta', 'redacted.rgw.log', 'redacted.rgw.otp', 'redacted.rgw.buckets.index', 'redacted.rgw.buckets.data', 'redacted.rgw.buckets.non-ec'] overcommit available storage by 1.114x due to target_size_bytes  255T on pools ['backups', 'redacted.rgw.buckets.data']

# ceph osd pool autoscale-status
POOL                           SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
backups                        105.7T  76800G       3.0   312.9T        1.0138                                 1.0   1024                on         False
images                         414.8G               3.0   312.9T        0.0039                                 1.0   64                  on         False
.mgr                           438.9M               3.0   312.9T        0.0000                                 1.0   1                   on         False
.rgw.root                      37941                3.0   312.9T        0.0000                                 1.0   32                  on         False
redacted.rgw.control           0                    3.0   312.9T        0.0000                                 1.0   32                  on         False
redacted.rgw.meta              744.5k               3.0   312.9T        0.0000                                 1.0   32                  on         False
redacted.rgw.log               236.0M               3.0   312.9T        0.0000                                 1.0   32                  on         False
redacted.rgw.otp               1394                 3.0   312.9T        0.0000                                 1.0   32                  on         False
redacted.rgw.buckets.index     701.6M               3.0   312.9T        0.0000                                 1.0   32                  on         False
redacted.rgw.buckets.data      9866G   10240G       3.0   312.9T        0.0959                                 1.0   32                  on         False
redacted.rgw.buckets.non-ec    0                    3.0   312.9T        0.0000                                 1.0   32                  on         False

# ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    313 TiB  101 TiB  212 TiB   212 TiB      67.73
TOTAL  313 TiB  101 TiB  212 TiB   212 TiB      67.73

--- POOLS ---
POOL                           ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
backups                         1  1024  106 TiB   31.66M  180 TiB  79.54     15 TiB
images                          2    64  415 GiB   78.70k  1.2 TiB   2.56     15 TiB
.mgr                           19     1  439 MiB       83  1.3 GiB      0     15 TiB
.rgw.root                      21    32   37 KiB       31  4.9 MiB      0     15 TiB
redacted.rgw.control           22    32      0 B        8      0 B      0     15 TiB
redacted.rgw.meta              23    32  745 KiB    2.55k  322 MiB      0     15 TiB
redacted.rgw.log               24    32  236 MiB   11.81k  853 MiB      0     15 TiB
redacted.rgw.otp               25    32  1.4 KiB        0  4.1 KiB      0     15 TiB
redacted.rgw.buckets.index     26    32  702 MiB   13.36k  2.1 GiB      0     15 TiB
redacted.rgw.buckets.data      27    32  9.6 TiB   11.16M   30 TiB  39.35     15 TiB
redacted.rgw.buckets.non-ec    28    32      0 B        0      0 B      0     15 TiB

As you can see, it complains about over-commitment "due to target_size_bytes 255T on pools".
But the sum of the target sizes is actually only 1/3 of 255T; the factor of three apparently comes from the replication factor.
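
For reference, here is where the 255T appears to come from (my own arithmetic, assuming the warning multiplies each pool's target_size_bytes by its raw_used_rate):

# TARGET SIZE values from the autoscale-status output above, in GiB
target_backups = 76800      # 'backups'                    -> 75 TiB
target_buckets = 10240      # 'redacted.rgw.buckets.data'  -> 10 TiB
raw_used_rate = 3.0         # replicated pools, size=3
print((target_backups + target_buckets) * raw_used_rate / 1024)   # 255.0 TiB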

#9 Updated by Kamoltat (Junior) Sirivadhna 6 months ago

https://github.com/ceph/ceph/pull/51921 should fix your last comment: https://tracker.ceph.com/issues/54136#note-8

Please verify that the fix works, perhaps by spinning up a new cluster with the latest ceph:main upstream branch and creating the same pools as you did here.

#10 Updated by Kamoltat (Junior) Sirivadhna 6 months ago

I'll leave the tracker open for a while.

#11 Updated by Kamoltat (Junior) Sirivadhna 6 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to reef, quincy, pacific

#12 Updated by Backport Bot 6 months ago

  • Copied to Backport #62888: quincy: pg_autoscaler counting pools uncompressed bytes as total_bytes triggering false POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warnings added

#13 Updated by Backport Bot 6 months ago

  • Copied to Backport #62889: reef: pg_autoscaler counting pools uncompressed bytes as total_bytes triggering false POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warnings added

#14 Updated by Backport Bot 6 months ago

  • Copied to Backport #62890: pacific: pg_autoscaler counting pools uncompressed bytes as total_bytes triggering false POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warnings added

#15 Updated by Backport Bot 6 months ago

  • Tags set to backport_processed

#16 Updated by Christian Rohmann 6 months ago

Kamoltat (Junior) Sirivadhna wrote:

https://github.com/ceph/ceph/pull/51921 should fix your last comment: https://tracker.ceph.com/issues/54136#note-8
Please verify if the fix worked, by maybe spinning up a new cluster with the latest ceph:main upstream branch and create the same pools as you did here.

Could you kindly point me to the easiest way to access builds of the current ceph:main branch?

#17 Updated by Kamoltat (Junior) Sirivadhna 6 months ago

If you are using docker by any chance,

you can pull the latest main build image: docker pull quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:main

else I think you can just pull it from: https://github.com/ceph/ceph.git and build it locally

#18 Updated by Christian Rohmann 6 months ago

Kamoltat (Junior) Sirivadhna wrote:

If you are using docker by any chance,
you can pull the latest main build image: docker pull quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:main

I suppose that this registry is not public?

else I think you can just pull it from: https://github.com/ceph/ceph.git and build it locally

Yeah, but building Ceph takes forever and is prone to config mistakes. That's why I would rather use proper, already-built dev packages if there are any.

#19 Updated by Kamoltat (Junior) Sirivadhna 6 months ago

Hi Christian,

I'm afraid I don't know of any other way; I think your best bet is to build it locally. You can use the cmake options `./do_cmake.sh -DWITH_MANPAGE=OFF -DWITH_BABELTRACE=OFF -DWITH_RADOSGW=OFF -DWITH_MGR_DASHBOARD_FRONTEND=OFF` to turn off unnecessary modules, which will greatly speed up your build, and `ninja -j(#cores)` to build with multiple threads. Use vstart to start the cluster locally and you can test the change there.

#20 Updated by Kamoltat (Junior) Sirivadhna about 2 months ago

  • Status changed from Pending Backport to Resolved

All backports have been merged; if there are no objections, I'm closing this tracker.
