Bug #54136
pg_autoscaler counting pools uncompressed bytes as total_bytes triggering false POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warnings
Description
I am debugging an mgr pg_autoscaler warning which states that the target_size_bytes set on a pool would overcommit the available storage.
Only one pool has a target_size_bytes value defined (5T), and that pool apparently would consume more than the available storage:
# ceph health detail
HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes
[WRN] POOL_TARGET_SIZE_BYTES_OVERCOMMITTED: 1 subtrees have overcommitted pool target_size_bytes
    Pools ['backups', 'images', 'device_health_metrics', '.rgw.root', 'redacted.rgw.control', 'redacted.rgw.meta', 'redacted.rgw.log', 'redacted.rgw.otp', 'redacted.rgw.buckets.index', 'redacted.rgw.buckets.data', 'redacted.rgw.buckets.non-ec'] overcommit available storage by 1.011x due to target_size_bytes 15.0T on pools ['redacted.rgw.buckets.data'].
But looking at the actual usage, it seems strange that 15T (5T * 3 replicas) should not fit into the remaining 122 TiB AVAIL:
# ceph df detail
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    293 TiB  122 TiB  171 TiB  171 TiB   58.44
TOTAL  293 TiB  122 TiB  171 TiB  171 TiB   58.44

--- POOLS ---
POOL                         ID  PGS   STORED   (DATA)   (OMAP)   OBJECTS  USED     (DATA)   (OMAP)   %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
backups                      1   1024  92 TiB   92 TiB   3.8 MiB  28.11M   156 TiB  156 TiB  11 MiB   64.77  28 TiB     N/A            N/A          N/A    39 TiB      123 TiB
images                       2   64    1.7 TiB  1.7 TiB  249 KiB  471.72k  5.2 TiB  5.2 TiB  748 KiB  5.81   28 TiB     N/A            N/A          N/A    0 B         0 B
device_health_metrics        19  1     82 MiB   0 B      82 MiB   43       245 MiB  0 B      245 MiB  0      28 TiB     N/A            N/A          N/A    0 B         0 B
.rgw.root                    21  32    23 KiB   23 KiB   0 B      25       4.1 MiB  4.1 MiB  0 B      0      28 TiB     N/A            N/A          N/A    0 B         0 B
redacted.rgw.control         22  32    0 B      0 B      0 B      8        0 B      0 B      0 B      0      28 TiB     N/A            N/A          N/A    0 B         0 B
redacted.rgw.meta            23  32    1.7 MiB  394 KiB  1.3 MiB  1.38k    237 MiB  233 MiB  3.9 MiB  0      28 TiB     N/A            N/A          N/A    0 B         0 B
redacted.rgw.log             24  32    53 MiB   500 KiB  53 MiB   7.60k    204 MiB  47 MiB   158 MiB  0      28 TiB     N/A            N/A          N/A    0 B         0 B
redacted.rgw.otp             25  32    5.2 KiB  0 B      5.2 KiB  0        16 KiB   0 B      16 KiB   0      28 TiB     N/A            N/A          N/A    0 B         0 B
redacted.rgw.buckets.index   26  32    1.2 GiB  0 B      1.2 GiB  7.46k    3.5 GiB  0 B      3.5 GiB  0      28 TiB     N/A            N/A          N/A    0 B         0 B
redacted.rgw.buckets.data    27  128   3.1 TiB  3.1 TiB  0 B      3.53M    9.5 TiB  9.5 TiB  0 B      10.11  28 TiB     N/A            N/A          N/A    0 B         0 B
redacted.rgw.buckets.non-ec  28  32    0 B      0 B      0 B      0        0 B      0 B      0 B      0      28 TiB     N/A            N/A          N/A    0 B         0 B
I then looked at how those values are determined at https://github.com/ceph/ceph/blob/9f723519257eca039126a20aa6a2a7d2dbfb5dba/src/pybind/mgr/pg_autoscaler/module.py#L509.
Apparently "total_bytes" are compared with the capacity of the root_map. I added a debug line and found that the total in my cluster was already at:
total=325511007759696
so roughly 296 TiB (325.5 TB), more than the cluster's entire 293 TiB raw capacity. Looking at "ceph df" again, this figure seems far too high.
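To put that figure in perspective, here is a quick conversion of the dumped total against the raw capacity from "ceph df" above (plain arithmetic, not autoscaler code; 293 TiB is the rounded SIZE value from the output):

    total = 325511007759696   # bytes, from the debug line above
    capacity = 293 * 2**40    # approx. raw SIZE reported by ceph df

    print(total / 2**40)      # ~296 TiB
    print(total / capacity)   # ~1.01, consistent with the reported 1.011x overcommit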
Looking at how this total is calculated at https://github.com/ceph/ceph/blob/9f723519257eca039126a20aa6a2a7d2dbfb5dba/src/pybind/mgr/pg_autoscaler/module.py#L441,
you can see that, per pool, the larger value (max) of "actual_raw_used" vs. "target_bytes * raw_used_rate" is determined and then summed up.
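In other words, the logic amounts to roughly the following (a minimal Python sketch paraphrasing the linked code; the dictionary field names are simplified stand-ins, not the exact upstream variables):

    def subtree_total_bytes(pools):
        # Per pool, take whichever is larger: the pool's current raw usage
        # or its raw-adjusted target size, then sum across all pools.
        total = 0
        for pool in pools:
            actual_raw_used = pool['logical_used'] * pool['raw_used_rate']
            target_raw = pool['target_bytes'] * pool['raw_used_rate']
            total += max(actual_raw_used, target_raw)
        return total

The warning then fires when this total exceeds the capacity of the subtree's root.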
I dumped the values for all pools in my cluster with yet another line of debug code:
pool_id 1  - actual_raw_used=303160109187420.0, target_bytes=0 raw_used_rate=3.0
pool_id 2  - actual_raw_used=5714098884702.0, target_bytes=0 raw_used_rate=3.0
pool_id 19 - actual_raw_used=256550760.0, target_bytes=0 raw_used_rate=3.0
pool_id 21 - actual_raw_used=71433.0, target_bytes=0 raw_used_rate=3.0
pool_id 22 - actual_raw_used=0.0, target_bytes=0 raw_used_rate=3.0
pool_id 23 - actual_raw_used=5262798.0, target_bytes=0 raw_used_rate=3.0
pool_id 24 - actual_raw_used=162299940.0, target_bytes=0 raw_used_rate=3.0
pool_id 25 - actual_raw_used=16083.0, target_bytes=0 raw_used_rate=3.0
pool_id 26 - actual_raw_used=3728679936.0, target_bytes=0 raw_used_rate=3.0
pool_id 27 - actual_raw_used=10035209699328.0, target_bytes=5497558138880 raw_used_rate=3.0
pool_id 28 - actual_raw_used=0.0, target_bytes=0 raw_used_rate=3.0
All values but those of pool_id 1 (backups) make sense. For backups it's just reporting a MUCH larger actual_raw_used value than what is shown via ceph df.
The only difference between that pool and the others is that compression is enabled:
# ceph osd pool get backups compression_mode
compression_mode: aggressive
Apparently there already was a similar issue (https://tracker.ceph.com/issues/41567) with a resulting commit (https://github.com/ceph/ceph/commit/dd6e752826bc762095be4d276e3c1b8d31293eb0)
which changed the source of "pool_logical_used" from the "bytes_used" field to the "stored" field.
But how does that take compressed (away) data into account? Does the "stored" field count all logical bytes, i.e. sum up the uncompressed bytes, even for pools with compression enabled?
This surely must be a bug then, as those bytes are not really "actual_raw_used".
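A back-of-the-envelope check with the "backups" numbers from above makes the mismatch obvious (a hedged illustration only, not autoscaler code):

    stored = 92 * 2**40          # logical bytes, STORED column in ceph df
    raw_used_rate = 3.0          # replica count
    used_on_disk = 156 * 2**40   # raw bytes actually occupied, USED column

    actual_raw_used = stored * raw_used_rate
    print(actual_raw_used / 2**40)   # ~276 TiB, matching the dumped 303160109187420
    print(used_on_disk / 2**40)      # 156 TiB is what the pool really uses on disk

So the autoscaler charges the pool for its uncompressed bytes times the replication factor, ignoring the compression savings that ceph df already accounts for.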
Related issues
History
#1 Updated by Neha Ojha over 1 year ago
- Category set to pg_autoscaler module
- Assignee set to Kamoltat (Junior) Sirivadhna
#2 Updated by Christian Rohmann over 1 year ago
Thanks Kamoltat for looking into this issue. Please let me know if there are any more details I could provide.
#3 Updated by Christian Rohmann about 1 year ago
May I ask if you have had the chance to verify this?
#4 Updated by Kamoltat (Junior) Sirivadhna 10 months ago
Thank you so much for filing this issue and including all the details, I'm currently looking into this.
#5 Updated by Christian Rohmann 10 months ago
Kamoltat (Junior) Sirivadhna wrote:
Thank you so much for filing this issue and including all the details, I'm currently looking into this.
Awesome. Please let me know if there are any more details I could provide from our cluster to help you narrow this down.
#6 Updated by Kamoltat (Junior) Sirivadhna 10 months ago
- Pull request ID set to 51921
#7 Updated by Kamoltat (Junior) Sirivadhna 10 months ago
- Status changed from New to Fix Under Review
#8 Updated by Christian Rohmann 9 months ago
Thanks again for the proposed fix.
I am wondering about yet another value in the POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warning: the "target_size_bytes".
Here are the details from my cluster:
# ceph health detail
[WRN] POOL_TARGET_SIZE_BYTES_OVERCOMMITTED: 1 subtrees have overcommitted pool target_size_bytes
    Pools ['backups', 'images', '.mgr', '.rgw.root', 'redacted.rgw.control', 'redacted.rgw.meta', 'redacted.rgw.log', 'redacted.rgw.otp', 'redacted.rgw.buckets.index', 'redacted.rgw.buckets.data', 'redacted.rgw.buckets.non-ec'] overcommit available storage by 1.114x due to target_size_bytes 255T on pools ['backups', 'redacted.rgw.buckets.data']

# ceph osd pool autoscale-status
POOL                         SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
backups                      105.7T  76800G       3.0   312.9T        1.0138                                 1.0   1024                on         False
images                       414.8G               3.0   312.9T        0.0039                                 1.0   64                  on         False
.mgr                         438.9M               3.0   312.9T        0.0000                                 1.0   1                   on         False
.rgw.root                    37941                3.0   312.9T        0.0000                                 1.0   32                  on         False
redacted.rgw.control         0                    3.0   312.9T        0.0000                                 1.0   32                  on         False
redacted.rgw.meta            744.5k               3.0   312.9T        0.0000                                 1.0   32                  on         False
redacted.rgw.log             236.0M               3.0   312.9T        0.0000                                 1.0   32                  on         False
redacted.rgw.otp             1394                 3.0   312.9T        0.0000                                 1.0   32                  on         False
redacted.rgw.buckets.index   701.6M               3.0   312.9T        0.0000                                 1.0   32                  on         False
redacted.rgw.buckets.data    9866G   10240G       3.0   312.9T        0.0959                                 1.0   32                  on         False
redacted.rgw.buckets.non-ec  0                    3.0   312.9T        0.0000                                 1.0   32                  on         False

# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    313 TiB  101 TiB  212 TiB  212 TiB   67.73
TOTAL  313 TiB  101 TiB  212 TiB  212 TiB   67.73

--- POOLS ---
POOL                         ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
backups                      1   1024  106 TiB  31.66M   180 TiB  79.54  15 TiB
images                       2   64    415 GiB  78.70k   1.2 TiB  2.56   15 TiB
.mgr                         19  1     439 MiB  83       1.3 GiB  0      15 TiB
.rgw.root                    21  32    37 KiB   31       4.9 MiB  0      15 TiB
redacted.rgw.control         22  32    0 B      8        0 B      0      15 TiB
redacted.rgw.meta            23  32    745 KiB  2.55k    322 MiB  0      15 TiB
redacted.rgw.log             24  32    236 MiB  11.81k   853 MiB  0      15 TiB
redacted.rgw.otp             25  32    1.4 KiB  0        4.1 KiB  0      15 TiB
redacted.rgw.buckets.index   26  32    702 MiB  13.36k   2.1 GiB  0      15 TiB
redacted.rgw.buckets.data    27  32    9.6 TiB  11.16M   30 TiB   39.35  15 TiB
redacted.rgw.buckets.non-ec  28  32    0 B      0        0 B      0      15 TiB
As you can see, it complains about overcommitment "due to target_size_bytes 255T on pools".
But the sum of the configured target sizes is actually only 85T, one third of 255T; the reported value has apparently been multiplied by the replication factor.
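The arithmetic behind the reported 255T, as a sanity check on the values shown above (plain Python; nothing cluster-specific beyond the two TARGET SIZE values):

    target_backups = 76800 * 2**30        # bytes, TARGET SIZE of 'backups'
    target_buckets_data = 10240 * 2**30   # bytes, TARGET SIZE of 'redacted.rgw.buckets.data'
    raw_used_rate = 3.0                   # replica count

    total_target = target_backups + target_buckets_data
    print(total_target / 2**40)                   # 85 TiB actually configured
    print(total_target * raw_used_rate / 2**40)   # 255 TiB, the value shown in the warning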
#9 Updated by Kamoltat (Junior) Sirivadhna 6 months ago
https://github.com/ceph/ceph/pull/51921 should fix your last comment: https://tracker.ceph.com/issues/54136#note-8
Please verify that the fix works, perhaps by spinning up a new cluster from the latest ceph:main upstream branch and creating the same pools as you did here.
#10 Updated by Kamoltat (Junior) Sirivadhna 6 months ago
I'll leave the tracker open for a while.
#11 Updated by Kamoltat (Junior) Sirivadhna 6 months ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to reef, quincy, pacific
#12 Updated by Backport Bot 6 months ago
- Copied to Backport #62888: quincy: pg_autoscaler counting pools uncompressed bytes as total_bytes triggering false POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warnings added
#13 Updated by Backport Bot 6 months ago
- Copied to Backport #62889: reef: pg_autoscaler counting pools uncompressed bytes as total_bytes triggering false POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warnings added
#14 Updated by Backport Bot 6 months ago
- Copied to Backport #62890: pacific: pg_autoscaler counting pools uncompressed bytes as total_bytes triggering false POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warnings added
#15 Updated by Backport Bot 6 months ago
- Tags set to backport_processed
#16 Updated by Christian Rohmann 6 months ago
Kamoltat (Junior) Sirivadhna wrote:
https://github.com/ceph/ceph/pull/51921 should fix your last comment: https://tracker.ceph.com/issues/54136#note-8
Please verify that the fix works, perhaps by spinning up a new cluster from the latest ceph:main upstream branch and creating the same pools as you did here.
Could you kindly point me to the easiest way to access builds of the current ceph:main branch?
#17 Updated by Kamoltat (Junior) Sirivadhna 6 months ago
If you are using Docker by any chance,
you can pull the latest main build image: docker pull quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:main
Otherwise, I think you can just clone https://github.com/ceph/ceph.git and build it locally.
#18 Updated by Christian Rohmann 6 months ago
Kamoltat (Junior) Sirivadhna wrote:
If you are using Docker by any chance,
you can pull the latest main build image: docker pull quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:main
I suppose that this registry is not public?
Otherwise, I think you can just clone https://github.com/ceph/ceph.git and build it locally.
Yeah, but building Ceph takes forever and is prone to configuration mistakes. That's why I would rather use properly pre-built dev packages, if there are any.
#19 Updated by Kamoltat (Junior) Sirivadhna 6 months ago
Hi Christian,
I'm afraid I don't know of any other way; I think your best bet is to build it locally. You can use the cmake options `./do_cmake.sh -DWITH_MANPAGE=OFF -DWITH_BABELTRACE=OFF -DWITH_RADOSGW=OFF -DWITH_MGR_DASHBOARD_FRONTEND=OFF` to turn off unnecessary modules, which will greatly speed up your build, and use `ninja -j(#cores)` to build with multiple threads. Use vstart to start a cluster locally, and you can test the change there.
#20 Updated by Kamoltat (Junior) Sirivadhna about 2 months ago
- Status changed from Pending Backport to Resolved
All backports have been merged; if there are no objections, I'm closing this tracker.