Bug #57105

quincy: ceph osd pool set <pool> size math error

Added by Brian Woods over 1 year ago. Updated 12 months ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Context: I created a pool with a block device and intentionally filled a set of OSDs.

This of course broke things. I deleted the pool with the image on it and then went back through my deployment script to move on to my next test.

When attempting to set the pool size, Ceph calculates the total PG count incorrectly, blocking the change:

root@backups:# ceph osd pool get BlockDevices-WriteCache size
size: 3
root@backups:# ceph osd pool set BlockDevices-WriteCache size 2
Error ERANGE: pool id 19 pg_num 1 size 2 would mean 18446744073709551615 total pgs, which exceeds max 750 (mon_max_pg_per_osd 250 * num_in_osds 3)

This is an empty pool with nothing special set:

CacheMinSize=2
Cache_max_bytes=5368709120        #5GB
Cache_target_max_objects=1024
Cache_target_dirty_ratio=0.2
Cache_target_dirty_high_ratio=0.5
Cache_target_full_ratio=0.6
Cache_min_flush_age=3000
Cache_min_evict_age=3000
CompressionMode=aggressive
CompressionAlgorithm=zstd

root@backups:# rados df
POOL_NAME                        USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS       RD  WR_OPS       WR  USED COMPR  UNDER COMPR
.mgr                          2.6 MiB        2       0       6                   0        0         0    1634  3.3 MiB    1419   25 MiB         0 B          0 B
.rgw.root                     2.5 KiB        6       0      12                   0        0         0     267  267 KiB       6    6 KiB         0 B          0 B
BlockDevices                      0 B        0       0       0                   0        0         0       0      0 B       0      0 B         0 B          0 B
BlockDevices-WriteCache           0 B        0       0       0                   0        0         0       0      0 B       0      0 B         0 B          0 B

root@backups:# ceph osd pool get BlockDevices-WriteCache min_size
min_size: 2
root@backups:# ceph osd pool get BlockDevices-WriteCache pg_num
pg_num: 1
root@backups:# ceph osd pool get BlockDevices-WriteCache pgp_num
pgp_num: 1
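For reference, the cap in the error message is just mon_max_pg_per_osd multiplied by the number of in OSDs, and a sane projected total for this pool sits far below it. A minimal sketch of that comparison with illustrative names (not the actual monitor code):

```python
# Illustrative reconstruction of the cap described in the ERANGE message.
# Variable names mirror the error text; the real check lives in the monitor.
mon_max_pg_per_osd = 250
num_in_osds = 3
max_pgs = mon_max_pg_per_osd * num_in_osds  # 750, matching the error output

# The pool in question: pg_num 1, requested size 2.
pg_num, requested_size = 1, 2
projected_pgs = pg_num * requested_size     # 2, nowhere near the limit

print(projected_pgs <= max_pgs)             # True: only the underflow makes it "exceed" the max
```

So the change should be accepted; the 18446744073709551615 figure in the error can only come from a wrapped-around unsigned value, not from a legitimate projection.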

Related issues

Related to RADOS - Bug #58288: quincy: mon: pg_num_check() according to crush rule Resolved
Duplicated by RADOS - Bug #54188: Setting too many PGs leads error handling overflow Resolved

History

#1 Updated by Brian Woods over 1 year ago

Looks like one of the placement groups is "inactive":

    health: HEALTH_WARN
...
            Reduced data availability: 1 pg inactive

And the GUI shows the pool's status as:

1 unknown

This is a brand new pool; nothing has been added to it other than tiering.

#2 Updated by Brian Woods over 1 year ago

I thought this might have been because I re-used the name, so I created a pool with a different name to continue my tests, and it did the same thing.

Seems I have broken something good.

#3 Updated by Patrick Donnelly over 1 year ago

  • Project changed from cephsqlite to RADOS

#4 Updated by Brian Woods over 1 year ago

I created a new cluster today to do a very specific test and ran into this (or something like it) again. In this test I could only spare a single SSD for metadata (I didn't care about redundancy), so after assigning that SSD to an OSD and attempting to create a pool, I got this:

# ceph osd pool set CephFS-Meta size 1 --yes-i-really-mean-it
Error ERANGE: pool id 6 pg_num 8 size 1 would mean 18446744073709551608 total pgs, which exceeds max 750 (mon_max_pg_per_osd 250 * num_in_osds 3)

#5 Updated by Brian Woods over 1 year ago

Setting the size (from 3) to 2, then setting it to 1 works...

# ceph osd pool set CephFS-Meta size 1 --yes-i-really-mean-it
Error ERANGE: pool id 6 pg_num 8 size 1 would mean 18446744073709551608 total pgs, which exceeds max 750 (mon_max_pg_per_osd 250 * num_in_osds 3)
# ceph osd pool get CephFS-Meta size                                     
size: 3
# ceph osd pool set CephFS-Meta size 2 --yes-i-really-mean-it
set pool 6 size to 2
# ceph osd pool set CephFS-Meta size 1 --yes-i-really-mean-it
set pool 6 size to 1

#6 Updated by Brian Woods over 1 year ago

Looks like in both cases something is being subtracted past zero from an unsigned int64, causing it to wrap around (underflow).

2^64 − 18,446,744,073,709,551,615 = 1
2^64 − 18,446,744,073,709,551,608 = 8

This is the same type of issue as a very old one I opened here:
https://tracker.ceph.com/issues/22539

If I have encountered this in 3 separate places, a larger effort to check for overflows might not be a bad idea.
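The arithmetic above can be reproduced directly: a negative delta computed in unsigned 64-bit arithmetic wraps modulo 2^64. The exact operands in the monitor code are not shown in this report; the sketch below only demonstrates that the two reported values are the wraparounds of -1 and -8.

```python
# Demonstration of uint64 wraparound matching the two reported values.
# The operands here are illustrative, not the monitor's actual expression.
U64 = 2**64

def u64_sub(a, b):
    """Subtract b from a with uint64 wraparound semantics."""
    return (a - b) % U64

# Description case: a delta of -1 (e.g. a projected total one PG smaller)
print(u64_sub(1 * 2, 1 * 3))  # 18446744073709551615 == 2^64 - 1

# Comment #4 case: a delta of -8
print(u64_sub(8 * 1, 8 * 2))  # 18446744073709551608 == 2^64 - 8
```

Either computing the delta as a signed value or checking for new < old before subtracting would avoid the wraparound.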

#7 Updated by Radoslaw Zarzynski over 1 year ago

  • Related to Bug #54188: Setting too many PGs leads error handling overflow added

#8 Updated by Radoslaw Zarzynski over 1 year ago

  • Assignee set to Kamoltat (Junior) Sirivadhna

#9 Updated by Matan Breizman about 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 49507

This PR was proposed after a BZ report of the same issue.

#10 Updated by Matan Breizman about 1 year ago

  • Duplicated by Bug #54188: Setting too many PGs leads error handling overflow added

#11 Updated by Matan Breizman about 1 year ago

  • Related to deleted (Bug #54188: Setting too many PGs leads error handling overflow)

#12 Updated by Kamoltat (Junior) Sirivadhna about 1 year ago

  • Assignee changed from Kamoltat (Junior) Sirivadhna to Matan Breizman

#13 Updated by Matan Breizman about 1 year ago

  • Subject changed from ceph osd pool set <pool> size math error to quincy: ceph osd pool set <pool> size math error

This was fixed in main (https://github.com/ceph/ceph/pull/44430) but was not backported to Quincy.
Instead of backporting the fix, a revert was pushed to address the reported underflow: https://github.com/ceph/ceph/pull/49465

(After the revert is merged, a fix will be backported as well. This is tracked here: https://tracker.ceph.com/issues/58288)

#14 Updated by Matan Breizman about 1 year ago

  • Related to Bug #58288: quincy: mon: pg_num_check() according to crush rule added

#15 Updated by Matan Breizman about 1 year ago

  • Pull request ID changed from 49507 to 49465

#17 Updated by Matan Breizman 12 months ago

  • Status changed from Fix Under Review to Resolved
