Bug #53559
Balancer bug - very slow performance (minutes) in some cases
Description
When running osdmaptool with the attached file, it works fine for --upmap-max values up to 15 (0.5 seconds). With --upmap-max 16 it completes in 20 seconds, with 17 it takes almost 2 minutes and I did not try values in between; with 25 it did not complete within 10 minutes. This was also verified against an older stable version [14.2.11-95.el8cp (1d6087ae858e7c8e72fe7390c3522c7e0d951240) nautilus (stable)].
The command line that reproduces the problem is:
osdmaptool osdmap.GD.bin --upmap out.txt --upmap-deviation 1 --upmap-pool default.rgw.buckets.data --upmap-max 25
(you need --upmap-deviation 1 because otherwise there are fewer than 15 changes, as this pool is relatively balanced)
History
#1 Updated by Dan van der Ster over 2 years ago
I haven't looked at your osdmap, but in our experience, if a cluster is "impossible" to balance (e.g. highly non-uniform failure-domain sizes) and it has many PGs, then the balancer takes a long time (because it tries very hard to find the impossible solution).
Maybe we should add a timeout...
#2 Updated by Josh Salomon over 2 years ago
I am not sure how "impossible" the situation is (I got this file and I did not dive into the configuration) but I succeeded in reducing the time it takes to complete considerably (from ~1 minute to 5 seconds for upmap-max 16 and from forever to ~40 seconds for upmap-max 25). I opened this issue as a first step for a PR.
There is a small cost: in some rare cases the results I get from the new code are a bit different from the results with the old code. I am not sure this is significant; in some cases it is clear the solutions have the same value, in other cases I am not sure.
#3 Updated by Laura Flores over 2 years ago
- Status changed from New to In Progress
- Assignee set to Josh Salomon
#4 Updated by Dan van der Ster over 2 years ago
OK now I understand the context.
Anyway I had a look at the osdmap -- the cluster has only 3 racks, and they are not equal in size:
-86  1672.44946  root default
-85   544.04980      rack SB02-07
-102   80.59998          chassis SB02-07-13
-81    80.59998          chassis SB02-07-17
-78    80.59998          chassis SB02-07-21
-75    60.44998          chassis SB02-07-34
-84    80.59998          chassis SB02-07-38
-132   80.59998          chassis SB02-07-5
-180   80.59998          chassis SB02-07-9
-71   564.19983      rack SB02-08
-325   20.14999          chassis SB02-08-1
-112   80.59998          chassis SB02-08-13
-64    80.59998          chassis SB02-08-17
-70    80.59998          chassis SB02-08-21
-67    80.59998          chassis SB02-08-34
-60    80.59998          chassis SB02-08-38
-142   80.59998          chassis SB02-08-5
-168   60.44998          chassis SB02-08-9
-57   564.19983      rack SB02-09
-122   80.59998          chassis SB02-09-13
-51    80.59998          chassis SB02-09-17
-56    80.59998          chassis SB02-09-21
-48    80.59998          chassis SB02-09-34
-54    80.59998          chassis SB02-09-38
-150   80.59998          chassis SB02-09-5
-190   80.59998          chassis SB02-09-9
And all the pools are size 3.
So this is an example of "impossible" -- because of CRUSH basics, each rack will get the same number of PGs. However, the balancer is going to try (very hard) to somehow place fewer PGs in rack SB02-07.
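To illustrate the point with numbers (a small sketch, not part of the original report; the rack weights are taken from the crush tree above, and 16384 is the PG count of default.rgw.buckets.data mentioned later in this thread): with exactly 3 racks as failure domains and a replicated size of 3, CRUSH must place one replica of every PG in each rack, so each rack's PG count is fixed and only the PG density per unit of weight differs.

```python
# Illustrative calculation only: rack weights come from the crush tree
# above; pg_num is the 16384-PG pool size mentioned in this thread.
pg_num = 16384
rack_weights = {
    "SB02-07": 544.04980,
    "SB02-08": 564.19983,
    "SB02-09": 564.19983,
}

# Size-3 replication across 3 racks: every rack holds exactly one
# replica of every PG, so each rack holds pg_num PG replicas no matter
# what upmaps the balancer generates.
pgs_per_rack = {rack: pg_num for rack in rack_weights}

# PG replicas per unit of crush weight: the smaller rack (SB02-07) is
# necessarily denser, and no inter-rack PG move can change that -- this
# is the "impossible" target the balancer keeps searching for.
for rack, weight in rack_weights.items():
    print(f"{rack}: {pgs_per_rack[rack] / weight:.2f} PG replicas per weight unit")
```

The printed densities differ by about 3.7% (roughly 30.11 for SB02-07 versus 29.04 for the other two racks), which matches Dan's observation that the imbalance cannot be balanced away at the rack level.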
#5 Updated by Josh Salomon over 2 years ago
Thanks Dan, indeed the pool has 16384 PGs and I understand why it takes so long (another pool in this file, with 1024 PGs, completes in a reasonable time). Still, I think I have a pretty simple fix which will make the balancer realize much faster that there is no more work to do...
Please note that the balancer (the existing one) still finds 15 changes to make quite easily, and 24 in total.
BTW - one other problem that exists, and that I am not sure is handled by the existing balancer, is moving data within the failure domain, between OSDs that are not equally full at the same level - this should be possible in almost all circumstances.
#6 Updated by Neha Ojha over 2 years ago
- Project changed from RADOS to mgr
- Category changed from Performance/Resource Usage to balancer module