Bug #53559

Balancer bug - very slow performance (minutes) in some cases

Added by Josh Salomon about 1 year ago. Updated about 1 year ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
balancer module
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When running osdmaptool with the attached file, it works fine for --upmap-max values up to 15 (0.5 seconds). With --upmap-max 16 it completes in 20 seconds; with 17 it takes almost 2 minutes; with 25 it did not complete within 10 minutes, and I did not try further. This was also verified against an older stable version [14.2.11-95.el8cp (1d6087ae858e7c8e72fe7390c3522c7e0d951240) nautilus (stable)].
The command line that reproduces the problem is:
osdmaptool osdmap.GD.bin --upmap out.txt --upmap-deviation 1 --upmap-pool default.rgw.buckets.data --upmap-max 25

(You need --upmap-deviation 1 because otherwise there are fewer than 15 changes, as this pool is relatively balanced.)

osdmap.GD.bin - the crush map file that creates the problem (231 KB) Josh Salomon, 12/09/2021 09:52 AM

History

#1 Updated by Dan van der Ster about 1 year ago

I haven't looked at your osdmap, but in our experience, if a cluster is "impossible" to balance (e.g. highly non-uniform sizes of failure domains) and it has many PGs, then the balancer takes a long time, because it tries very hard to find the impossible solution.
Maybe we should add a timeout...
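
The timeout idea could be sketched roughly as follows. This is hypothetical Python, not the actual balancer code; `improve_until_deadline`, `candidates`, and `score` are illustrative names, and the real calc_pg_upmaps loop is structured differently:

```python
import time

def improve_until_deadline(candidates, score, budget_s=0.05):
    """Stop searching for further improvements once a wall-clock budget
    is spent, instead of chasing an impossible perfect balance."""
    deadline = time.monotonic() + budget_s
    best = None
    for c in candidates:
        if time.monotonic() > deadline:
            break  # out of time: return the best candidate found so far
        if best is None or score(c) < score(best):
            best = c  # lower score = better balanced (illustrative)
    return best
```

The trade-off is that a timeout bounds the worst case but does not make the search itself smarter; on an unbalanceable map it would simply return whatever partial improvement was found within the budget.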

#2 Updated by Josh Salomon about 1 year ago

I am not sure how "impossible" the situation is (I got this file and did not dive into the configuration), but I succeeded in reducing the completion time considerably (from ~1 minute to 5 seconds for --upmap-max 16, and from forever to ~40 seconds for --upmap-max 25). I opened this issue as a first step toward a PR.
There is a small cost: in some rare cases the results from the new code are a bit different from the results with the old code. I am not sure this is significant; in some cases it is clear the solutions have the same value, in other cases I am not sure.

#3 Updated by Laura Flores about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Josh Salomon

#4 Updated by Dan van der Ster about 1 year ago

OK now I understand the context.

Anyway I had a look at the osdmap -- the cluster has only 3 racks, and they are not equal in size:

 -86         1672.44946  root default                                                  
 -85          544.04980      rack SB02-07                                              
-102           80.59998          chassis SB02-07-13                                    
 -81           80.59998          chassis SB02-07-17                                    
 -78           80.59998          chassis SB02-07-21                                    
 -75           60.44998          chassis SB02-07-34                                    
 -84           80.59998          chassis SB02-07-38                                    
-132           80.59998          chassis SB02-07-5                                     
-180           80.59998          chassis SB02-07-9                                     
 -71          564.19983      rack SB02-08                                              
-325           20.14999          chassis SB02-08-1                                     
-112           80.59998          chassis SB02-08-13                                    
 -64           80.59998          chassis SB02-08-17                                    
 -70           80.59998          chassis SB02-08-21                                    
 -67           80.59998          chassis SB02-08-34                                    
 -60           80.59998          chassis SB02-08-38                                    
-142           80.59998          chassis SB02-08-5                                     
-168           60.44998          chassis SB02-08-9                                     
 -57          564.19983      rack SB02-09                                              
-122           80.59998          chassis SB02-09-13                                    
 -51           80.59998          chassis SB02-09-17                                    
 -56           80.59998          chassis SB02-09-21                                    
 -48           80.59998          chassis SB02-09-34                                    
 -54           80.59998          chassis SB02-09-38                                    
-150           80.59998          chassis SB02-09-5                                     
-190           80.59998          chassis SB02-09-9                                     

And all the pools are size 3.

So this is an example of "impossible" -- because of CRUSH basics, each rack will get the same number of PGs. However, the balancer is going to try (very hard) to somehow place fewer PGs in rack SB02-07.
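
A quick back-of-the-envelope check makes the mismatch concrete (plain Python; the weights are copied from the crush tree above). With replicated size 3 and three racks as the failure domain, each rack must hold exactly one replica of every PG, i.e. one third of the data, regardless of its weight:

```python
# CRUSH weights of the three racks, copied from the tree above.
rack_weights = {
    "SB02-07": 544.04980,
    "SB02-08": 564.19983,
    "SB02-09": 564.19983,
}
total = sum(rack_weights.values())  # ~1672.45, matching root "default"

for rack, weight in rack_weights.items():
    # Fraction of raw capacity vs. fraction of replicas it is forced to hold.
    print(f"{rack}: capacity share {weight / total:.4f}, PG share {1 / 3:.4f}")
```

Rack SB02-07 holds about 32.5% of the raw capacity but is forced to carry 33.3% of the PGs, so its OSDs end up relatively fuller no matter which upmaps the balancer tries.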

#5 Updated by Josh Salomon about 1 year ago

Thanks Dan. Indeed the pool has 16384 PGs, and I understand why it takes so long (another pool in this file, with 1024 PGs, completes in time). Still, I think I have a fairly simple fix which will make the balancer realize much faster that there is no more work to do...
Please note that the balancer (the existing one) still finds 15 changes to make quite easily, and 24 in total.
BTW - one other problem that exists, and which I am not sure is handled by the existing balancer, is moving data within the failure domain between OSDs that are not equally full at the same level - this should be possible in almost all circumstances.

#6 Updated by Neha Ojha about 1 year ago

  • Project changed from RADOS to mgr
  • Category changed from Performance/Resource Usage to balancer module
