Bug #43586
mgr/balancer reports "Unable to find further optimization ...", but distribution is not perfect
Description
Hi,
I'm running ceph-mgr 14.2.6 with the balancer enabled.
This is the status of the balancer:
root@ld3955:~# date && time ceph balancer status
Mon Jan 13 11:00:09 CET 2020
{
"last_optimize_duration": "0:02:30.984016",
"plans": [],
"mode": "upmap",
"active": true,
"optimize_result": "in progress",
"last_optimize_started": "Mon Jan 13 10:58:35 2020"
}
real 0m0,380s
user 0m0,216s
sys 0m0,031s
root@ld3955:~# date && time ceph balancer status
Mon Jan 13 14:06:20 CET 2020
{
"last_optimize_duration": "0:02:32.787459",
"plans": [],
"mode": "upmap",
"active": true,
"optimize_result": "in progress",
"last_optimize_started": "Mon Jan 13 14:04:37 2020"
}
real 0m0,987s
user 0m0,228s
sys 0m0,027s
root@ld3955:~# date && time ceph balancer status
Mon Jan 13 15:19:48 CET 2020
{
"last_optimize_duration": "0:02:33.119116",
"plans": [],
"mode": "upmap",
"active": true,
"optimize_result": "Unable to find further optimization, or pool(s)' pg_num is decreasing, or distribution is already perfect",
"last_optimize_started": "Mon Jan 13 15:16:20 2020"
}
real 0m0,268s
user 0m0,220s
sys 0m0,025s
Since release 14.2.6 the command returns within seconds (previously it took minutes).
However, this result
"optimize_result": "Unable to find further optimization, or pool(s)' pg_num is decreasing, or distribution is already perfect"
is not accurate.
The data distribution across the 1.6 TB disks is extremely unbalanced:
root@ld3955:~# ceph osd df class hdd-strgbx | awk '{ print "osd."$1, "size: "$5, "usage: " $17, "reweight: "$4 }' | sort -nk5 | grep -v 7.3 | head
osd.ID size: SIZE usage: reweight: REWEIGHT
osd.MIN/MAX size: 12.42 usage: reweight: STDDEV:
osd.TOTAL size: TiB usage: reweight: 727
osd.205 size: 1.6 usage: 53.26 reweight: 1.00000
osd.100 size: 1.6 usage: 53.38 reweight: 1.00000
osd.243 size: 1.6 usage: 53.40 reweight: 1.00000
osd.255 size: 1.6 usage: 54.11 reweight: 1.00000
osd.154 size: 1.6 usage: 54.14 reweight: 1.00000
osd.106 size: 1.6 usage: 54.19 reweight: 1.00000
osd.262 size: 1.6 usage: 54.20 reweight: 1.00000
root@ld3955:~# ceph osd df class hdd-strgbx | awk '{ print "osd."$1, "size: "$5, "usage: " $17, "reweight: "$4 }' | sort -nk5 | grep -v 7.3 | tail
osd.237 size: 1.6 usage: 77.80 reweight: 1.00000
osd.250 size: 1.6 usage: 77.81 reweight: 0.89999
osd.124 size: 1.6 usage: 77.89 reweight: 1.00000
osd.216 size: 1.6 usage: 78.45 reweight: 1.00000
osd.50 size: 1.6 usage: 78.49 reweight: 0.89999
osd.101 size: 1.6 usage: 78.72 reweight: 1.00000
osd.105 size: 1.6 usage: 79.20 reweight: 1.00000
osd.136 size: 1.6 usage: 79.47 reweight: 1.00000
osd.204 size: 1.6 usage: 80.24 reweight: 1.00000
osd.264 size: 1.6 usage: 83.17 reweight: 1.00000
This is the relevant ceph-mgr/balancer configuration:
root@ld3955:~# ceph config-key dump | grep balancer | grep -v config-history
"config/mgr/mgr/balancer/active": "true",
"config/mgr/mgr/balancer/mode": "upmap",
"config/mgr/mgr/balancer/pool_ids": "11",
"config/mgr/mgr/balancer/upmap_max_iterations": "20",
"mgr/balancer/max_misplaced:": "0.01",
"mgr/balancer/upmap_max_iterations": "20",
Running a manual optimization with osdmaptool returns the same result:
2020-01-08 17:30:12.398 7fbae3123ac0 10 failed to find any changes for overfull osds
2020-01-08 17:30:12.398 7fbae3123ac0 10 failed to find any changes for underfull osds
2020-01-08 17:30:12.398 7fbae3123ac0 10 break due to not being able to find any further optimizations
2020-01-08 17:30:12.402 7fbae3123ac0 10 num_changed = 0
no upmaps proposed
Considering these facts, I would conclude that the balancer's algorithm is not working properly.
Related issues
Related to Bug #43752: Master tracker for upmap performance improvements
History
#1 Updated by Neha Ojha almost 4 years ago
- Assignee set to David Zafman
#2 Updated by David Zafman almost 4 years ago
I don't think you should have any reweight set in combination with upmap balancing.
Can you attach a copy of your OSDMap? (ceph osd getmap > myosdmap)
#3 Updated by Thomas Schneider almost 4 years ago
- File cmonty14_osdmap added
Hi David,
I agree with your recommendation regarding reweight; however, I don't know of any other way to set the reweight of osd.50 back to 1.0 without putting the pool into an unusable state because of a full OSD.
Can I manually move PGs from one OSD to another?
OSDMap "cmonty14_osdmap" is attached.
Regards
Thomas
#4 Updated by Dan van der Ster almost 4 years ago
Each time the balancer runs, it picks a random pool and tries to calculate upmaps for it.
I tested a few pools from your osdmap, and it is able to find some improvements:
# osdmaptool cmonty14_osdmap --upmap - --upmap-pool hdd
osdmaptool: osdmap file 'cmonty14_osdmap'
checking for upmap cleanups
upmap, max-count 100, max deviation 0.01
 limiting to pools hdd (59)
ceph osd rm-pg-upmap-items 59.171
ceph osd rm-pg-upmap-items 59.244
ceph osd rm-pg-upmap-items 59.2c8
ceph osd rm-pg-upmap-items 59.3f4
# osdmaptool cmonty14_osdmap --upmap - --upmap-pool ssd
osdmaptool: osdmap file 'cmonty14_osdmap'
checking for upmap cleanups
upmap, max-count 100, max deviation 0.01
 limiting to pools ssd (66)
ceph osd rm-pg-upmap-items 66.9
ceph osd rm-pg-upmap-items 66.6d
ceph osd rm-pg-upmap-items 66.9f
ceph osd rm-pg-upmap-items 66.152
ceph osd rm-pg-upmap-items 66.1b1
ceph osd rm-pg-upmap-items 66.300
ceph osd rm-pg-upmap-items 66.347
ceph osd rm-pg-upmap-items 66.39c
ceph osd rm-pg-upmap-items 66.3ee
ceph osd pg-upmap-items 66.22 11 12
ceph osd pg-upmap-items 66.127 11 12
ceph osd pg-upmap-items 66.1fd 11 12
ceph osd pg-upmap-items 66.347 13 12
However, it's veeeery slow for your massive pool, spinning at 100% CPU long enough for me to ctrl-c:
# osdmaptool cmonty14_osdmap --upmap - --upmap-pool hdb_backup
osdmaptool: osdmap file 'cmonty14_osdmap'
checking for upmap cleanups
upmap, max-count 100, max deviation 0.01
 limiting to pools hdb_backup (11)
...
So I think this is the root problem here. I expect you probably have ceph-mgrs hanging and failing over when the balancer is enabled?
If I use --osd_calc_pg_upmaps_aggressively=0, then the osdmaptool upmap finishes quickly but with `no upmaps proposed`.
You can run `osdmaptool --debug_osd=20 --upmap - --upmap-pool hdb_backup` to see the upmap debug output.
This needs a look to see why the heuristic is so CPU-intensive for your map.
#5 Updated by Thomas Schneider almost 4 years ago
Hi,
It's true that pools hdd and ssd can be optimized.
However, these pools are negligible compared to pool hdb_backup with regards to the number of PGs, utilisation, etc.
And I actually haven't had any balancer-related issues with ceph-mgr since the upgrade to 14.2.6.
My assumption is that this issue is related to the OSDs that are "assigned" to pool hdb_backup.
There are
48 + 48 + 48 OSDs of 8.00 TB each = 1152 TB
48 + 48 + 48 + 48 OSDs of 1.80 TB each = 345.6 TB
and that is, to put it kindly, not really an equal distribution.
However, I was expecting that the balancer could handle this with an intelligent algorithm.
#6 Updated by Jonas Jelten almost 4 years ago
I can confirm that something is off there. One of my clusters running 14.2.2 has perfectly balanced pools on 2, 3, 8 and 12 TB OSDs. A different cluster running 14.2.6 with more hosts and more OSDs of 1, 3, 4, 8 and 12 TB is very unbalanced.
Both have enough PGs, but the second cluster is nearly full because some OSDs are at 89% capacity while others are at 37%. Any advice on how the balancer can be motivated to actually balance? :)
#7 Updated by David Zafman almost 4 years ago
Jonas Jelten wrote:
I can confirm that something is off there. One of my clusters running 14.2.2 has perfectly balanced pools on 2, 3, 8 and 12 TB OSDs. A different cluster running 14.2.6 with more hosts and more OSDs of 1, 3, 4, 8 and 12 TB is very unbalanced.
Both have enough PGs, but the second cluster is nearly full because some OSDs are at 89% capacity while others are at 37%. Any advice on how the balancer can be motivated to actually balance? :)
A backport of an update to the balancer will appear in v14.2.7. The balancer doesn't balance capacity but rather the number of PGs. With a random distribution of objects of different sizes, it is always possible that some OSDs will use more space than others. Hopefully not to the degree you are seeing, if the PGs are balanced.
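As an illustration of that point (my own sketch, not from this thread; the JSON field names are assumptions that may vary by release), you can compare the per-OSD PG count, which the balancer equalizes, with the utilization percentage, which it does not directly control:
# Sketch: list each OSD with its PG count and % utilization; assumes the JSON
# fields "pgs" and "utilization" are present in this release's `ceph osd df` output.
ceph osd df -f json | jq -r '.nodes[] | [.name, .pgs, .utilization] | @tsv' | sort -k2 -n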
#8 Updated by David Zafman almost 4 years ago
Thomas Schneider wrote:
Hi David,
I agree with your recommendation regarding reweight; however, I don't know of any other way to set the reweight of osd.50 back to 1.0 without putting the pool into an unusable state because of a full OSD.
Can I manually move PGs from one OSD to another? OSDMap "cmonty14_osdmap" is attached.
What I meant was that the "reweight_set" values, which are used by crush-compat mode balancing, should be cleared. But I'm not sure that is really necessary.
I ran your osdmap against the latest code. It took 255 rounds with 10 upmap changes each to balance to within 5 PGs. The update will appear in v14.2.7.
If you are using or have used crush-compat balancing, you could try switching the mode to upmap once you install v14.2.7. Watch closely to see if the upmaps are working. You could simulate the behavior by running osdmaptool to see what will happen:
osdmaptool --upmap cmonty.out --upmap-active cmonty14_osdmap
cmonty.out will contain the requested upmap changes; other information will go to stdout.
The longest computation was 1.26764 seconds, so the ceph-mgr shouldn't be too CPU-bound, since that only happens once a minute.
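One usage note (my reading of osdmaptool's behavior, not stated explicitly in this thread): the file written by --upmap is a list of plain `ceph osd pg-upmap-items ...` commands, so after reviewing it you can apply it as a shell script:
# Review the proposed mappings first, then apply them by executing the generated commands.
cat cmonty.out
bash cmonty.out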
#9 Updated by Thomas Schneider almost 4 years ago
Hi,
I enabled upmap at a very early stage; however, I cannot rule out that some OSDs still use crush-compat.
Please advise how to verify this.
I then ran
osdmaptool cmonty14_osdmap --upmap out_hdb_backup_$(date +"%d-%m-%Y_%H-%M") --upmap-pool hdb_backup --debug-osd 20
which finished with this output:
2020-01-24 08:51:23.119 7f4da6ab4ac0 10 trying 11.3e86
2020-01-24 08:51:23.119 7f4da6ab4ac0 10 trying 11.f50
2020-01-24 08:51:23.119 7f4da6ab4ac0 10 trying 11.3562
2020-01-24 08:51:23.119 7f4da6ab4ac0 10 trying 11.3955
2020-01-24 08:51:23.119 7f4da6ab4ac0 10 trying 11.25f2
2020-01-24 08:51:23.119 7f4da6ab4ac0 10 trying 11.1025
2020-01-24 08:51:23.119 7f4da6ab4ac0 10 trying 11.e13
2020-01-24 08:51:23.119 7f4da6ab4ac0 10 trying 11.ab4
2020-01-24 08:51:23.119 7f4da6ab4ac0 10 trying 11.34f9
2020-01-24 08:51:23.119 7f4da6ab4ac0 10 trying 11.b70
2020-01-24 08:51:23.119 7f4da6ab4ac0 10 trying 11.4b7
2020-01-24 08:51:23.119 7f4da6ab4ac0 10 trying 11.25c7
2020-01-24 08:51:23.119 7f4da6ab4ac0 10 trying 11.1815
2020-01-24 08:51:23.119 7f4da6ab4ac0 10 osd.400 target 262.934 deviation 0.0664368 -> ratio 0.000252675 < max ratio 0.01
2020-01-24 08:51:23.119 7f4da6ab4ac0 10 failed to find any changes for overfull osds
2020-01-24 08:51:23.123 7f4da6ab4ac0 10 osd.321 target 262.934 deviation -1.93356 -> absolute ratio 0.00735381 < max ratio 0.01
2020-01-24 08:51:23.123 7f4da6ab4ac0 10 failed to find any changes for underfull osds
2020-01-24 08:51:23.123 7f4da6ab4ac0 10 break due to not being able to find any further optimizations
2020-01-24 08:51:23.127 7f4da6ab4ac0 10 num_changed = 0
no upmaps proposed
Fri Jan 24 08:51:23 CET 2020
I had disabled the balancer beforehand!
root@ld3955:~# date && time ceph balancer status
Fri Jan 24 08:47:42 CET 2020
{
"last_optimize_duration": "0:02:32.995319",
"plans": [],
"mode": "upmap",
"active": false,
"optimize_result": "Unable to find further optimization, or pool(s)' pg_num is decreasing, or distribution is already perfect",
"last_optimize_started": "Fri Jan 24 08:45:04 2020"
}
real 0m0,280s
user 0m0,228s
sys 0m0,018s
#10 Updated by Dan van der Ster almost 4 years ago
Thomas Schneider wrote:
I enabled upmap at a very early stage; however, I cannot rule out that some OSDs still use crush-compat.
Please advise how to verify this.
ceph osd crush weight-set ls
If you have one, don't just run `ceph osd crush weight-set rm-compat`, because it will likely trigger a lot of data movement.
#11 Updated by Thomas Schneider almost 4 years ago
With regards to the weight-set, I remember that I executed this command in the past:
ceph osd crush weight-set rm-compat
#12 Updated by Thomas Schneider almost 4 years ago
root@ld3955:~# time ceph osd crush weight-set ls && date
real 0m0,264s
user 0m0,192s
sys 0m0,049s
Fri Jan 24 09:23:32 CET 2020
There's no output.
#13 Updated by David Zafman almost 4 years ago
Thomas:
Unless you want to install a development build, you'll need to wait for the v14.2.7 Nautilus release.
#14 Updated by Thomas Schneider almost 4 years ago
I can wait.
However, can you please share some details about the changes to the balancer in v14.2.7?
I'm asking because there were statements in the past that the balancer issue would be fixed with each v14.2.x release.
#15 Updated by David Zafman almost 4 years ago
Thomas Schneider wrote:
I can wait.
However, can you please share some details about the changes to the balancer in v14.2.7?
I'm asking because there were statements in the past that the balancer issue would be fixed with each v14.2.x release.
There is a bug fix that addresses the CPU load and the algorithm's occasional inability to make progress. There are also algorithm improvements, such as considering more low-PG OSDs when trying to relieve a high-PG OSD. We also process one pool at a time, which avoids some spinning caused by potentially large numbers of PGs. In addition, by default we only balance to within 5 PGs of the mean for each pool on each OSD.
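Assuming the post-14.2.7 tunables (the option and flag names below are my assumptions, not confirmed in this thread), the new per-PG deviation target can be adjusted and tested offline against a saved map:
# Lower the default "within 5 PGs of the mean" target to 1 PG (more even, but more data movement).
ceph config set mgr mgr/balancer/upmap_max_deviation 1
# Dry-run the same calculation offline; 'myosdmap' is the map saved earlier with `ceph osd getmap`.
osdmaptool myosdmap --upmap out.txt --upmap-deviation 1 --upmap-pool hdb_backup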
#16 Updated by David Zafman almost 4 years ago
- Related to Bug #43752: Master tracker for upmap performance improvements added
#17 Updated by Nathan Cutler over 3 years ago
Thomas Schneider wrote:
I can wait.
However, can you please share some details about the changes to the balancer in v14.2.7?
I'm asking because there were statements in the past that the balancer issue would be fixed with each v14.2.x release.
I suppose the fixes @David referred to are in https://github.com/ceph/ceph/pull/31956, which went into the Nautilus v14.2.8 release.