Bug #43586

mgr/balancer reports "Unable to find further optimization ...", but distribution is not perfect

Added by Thomas Schneider 3 months ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assignee:
Category: balancer module
Target version:
% Done: 0%
Source:
Tags: balancer
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Hi,
I'm running ceph-mgr 14.2.6 with balancer enabled.
This is the status of ceph balancer:

root@ld3955:~# date && time ceph balancer status
Mon Jan 13 11:00:09 CET 2020
{
    "last_optimize_duration": "0:02:30.984016",
    "plans": [],
    "mode": "upmap",
    "active": true,
    "optimize_result": "in progress",
    "last_optimize_started": "Mon Jan 13 10:58:35 2020" 
}

real    0m0,380s
user    0m0,216s
sys     0m0,031s
root@ld3955:~# date && time ceph balancer status
Mon Jan 13 14:06:20 CET 2020
{
    "last_optimize_duration": "0:02:32.787459",
    "plans": [],
    "mode": "upmap",
    "active": true,
    "optimize_result": "in progress",
    "last_optimize_started": "Mon Jan 13 14:04:37 2020" 
}

real    0m0,987s
user    0m0,228s
sys     0m0,027s
root@ld3955:~# date && time ceph balancer status
Mon Jan 13 15:19:48 CET 2020
{
    "last_optimize_duration": "0:02:33.119116",
    "plans": [],
    "mode": "upmap",
    "active": true,
    "optimize_result": "Unable to find further optimization, or pool(s)' pg_num is decreasing, or distribution is already perfect",
    "last_optimize_started": "Mon Jan 13 15:16:20 2020" 
}

real    0m0,268s
user    0m0,220s
sys     0m0,025s

Since release 14.2.6 the command returns output within seconds (before it was minutes).

However, this output
"optimize_result": "Unable to find further optimization, or pool(s)' pg_num is decreasing, or distribution is already perfect"
is not accurate.

The data distribution of 1.6TB disks is extremely unbalanced:

root@ld3955:~# ceph osd df class hdd-strgbx  | awk '{ print "osd."$1, "size: "$5, "usage: " $17, "reweight: "$4 }' | sort -nk5 | grep -v 7.3 | head
osd.ID size: SIZE usage:  reweight: REWEIGHT
osd.MIN/MAX size: 12.42 usage:  reweight: STDDEV:
osd.TOTAL size: TiB usage:  reweight: 727
osd.205 size: 1.6 usage: 53.26 reweight: 1.00000
osd.100 size: 1.6 usage: 53.38 reweight: 1.00000
osd.243 size: 1.6 usage: 53.40 reweight: 1.00000
osd.255 size: 1.6 usage: 54.11 reweight: 1.00000
osd.154 size: 1.6 usage: 54.14 reweight: 1.00000
osd.106 size: 1.6 usage: 54.19 reweight: 1.00000
osd.262 size: 1.6 usage: 54.20 reweight: 1.00000
root@ld3955:~# ceph osd df class hdd-strgbx  | awk '{ print "osd."$1, "size: "$5, "usage: " $17, "reweight: "$4 }' | sort -nk5 | grep -v 7.3 | tail
osd.237 size: 1.6 usage: 77.80 reweight: 1.00000
osd.250 size: 1.6 usage: 77.81 reweight: 0.89999
osd.124 size: 1.6 usage: 77.89 reweight: 1.00000
osd.216 size: 1.6 usage: 78.45 reweight: 1.00000
osd.50 size: 1.6 usage: 78.49 reweight: 0.89999
osd.101 size: 1.6 usage: 78.72 reweight: 1.00000
osd.105 size: 1.6 usage: 79.20 reweight: 1.00000
osd.136 size: 1.6 usage: 79.47 reweight: 1.00000
osd.204 size: 1.6 usage: 80.24 reweight: 1.00000
osd.264 size: 1.6 usage: 83.17 reweight: 1.00000
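
To put a number on the imbalance, here is a minimal Python sketch that computes the gap between the least and most utilised OSDs. The values are copied from the two listings above; since those listings only show the extremes (7 lowest, 10 highest), the sketch reports the range and does not attempt a mean or standard deviation. The function name `spread` is just an illustrative choice.

```python
# %USE values copied from the head/tail listings above (extremes only).
lowest = [53.26, 53.38, 53.40, 54.11, 54.14, 54.19, 54.20]
highest = [77.80, 77.81, 77.89, 78.45, 78.49, 78.72, 79.20, 79.47, 80.24, 83.17]


def spread(values):
    """Return (min, max, max - min) for a list of utilisation percentages."""
    return min(values), max(values), max(values) - min(values)


low, high, gap = spread(lowest + highest)
print(f"min {low:.2f}%  max {high:.2f}%  gap {gap:.2f} percentage points")
```

For these numbers the gap is roughly 30 percentage points between osd.205 and osd.264.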

This is the relevant ceph-mgr/balancer configuration:

root@ld3955:~# ceph config-key dump | grep balancer | grep -v config-history
    "config/mgr/mgr/balancer/active": "true",
    "config/mgr/mgr/balancer/mode": "upmap",
    "config/mgr/mgr/balancer/pool_ids": "11",
    "config/mgr/mgr/balancer/upmap_max_iterations": "20",
    "mgr/balancer/max_misplaced:": "0.01",
    "mgr/balancer/upmap_max_iterations": "20",
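
One detail in the dump above may be worth a second look: the key `mgr/balancer/max_misplaced:` ends in a colon. The following sketch simply flags keys whose last character is not alphanumeric or an underscore. This is a hypothetical lint rule, not something Ceph itself applies, and whether the mgr actually reads that key is not established in this report.

```python
# Keys copied from the `ceph config-key dump` output above.
dumped_keys = [
    "config/mgr/mgr/balancer/active",
    "config/mgr/mgr/balancer/mode",
    "config/mgr/mgr/balancer/pool_ids",
    "config/mgr/mgr/balancer/upmap_max_iterations",
    "mgr/balancer/max_misplaced:",
    "mgr/balancer/upmap_max_iterations",
]


def odd_keys(keys):
    """Return keys ending in a character other than a letter, digit or '_'.

    Hypothetical check for illustration; it only looks at the key spelling.
    """
    return [k for k in keys if not (k[-1].isalnum() or k[-1] == "_")]


print(odd_keys(dumped_keys))
```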

Executing manual optimization with osdmaptool returns the same result:

2020-01-08 17:30:12.398 7fbae3123ac0 10  failed to find any changes for overfull osds
2020-01-08 17:30:12.398 7fbae3123ac0 10  failed to find any changes for underfull osds
2020-01-08 17:30:12.398 7fbae3123ac0 10  break due to not being able to find any further optimizations
2020-01-08 17:30:12.402 7fbae3123ac0 10  num_changed = 0
no upmaps proposed

Considering these facts I would conclude that balancer's algorithm is not working properly.

cmonty14_osdmap (200 KB) Thomas Schneider, 01/16/2020 08:22 AM


Related issues

Related to RADOS - Bug #43752: Master tracker for upmap performance improvements (Status: New)

History

#1 Updated by Neha Ojha 3 months ago

  • Assignee set to David Zafman

#2 Updated by David Zafman 3 months ago

I don't think you should have any reweight set in combination with upmap balancing.

Can you attach a copy of your OSDMap? (ceph osd getmap > myosdmap)

#3 Updated by Thomas Schneider 3 months ago

Hi David,

I do agree with your recommendation regarding setting reweight, however I don't know of any other option that would allow me to set reweight of osd.50 to 1.0 without putting the pool in a non-usable state because of full OSD.
Can I manually move PGs from one OSD to another?

OSDMap "cmonty14_osdmap" is attached.

Regards
Thomas

#4 Updated by Dan van der Ster 2 months ago

Each time the balancer runs it picks a random pool to try to calculate the upmaps.

I tested a few pools from your osdmap and it is able to find some improvements:

# osdmaptool cmonty14_osdmap --upmap - --upmap-pool hdd
osdmaptool: osdmap file 'cmonty14_osdmap'
checking for upmap cleanups
upmap, max-count 100, max deviation 0.01
 limiting to pools hdd (59)
ceph osd rm-pg-upmap-items 59.171
ceph osd rm-pg-upmap-items 59.244
ceph osd rm-pg-upmap-items 59.2c8
ceph osd rm-pg-upmap-items 59.3f4
# osdmaptool cmonty14_osdmap --upmap - --upmap-pool ssd
osdmaptool: osdmap file 'cmonty14_osdmap'
checking for upmap cleanups
upmap, max-count 100, max deviation 0.01
 limiting to pools ssd (66)
ceph osd rm-pg-upmap-items 66.9
ceph osd rm-pg-upmap-items 66.6d
ceph osd rm-pg-upmap-items 66.9f
ceph osd rm-pg-upmap-items 66.152
ceph osd rm-pg-upmap-items 66.1b1
ceph osd rm-pg-upmap-items 66.300
ceph osd rm-pg-upmap-items 66.347
ceph osd rm-pg-upmap-items 66.39c
ceph osd rm-pg-upmap-items 66.3ee
ceph osd pg-upmap-items 66.22 11 12
ceph osd pg-upmap-items 66.127 11 12
ceph osd pg-upmap-items 66.1fd 11 12
ceph osd pg-upmap-items 66.347 13 12

However, it's veeeery slow for your massive pool, spinning 100% cpu long enough for me to ctrl-c:

# osdmaptool cmonty14_osdmap --upmap - --upmap-pool hdb_backup
osdmaptool: osdmap file 'cmonty14_osdmap'
checking for upmap cleanups
upmap, max-count 100, max deviation 0.01
 limiting to pools hdb_backup (11)
...

So I think this is the root problem here. I suspect your ceph-mgr daemons are hanging and failing over when the balancer is enabled?

If I use --osd_calc_pg_upmaps_aggressively=0, then the osdmaptool upmap finishes quickly but with `no upmaps proposed`.

You can run `osdmaptool --debug_osd=20 --upmap - --upmap-pool hdb_backup` to see the upmap debug output.
This needs a look to see why the heuristic is so cpu intensive for your map.

#5 Updated by Thomas Schneider 2 months ago

Hi,
it's true that pools hdd and ssd can be optimized.
However these pools can be neglected compared to pool hdb_backup with regards to number of PGs, utilisation, etc.

And actually, since the upgrade to 14.2.6 I don't have any ceph-mgr issues caused by the balancer.

My assumption is that this issue is related to the OSDs that are "assigned" to pool hdb_backup.
There are
48 + 48 + 48 OSDs à 8.00TB = 1152TB
48 + 48 + 48 + 48 OSDs à 1.80TB = 345.6TB
and this is, to put it mildly, not really an equal distribution.
However, I was expecting the balancer to handle this with an intelligent algorithm.
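
The capacity arithmetic above can be checked with a short Python sketch. The OSD counts and sizes are taken from this comment. The PG ratio at the end rests on an assumption that is not stated in the thread: that each OSD's CRUSH weight is proportional to its raw size, so an ideally balanced pool places PGs in proportion to size.

```python
# OSD counts and sizes as stated in the comment above.
large_count, large_tb = 48 * 3, 8.0   # 144 OSDs of 8.00 TB
small_count, small_tb = 48 * 4, 1.8   # 192 OSDs of 1.80 TB

large_total = large_count * large_tb  # 1152 TB
small_total = small_count * small_tb  # 345.6 TB
total = large_total + small_total

# Share of raw capacity provided by the small disks.
small_share = small_total / total

# Assumption: CRUSH weight is proportional to size, so a large OSD
# should ideally carry this many times the PGs of a small OSD.
pg_ratio = large_tb / small_tb

print(f"large: {large_total:.1f} TB, small: {small_total:.1f} TB")
print(f"small disks hold {small_share:.1%} of raw capacity")
print(f"ideal PG ratio large:small = {pg_ratio:.2f}")
```

Under that assumption the small disks contribute about 23% of the raw capacity, and each 8 TB OSD should carry roughly 4.4 times as many PGs as a 1.8 TB OSD.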

#6 Updated by Jonas Jelten 2 months ago

I can confirm that something is off there. One of my clusters running 14.2.2 has perfectly balanced pools on 2, 3, 8 and 12T-OSDs. A different cluster running 14.2.6 with more hosts and more OSDs of 1, 3, 4, 8 and 12T is very unbalanced.

Both have enough PGs, but the second cluster is nearly full because some OSDs are at 89% capacity, while others are at 37%. Any advice how the balancer can be motivated to actually balance? :)

#7 Updated by David Zafman 2 months ago

Jonas Jelten wrote:

I can confirm that something is off there. One of my clusters running 14.2.2 has perfectly balanced pools on 2, 3, 8 and 12T-OSDs. A different cluster running 14.2.6 with more hosts and more OSDs of 1, 3, 4, 8 and 12T is very unbalanced.

Both have enough PGs, but the second cluster is nearly full because some OSDs are at 89% capacity, while others are at 37%. Any advice how the balancer can be motivated to actually balance? :)

A backport of an update to the balancer will appear in v14.2.7. The balancer doesn't balance capacity but rather the number of PGs. It is always possible with random distribution of objects of different size that some OSDs will use more space than others. Hopefully, not to the degree you are seeing if the PGs are balanced.
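
A small illustration of the point that equal PG counts do not imply equal space usage. All numbers below are hypothetical (PG sizes in GiB, a 400 GiB OSD) and are chosen only to show the effect; they are not taken from the reporter's cluster.

```python
def osd_usage_percent(pg_sizes_gib, osd_size_gib):
    """Utilisation of one OSD, given the sizes of the PGs stored on it."""
    return 100.0 * sum(pg_sizes_gib) / osd_size_gib


# Hypothetical example: both OSDs hold exactly 4 PGs,
# but the PGs on osd_b happen to contain more data.
osd_a = [40, 42, 38, 41]
osd_b = [40, 55, 61, 58]
osd_size_gib = 400

usage_a = osd_usage_percent(osd_a, osd_size_gib)
usage_b = osd_usage_percent(osd_b, osd_size_gib)
print(f"osd_a: {len(osd_a)} PGs, {usage_a:.2f}% used")
print(f"osd_b: {len(osd_b)} PGs, {usage_b:.2f}% used")
```

A balancer that only compares PG counts would treat these two OSDs as perfectly balanced.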

#8 Updated by David Zafman 2 months ago

Thomas Schneider wrote:

Hi David,

I do agree with your recommendation regarding setting reweight, however I don't know of any other option that would allow me to set reweight of osd.50 to 1.0 without putting the pool in a non-usable state because of full OSD.
Can I manually move PGs from one OSD to another?

OSDMap "cmonty14_osdmap" is attached.

What I meant was the "reweight_set" values which are used by crush-compat mode balancing should be cleared. But I'm not sure that is really necessary.

I ran your osdmap against the latest code. It took 255 rounds with 10 upmap changes each to balance within 5 PGs. The update will appear in v14.2.7.

If you are using or have used crush-compat balancing, you could try switching the mode to upmap once you install v14.2.7. Watch closely to see if the upmaps are working. You could simulate the behavior by running the osdmaptool to see what will happen:

osdmaptool --upmap cmonty.out --upmap-active cmonty14_osdmap

cmonty.out will contain the requested upmap changes and other information will go to stdout.

The longest computation was 1.26764 seconds, so the ceph-mgr shouldn't be too cpu-bound since that only happens once a minute.

#9 Updated by Thomas Schneider 2 months ago

Hi,
I have enabled crush-upmap at a very early stage, however I cannot exclude that some OSDs still use crush-compat.
Please advise how to verify this.

Then I finished execution of

osdmaptool cmonty14_osdmap --upmap out_hdb_backup_$(date +"%d-%m-%Y_%H-%M") --upmap-pool hdb_backup --debug-osd 20

with this output:
2020-01-24 08:51:23.119 7f4da6ab4ac0 10  trying 11.3e86
2020-01-24 08:51:23.119 7f4da6ab4ac0 10  trying 11.f50
2020-01-24 08:51:23.119 7f4da6ab4ac0 10  trying 11.3562
2020-01-24 08:51:23.119 7f4da6ab4ac0 10  trying 11.3955
2020-01-24 08:51:23.119 7f4da6ab4ac0 10  trying 11.25f2
2020-01-24 08:51:23.119 7f4da6ab4ac0 10  trying 11.1025
2020-01-24 08:51:23.119 7f4da6ab4ac0 10  trying 11.e13
2020-01-24 08:51:23.119 7f4da6ab4ac0 10  trying 11.ab4
2020-01-24 08:51:23.119 7f4da6ab4ac0 10  trying 11.34f9
2020-01-24 08:51:23.119 7f4da6ab4ac0 10  trying 11.b70
2020-01-24 08:51:23.119 7f4da6ab4ac0 10  trying 11.4b7
2020-01-24 08:51:23.119 7f4da6ab4ac0 10  trying 11.25c7
2020-01-24 08:51:23.119 7f4da6ab4ac0 10  trying 11.1815
2020-01-24 08:51:23.119 7f4da6ab4ac0 10  osd.400 target 262.934 deviation 0.0664368 -> ratio 0.000252675 < max ratio 0.01
2020-01-24 08:51:23.119 7f4da6ab4ac0 10  failed to find any changes for overfull osds
2020-01-24 08:51:23.123 7f4da6ab4ac0 10  osd.321 target 262.934 deviation -1.93356 -> absolute ratio 0.00735381 < max ratio 0.01
2020-01-24 08:51:23.123 7f4da6ab4ac0 10  failed to find any changes for underfull osds
2020-01-24 08:51:23.123 7f4da6ab4ac0 10  break due to not being able to find any further optimizations
2020-01-24 08:51:23.127 7f4da6ab4ac0 10  num_changed = 0
no upmaps proposed
Fri Jan 24 08:51:23 CET 2020
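
The two "ratio" lines in this log can be reproduced with simple arithmetic. The sketch below assumes the ratio is the absolute PG deviation divided by the target PG count; that formula is inferred from the numbers in the log, not taken from the Ceph source. The name `deviation_ratio` is made up for this example.

```python
MAX_RATIO = 0.01  # "max ratio 0.01" as printed in the log above


def deviation_ratio(target, deviation):
    """Assumed formula: |deviation| / target (inferred from the log output)."""
    return abs(deviation) / target


# (target, deviation) pairs copied from the log lines for osd.400 and osd.321.
samples = {
    "osd.400": (262.934, 0.0664368),
    "osd.321": (262.934, -1.93356),
}

ratios = {name: deviation_ratio(t, d) for name, (t, d) in samples.items()}
for name, ratio in ratios.items():
    verdict = "within" if ratio < MAX_RATIO else "outside"
    print(f"{name}: ratio {ratio:.6g} is {verdict} max ratio {MAX_RATIO}")
```

If this reading is right, both OSDs are within 1% of their target PG count for this pool, which would explain why no upmaps are proposed even though the space utilisation differs widely.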

I had disabled the balancer beforehand!

root@ld3955:~# date && time ceph balancer status
Fri Jan 24 08:47:42 CET 2020
{
    "last_optimize_duration": "0:02:32.995319",
    "plans": [],
    "mode": "upmap",
    "active": false,
    "optimize_result": "Unable to find further optimization, or pool(s)' pg_num is decreasing, or distribution is already perfect",
    "last_optimize_started": "Fri Jan 24 08:45:04 2020" 
}

real    0m0,280s
user    0m0,228s
sys     0m0,018s

#10 Updated by Dan van der Ster 2 months ago

Thomas Schneider wrote:

I have enabled crush-upmap at a very early stage, however I cannot exclude that some OSDs still use crush-compat.
Please advise how to verify this.

ceph osd crush weight-set ls

If you have one, then don't just run `ceph osd crush weight-set rm-compat`, because it will likely trigger a lot of data movement.

#11 Updated by Thomas Schneider 2 months ago

With regard to weight-set, I remember that I executed this command in the past:

ceph osd crush weight-set rm-compat

#12 Updated by Thomas Schneider 2 months ago

root@ld3955:~# time ceph osd crush weight-set ls && date

real    0m0,264s
user    0m0,192s
sys     0m0,049s
Fri Jan 24 09:23:32 CET 2020

There's no output.

#13 Updated by David Zafman 2 months ago

Thomas:

Unless you want to install a development build you'll need to wait for the v14.2.7 Nautilus release.

#14 Updated by Thomas Schneider 2 months ago

I can wait.
However, can you please share some details about the changes in v14.2.7 with regards to balancer?
I'm asking this because there was communication in the past that with every release v14.2.x the balancer issue will be fixed.

#15 Updated by David Zafman about 2 months ago

Thomas Schneider wrote:

I can wait.
However, can you please share some details about the changes in v14.2.7 with regards to balancer?
I'm asking this because there was communication in the past that with every release v14.2.x the balancer issue will be fixed.

There is a bug fix that addresses the CPU load and the algorithm's occasional inability to make progress. There are also some algorithm improvements, such as considering more low-PG OSDs when trying to reduce a high-PG OSD. We also process one pool at a time, which avoids some spinning caused by potentially large numbers of PGs. In addition, by default we only balance to within 5 PGs of the mean for each pool on each OSD.
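
As a rough sketch of what "within 5 PGs" means in practice, the following Python function lists the OSDs whose PG count for one pool differs from the target by more than a given tolerance. The function name, the per-OSD counts and the use of a single shared target are assumptions for illustration; the target value 262.934 is borrowed from the log in comment #9, and this is not the balancer's actual code.

```python
def out_of_tolerance(pg_counts, targets, max_deviation=5):
    """Return {osd: deviation} for OSDs further than max_deviation PGs from target.

    Hypothetical helper: pg_counts and targets map OSD names to the PG count
    for a single pool and to that OSD's target PG count, respectively.
    """
    result = {}
    for osd, count in pg_counts.items():
        deviation = count - targets[osd]
        if abs(deviation) > max_deviation:
            result[osd] = deviation
    return result


# Hypothetical PG counts; the target is the value printed in comment #9.
pg_counts = {"osd.1": 270, "osd.2": 262, "osd.3": 255}
targets = {osd: 262.934 for osd in pg_counts}

flagged = out_of_tolerance(pg_counts, targets)
print(sorted(flagged))
```

With these made-up counts, osd.1 and osd.3 are more than 5 PGs away from the target and would still be candidates for upmap changes, while osd.2 would be left alone.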

#16 Updated by David Zafman about 1 month ago

  • Related to Bug #43752: Master tracker for upmap performance improvements added
