Bug #43586

open

mgr/balancer reports "Unable to find further optimization ...", but distribution is not perfect

Added by Thomas Schneider over 4 years ago. Updated almost 4 years ago.

Status:
New
Priority:
Normal
Assignee:
David Zafman
Category:
balancer module
Target version:
% Done:

0%

Source:
Tags:
balancer
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,
I'm running ceph-mgr 14.2.6 with balancer enabled.
This is the status of ceph balancer:

root@ld3955:~# date && time ceph balancer status
Mon Jan 13 11:00:09 CET 2020
{
    "last_optimize_duration": "0:02:30.984016",
    "plans": [],
    "mode": "upmap",
    "active": true,
    "optimize_result": "in progress",
    "last_optimize_started": "Mon Jan 13 10:58:35 2020" 
}

real    0m0,380s
user    0m0,216s
sys     0m0,031s
root@ld3955:~# date && time ceph balancer status
Mon Jan 13 14:06:20 CET 2020
{
    "last_optimize_duration": "0:02:32.787459",
    "plans": [],
    "mode": "upmap",
    "active": true,
    "optimize_result": "in progress",
    "last_optimize_started": "Mon Jan 13 14:04:37 2020" 
}

real    0m0,987s
user    0m0,228s
sys     0m0,027s
root@ld3955:~# date && time ceph balancer status
Mon Jan 13 15:19:48 CET 2020
{
    "last_optimize_duration": "0:02:33.119116",
    "plans": [],
    "mode": "upmap",
    "active": true,
    "optimize_result": "Unable to find further optimization, or pool(s)' pg_num is decreasing, or distribution is already perfect",
    "last_optimize_started": "Mon Jan 13 15:16:20 2020" 
}

real    0m0,268s
user    0m0,220s
sys     0m0,025s

Since release 14.2.6, the command returns its output within seconds (before, it took minutes).

However, this output
"optimize_result": "Unable to find further optimization, or pool(s)' pg_num is decreasing, or distribution is already perfect"
is not accurate.
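
For reference, the balancer can be asked to score the current distribution itself; a minimal check, assuming the Nautilus CLI, is:

# Prints something like "current cluster score <n> (lower is better)";
# a score well above zero contradicts "distribution is already perfect".
ceph balancer eval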

The data distribution across the 1.6 TB disks is extremely unbalanced:

root@ld3955:~# ceph osd df class hdd-strgbx  | awk '{ print "osd."$1, "size: "$5, "usage: " $17, "reweight: "$4 }' | sort -nk5 | grep -v 7.3 | head
osd.ID size: SIZE usage:  reweight: REWEIGHT
osd.MIN/MAX size: 12.42 usage:  reweight: STDDEV:
osd.TOTAL size: TiB usage:  reweight: 727
osd.205 size: 1.6 usage: 53.26 reweight: 1.00000
osd.100 size: 1.6 usage: 53.38 reweight: 1.00000
osd.243 size: 1.6 usage: 53.40 reweight: 1.00000
osd.255 size: 1.6 usage: 54.11 reweight: 1.00000
osd.154 size: 1.6 usage: 54.14 reweight: 1.00000
osd.106 size: 1.6 usage: 54.19 reweight: 1.00000
osd.262 size: 1.6 usage: 54.20 reweight: 1.00000
root@ld3955:~# ceph osd df class hdd-strgbx  | awk '{ print "osd."$1, "size: "$5, "usage: " $17, "reweight: "$4 }' | sort -nk5 | grep -v 7.3 | tail
osd.237 size: 1.6 usage: 77.80 reweight: 1.00000
osd.250 size: 1.6 usage: 77.81 reweight: 0.89999
osd.124 size: 1.6 usage: 77.89 reweight: 1.00000
osd.216 size: 1.6 usage: 78.45 reweight: 1.00000
osd.50 size: 1.6 usage: 78.49 reweight: 0.89999
osd.101 size: 1.6 usage: 78.72 reweight: 1.00000
osd.105 size: 1.6 usage: 79.20 reweight: 1.00000
osd.136 size: 1.6 usage: 79.47 reweight: 1.00000
osd.204 size: 1.6 usage: 80.24 reweight: 1.00000
osd.264 size: 1.6 usage: 83.17 reweight: 1.00000
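
The same spread can be computed without guessing awk columns; a quick sketch, assuming the JSON layout of ceph osd df (a "nodes" array with a "utilization" field per OSD):

# Report min/max utilization and their spread for this device class
ceph osd df class hdd-strgbx -f json \
    | jq '[.nodes[].utilization] | {min: min, max: max, spread: (max - min)}'

With usage ranging from ~53% to ~83%, the spread is roughly 30 percentage points.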

This is the relevant ceph-mgr/balancer configuration:

root@ld3955:~# ceph config-key dump | grep balancer | grep -v config-history
    "config/mgr/mgr/balancer/active": "true",
    "config/mgr/mgr/balancer/mode": "upmap",
    "config/mgr/mgr/balancer/pool_ids": "11",
    "config/mgr/mgr/balancer/upmap_max_iterations": "20",
    "mgr/balancer/max_misplaced:": "0.01",
    "mgr/balancer/upmap_max_iterations": "20",

Executing manual optimization with osdmaptool returns the same result:

2020-01-08 17:30:12.398 7fbae3123ac0 10  failed to find any changes for overfull osds
2020-01-08 17:30:12.398 7fbae3123ac0 10  failed to find any changes for underfull osds
2020-01-08 17:30:12.398 7fbae3123ac0 10  break due to not being able to find any further optimizations
2020-01-08 17:30:12.402 7fbae3123ac0 10  num_changed = 0
no upmaps proposed
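
The same deviation knob can be tried offline; a sketch, assuming the Nautilus osdmaptool flags (--upmap, --upmap-pool, --upmap-deviation) and the usual --debug-osd switch to reproduce the log above, with <pool> standing in for the pool name:

ceph osd getmap -o cmonty14_osdmap
osdmaptool cmonty14_osdmap --upmap out.txt \
    --upmap-pool <pool> --upmap-deviation 1 --debug-osd 10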

Considering these facts, I would conclude that the balancer's algorithm is not working properly.


Files

cmonty14_osdmap (200 KB) - Thomas Schneider, 01/16/2020 08:22 AM

Related issues (1 open, 0 closed)

Related to RADOS - Bug #43752: Master tracker for upmap performance improvements (Status: New, Assignee: David Zafman)
