Bug #48309
a few scrubs or remapped PGs blocks the upmap balancer
% Done: 0%
Backport: octopus
Regression: Yes
Severity: 2 - major
Description
In the balancer/module.py we have
left = max_optimizations
...
available = left - (num_pg - num_pg_active_clean)
did = plan.osdmap.calc_pg_upmaps(inc, max_deviation, available, [pool])
The intention is to avoid balancing PGs that are scrubbing, but this calculation of `available` means that as soon as the number of PGs not in the "active+clean" state reaches the optimization budget (10 by default), even out of thousands of PGs, `available` drops to zero or below and there is no balancing at all; instead you get:
2020-11-20T11:28:28.966+0100 7fcee27a0700 -1 calc_pg_upmaps abort due to max <= 0
For example, in our cluster right now:
8493 active+clean
20 active+remapped+backfilling
and with `mgr/balancer/upmap_max_optimizations = 20`, `available` works out to 20 - 20 = 0, so no more balancing is triggered (max = 0).
This behaviour was introduced in commit 86444dbfe5d8478530182116507d034d9e180f5e. It wasn't backported to nautilus, which is why we didn't notice until now.
I will send a PR with this fix, which I think was the intention here:
diff --git a/src/pybind/mgr/balancer/module.py b/src/pybind/mgr/balancer/module.py
index 5fffe01dcb..dd9fac46ec 100644
--- a/src/pybind/mgr/balancer/module.py
+++ b/src/pybind/mgr/balancer/module.py
@@ -1023,7 +1023,7 @@ class Module(MgrModule):
                 if s['state_name'] == 'active+clean':
                     num_pg_active_clean += s['count']
                     break
-            available = left - (num_pg - num_pg_active_clean)
+            available = min(left, num_pg_active_clean)
             did = plan.osdmap.calc_pg_upmaps(inc, max_deviation, available, [pool])
             total_did += did
             left -= did
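To make the difference concrete, here is a minimal standalone sketch that compares the two expressions using the numbers from this report. The helper names `available_old` and `available_new` are hypothetical and exist only for illustration; they are not part of the balancer module. Treating the two PG counts as belonging to a single pool is also an assumption.

```python
def available_old(left, num_pg, num_pg_active_clean):
    """Pre-fix expression: every PG that is not active+clean reduces the budget."""
    return left - (num_pg - num_pg_active_clean)


def available_new(left, num_pg_active_clean):
    """Post-fix expression: the budget is only capped by the active+clean count."""
    return min(left, num_pg_active_clean)


# Numbers taken from the report; treating them as one pool is an assumption.
left = 20                    # mgr/balancer/upmap_max_optimizations
num_pg_active_clean = 8493   # PGs in active+clean
num_pg = 8493 + 20           # plus the active+remapped+backfilling PGs

# Old expression: 20 - 20 = 0, which triggers "abort due to max <= 0".
print(available_old(left, num_pg, num_pg_active_clean))

# New expression: min(20, 8493) = 20, so balancing can proceed.
print(available_new(left, num_pg_active_clean))
```

With the old expression, a handful of remapped or scrubbing PGs is enough to use up the whole budget; with the new one, the budget only shrinks when fewer PGs are active+clean than `left`.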
Related issues
History
#1 Updated by Dan van der Ster over 2 years ago
- Pull request ID set to 38206
#2 Updated by Kefu Chai over 2 years ago
- Status changed from New to Fix Under Review
- Assignee set to Dan van der Ster
#3 Updated by Kefu Chai over 2 years ago
- Status changed from Fix Under Review to Pending Backport
#4 Updated by Konstantin Shalygin over 2 years ago
- Copied to Backport #48399: octopus: a few scrubs or remapped PGs blocks the upmap balancer added
#5 Updated by Nathan Cutler about 2 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".