Bug #36301
mgr/balancer: KeyError during balancer eval if pool migrating between roots
Status:
New
Priority:
Normal
Assignee:
-
Category:
balancer module
Target version:
-
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
If we try to run `ceph balancer eval` while a pool is migrating data between roots, this error will occur:
# ceph balancer eval
Error EINVAL: Traceback (most recent call last):
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 321, in handle_command
    return (0, self.evaluate(ms, pools, verbose=verbose), '')
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 620, in evaluate
    pe = self.calc_eval(ms, pools)
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 507, in calc_eval
    pgs_by_osd[osd] += 1
KeyError: (1056,)
A fix for this would be:
diff --git a/src/pybind/mgr/balancer/module.py b/src/pybind/mgr/balancer/module.py
index ca090516c9..faaa5b448e 100644
--- a/src/pybind/mgr/balancer/module.py
+++ b/src/pybind/mgr/balancer/module.py
@@ -525,7 +525,11 @@ class Module(MgrModule):
                 for osd in [int(osd) for osd in up]:
                     if osd == CRUSHMap.ITEM_NONE:
                         continue
-                    pgs_by_osd[osd] += 1
+                    try:
+                        pgs_by_osd[osd] += 1
+                    except KeyError:
+                        # this can occur if the cluster is migrating pgs between roots
+                        pgs_by_osd[osd] = 1
                     objects_by_osd[osd] += ms.pg_stat[pgid]['num_objects']
                     bytes_by_osd[osd] += ms.pg_stat[pgid]['num_bytes']
                     # pick a root to associate this pg instance with.
but I don't know if this is sufficient.
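On the sufficiency question, one thing stands out in the diff: the guard covers only `pgs_by_osd`, while the next two context lines index `objects_by_osd` and `bytes_by_osd` with the same `osd`. Assuming those are ordinary dicts seeded with the same OSD ids as `pgs_by_osd` (the seeding is not visible in the diff, so this is a guess), the `KeyError` would just move one line down. Below is a standalone sketch, not the real module: the function names, `known_osds`, and the sample data are hypothetical, and `ITEM_NONE` is a stand-in for `CRUSHMap.ITEM_NONE`.

```python
from collections import defaultdict

# Stand-in for CRUSHMap.ITEM_NONE; the value is an assumption for this sketch.
ITEM_NONE = 0x7FFFFFFF


def count_with_partial_guard(known_osds, pg_up, pg_stat):
    """Mimics the proposed patch: only pgs_by_osd tolerates unknown OSDs."""
    pgs_by_osd = {osd: 0 for osd in known_osds}
    objects_by_osd = {osd: 0 for osd in known_osds}
    for pgid, up in pg_up.items():
        for osd in [int(o) for o in up]:
            if osd == ITEM_NONE:
                continue
            try:
                pgs_by_osd[osd] += 1
            except KeyError:
                pgs_by_osd[osd] = 1
            # Still raises KeyError for an OSD that was never seeded.
            objects_by_osd[osd] += pg_stat[pgid]['num_objects']
    return pgs_by_osd, objects_by_osd


def count_per_osd(known_osds, pg_up, pg_stat):
    """Tallies PGs, objects and bytes per OSD; unknown OSDs start from 0."""
    pgs_by_osd = defaultdict(int, {osd: 0 for osd in known_osds})
    objects_by_osd = defaultdict(int, {osd: 0 for osd in known_osds})
    bytes_by_osd = defaultdict(int, {osd: 0 for osd in known_osds})
    for pgid, up in pg_up.items():
        for osd in [int(o) for o in up]:
            if osd == ITEM_NONE:
                continue
            pgs_by_osd[osd] += 1
            objects_by_osd[osd] += pg_stat[pgid]['num_objects']
            bytes_by_osd[osd] += pg_stat[pgid]['num_bytes']
    return dict(pgs_by_osd), dict(objects_by_osd), dict(bytes_by_osd)


if __name__ == '__main__':
    # Hypothetical data: OSD 1056 is in a PG's up set but not in the seeded set.
    up = {'1.0': [1, 1056]}
    stat = {'1.0': {'num_objects': 10, 'num_bytes': 4096}}
    print(count_per_osd([1, 2], up, stat))
```

If that assumption holds, guarding all three counters together (or seeding them when an unexpected OSD appears) would avoid the second failure. Whether counting such an OSD toward the pool's score is the right behavior during a migration is a separate question that this sketch does not answer.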
Any suggestions on how to test a change to an mgr module.py on a live cluster?