Bug #36301
openmgr/balancer: KeyError during balancer eval if pool migrating between roots
Status: New
Priority: Normal
Assignee: -
Category: balancer module
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
If we run `ceph balancer eval` while a pool is migrating data between roots, the following error occurs:
```
# ceph balancer eval
Error EINVAL: Traceback (most recent call last):
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 321, in handle_command
    return (0, self.evaluate(ms, pools, verbose=verbose), '')
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 620, in evaluate
    pe = self.calc_eval(ms, pools)
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 507, in calc_eval
    pgs_by_osd[osd] += 1
KeyError: (1056,)
```
A fix for this would be:
```diff
diff --git a/src/pybind/mgr/balancer/module.py b/src/pybind/mgr/balancer/module.py
index ca090516c9..faaa5b448e 100644
--- a/src/pybind/mgr/balancer/module.py
+++ b/src/pybind/mgr/balancer/module.py
@@ -525,7 +525,11 @@ class Module(MgrModule):
             for osd in [int(osd) for osd in up]:
                 if osd == CRUSHMap.ITEM_NONE:
                     continue
-                pgs_by_osd[osd] += 1
+                try:
+                    pgs_by_osd[osd] += 1
+                except KeyError:
+                    # this can occur if the cluster is migrating pgs between roots
+                    pgs_by_osd[osd] = 1
                 objects_by_osd[osd] += ms.pg_stat[pgid]['num_objects']
                 bytes_by_osd[osd] += ms.pg_stat[pgid]['num_bytes']
         # pick a root to associate this pg instance with.
```
but I don't know if this is sufficient.
Any suggestions on how to test an mgr `module.py` change on a live cluster?
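The failure mode is easy to reproduce in isolation: `pgs_by_osd` is seeded only with the OSDs belonging to the pool's current CRUSH root, so while data migrates between roots a PG's up set can reference an OSD that was never seeded. A minimal sketch of the guarded increment from the diff above, using hypothetical OSD IDs (not the real balancer state):

```python
def count_pgs(pgs_by_osd, up_sets):
    """Increment per-OSD PG counts, tolerating OSDs outside the seeded root."""
    for up in up_sets:
        for osd in up:
            try:
                pgs_by_osd[osd] += 1
            except KeyError:
                # OSD belongs to the other root mid-migration; start its count
                pgs_by_osd[osd] = 1
    return pgs_by_osd

# OSDs 0-2 are seeded from the pool's root; OSD 1056 is in the root
# being migrated away from, so a plain += would raise KeyError on it.
counts = count_pgs({0: 0, 1: 0, 2: 0}, [[0, 1], [2, 1056]])
```

Without the `try/except`, the second up set crashes exactly as in the traceback above.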
Updated by Dan van der Ster over 5 years ago
Obviously `objects_by_osd` and `bytes_by_osd` will need a similar try/except.
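An alternative that sidesteps all three KeyErrors at once would be to make the accumulators `defaultdict(int)`. This is only a sketch with hypothetical inputs, not the actual module code, and it has a caveat: the real `calc_eval` pre-seeds the dicts so that OSDs with zero PGs still appear in the eval report, which a bare defaultdict would lose.

```python
from collections import defaultdict

def accumulate(up_sets, pg_stats):
    """Per-OSD accumulators that tolerate unseeded OSDs.
    up_sets: {pgid: [osd, ...]}, pg_stats: {pgid: {'num_objects', 'num_bytes'}}.
    """
    pgs_by_osd = defaultdict(int)
    objects_by_osd = defaultdict(int)
    bytes_by_osd = defaultdict(int)
    for pgid, up in up_sets.items():
        for osd in up:
            # missing OSDs are created with count 0 instead of raising KeyError
            pgs_by_osd[osd] += 1
            objects_by_osd[osd] += pg_stats[pgid]['num_objects']
            bytes_by_osd[osd] += pg_stats[pgid]['num_bytes']
    return pgs_by_osd, objects_by_osd, bytes_by_osd
```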
Moving on, `ceph balancer optimize` (with upmap) crashes like this:
```
2018-10-03 16:19:01.437216 7f6e34949700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.8/rpm/el7/BUILD/ceph-12.2.8/src/osd/OSDMap.cc: In function 'int OSDMap::calc_pg_upmaps(CephContext*, float, int, const std::set<long int>&, OSDMap::Incremental*)' thread 7f6e34949700 time 2018-10-03 16:19:01.435308
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.8/rpm/el7/BUILD/ceph-12.2.8/src/osd/OSDMap.cc: 4102: FAILED assert(target > 0)
```
The comment on this code indicates this isn't going to work:
```cpp
// make sure osd is still there (belongs to this crush-tree)
assert(osd_weight.count(osd));
float target = osd_weight[osd] * pgs_per_weight;
assert(target > 0);
```
And to be clear, this occurs when we have pgs active+remapped+backfilling (moving from room=A to room=B).
Should we just fail more gracefully when PGs aren't where they are expected?
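One shape "failing gracefully" could take: treat an OSD that is missing from the weight map (or has a non-positive target) as "not in this tree right now" and skip it, rather than asserting. A Python sketch of that idea, mirroring the C++ `calc_pg_upmaps` check with hypothetical stand-ins for `osd_weight` and `pgs_per_weight`:

```python
def target_pgs(osd, osd_weight, pgs_per_weight):
    """Return the target PG count for osd, or None if the OSD has left
    this crush tree (e.g. mid-migration) instead of asserting."""
    weight = osd_weight.get(osd)
    if weight is None or weight * pgs_per_weight <= 0:
        return None  # caller skips this OSD rather than crashing
    return weight * pgs_per_weight
```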
Updated by Laura Flores 7 months ago
- Tags set to low-hanging-fruit