Bug #42721

mgr/balancer: KeyError messages in balancer module

Added by Nikola Ciprich over 4 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: balancer module
Target version: -
% Done: 0%
Source:
Tags:
Backport: nautilus, mimic
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We've hit a problem with the balancer on two of our clusters.
ceph health suddenly reports:
MGR_MODULE_ERROR Module 'balancer' has failed: (40,)

The manager log then shows the following:

2019-11-01 14:57:44.112 7f497f642700 -1 balancer.serve:
2019-11-01 14:57:44.112 7f497f642700 -1 Traceback (most recent call last):
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 425, in serve
    r, detail = self.optimize(plan)
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 693, in optimize
    return self.do_crush_compat(plan)
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 839, in do_crush_compat
    weight = best_ws[osd]
KeyError: (40,)

We're using 13.2.6 on CentOS 7. We don't see this problem on multiple other clusters running the same version.

If I can provide further details, please let me know.

map.gz (3.88 KB) Nikola Ciprich, 11/14/2019 07:49 PM


Related issues

Duplicated by mgr - Bug #43181: Module 'balancer' has failed: (104,) - with Unhandled Exception Duplicate
Copied to mgr - Backport #44674: nautilus: mgr/balancer: KeyError messages in balancer module Resolved
Copied to mgr - Backport #44675: mimic: mgr/balancer: KeyError messages in balancer module Rejected
Copied to mgr - Bug #49576: mgr/balancer: KeyError messages in balancer module Resolved

History

#1 Updated by Greg Farnum over 4 years ago

  • Project changed from Ceph to mgr
  • Category deleted (common)

#2 Updated by Sage Weil over 4 years ago

  • Subject changed from problem with balancer module to problem with balancer module (mimic)
  • Status changed from New to Need More Info

Can you attach your osdmap and/or crush map? It's not clear to me why there would be a tuple instead of a name here. If you can run 'ceph osd getmap -o map' and then attach the resulting file to this ticket, that would be great. Thanks!

#3 Updated by Nikola Ciprich over 4 years ago

Hi Greg, sure! The map is attached. BR, nik

#4 Updated by Lenz Grimmer over 4 years ago

  • Duplicated by Bug #43181: Module 'balancer' has failed: (104,) - with Unhandled Exception added

#5 Updated by Lenz Grimmer over 4 years ago

  • Affected Versions v14.2.4 added

This seems to affect Nautilus as well - see #43181 for a similar report:

2019-12-06 20:18:01.031 xxxxxxxxxxxx -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'balancer' while running on mgr.xxxxx: (104,)
2019-12-06 20:18:01.031 xxxxxxxxxxxx -1 balancer.serve:
2019-12-06 20:18:01.031 xxxxxxxxxxxx -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/balancer/module.py", line 624, in serve
    r, detail = self.optimize(plan)
  File "/usr/share/ceph/mgr/balancer/module.py", line 891, in optimize
    return self.do_crush_compat(plan)
  File "/usr/share/ceph/mgr/balancer/module.py", line 1053, in do_crush_compat
    weight = best_ws[osd]
KeyError: (104,)

#6 Updated by Lenz Grimmer over 4 years ago

  • Subject changed from problem with balancer module (mimic) to mgr: KeyError messages in balancer module
  • Severity changed from 3 - minor to 2 - major

#7 Updated by Lenz Grimmer over 4 years ago

  • Subject changed from mgr: KeyError messages in balancer module to mgr/balancer: KeyError messages in balancer module
  • Category set to balancer module
  • Backport set to nautilus, mimic

#8 Updated by Nikola Ciprich over 4 years ago

Hi, I just noticed this ticket is still in the Need More Info state. I've provided the requested map; is there anything else I can provide to help?

#9 Updated by Lenz Grimmer about 4 years ago

  • Status changed from Need More Info to New
  • Priority changed from Normal to High

#10 Updated by Sage Weil about 4 years ago

  • Status changed from New to In Progress
  • Assignee set to Sage Weil

Finally figured this out!

The problem is in calc_eval().
- target_by_root does not include osd X because the crush weight is 0, but it has a weight-set weight > 0
- we initialize pgs_by_osd to 0 for each OSD in target_by_root osds
- in the loop over pm (pg_up_by_pool) we encounter a pg that maps to osd X
- pgs_by_osd[X] throws the KeyError because X isn't there
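
The sequence above can be reproduced in isolation. The following is a minimal sketch, not the actual calc_eval() code: the function names, the pool name, and the sample data are hypothetical, and the guard in the second function is only one possible way to avoid the exception, not necessarily what the fix does.

```python
def count_pgs_strict(target_osds, pg_up_by_pool):
    """Mirror the failing pattern: only OSDs that have a target get a counter."""
    pgs_by_osd = {osd: 0 for osd in target_osds}
    for pg_up in pg_up_by_pool.values():
        for up in pg_up.values():
            for osd in up:
                # Raises KeyError when a PG maps to an OSD without a target.
                pgs_by_osd[osd] += 1
    return pgs_by_osd


def count_pgs_tolerant(target_osds, pg_up_by_pool):
    """One possible guard (an assumption): ignore OSDs that have no target."""
    pgs_by_osd = {osd: 0 for osd in target_osds}
    for pg_up in pg_up_by_pool.values():
        for up in pg_up.values():
            for osd in up:
                if osd in pgs_by_osd:
                    pgs_by_osd[osd] += 1
    return pgs_by_osd


# Hypothetical data: osd 40 has CRUSH weight 0, so it is missing from the
# targets, yet one PG is still mapped to it.
target_osds = [1, 2, 3]
pg_up_by_pool = {"rbd": {"1.0": [1, 2, 3], "1.1": [2, 3, 40]}}

try:
    count_pgs_strict(target_osds, pg_up_by_pool)
    strict_error = None
except KeyError as exc:
    strict_error = exc.args  # the missing key, wrapped in a one-element tuple

tolerant = count_pgs_tolerant(target_osds, pg_up_by_pool)
print(strict_error, tolerant)
```

In this sketch the exception's args attribute is the one-element tuple (40,), which has the same shape as the value printed in the health message. Whether that is how the manager formats the error is an assumption.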

#11 Updated by Sage Weil about 4 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 34014

#12 Updated by Sage Weil about 4 years ago

  • Status changed from Fix Under Review to Pending Backport

#13 Updated by Konstantin Shalygin about 4 years ago

  • Copied to Backport #44674: nautilus: mgr/balancer: KeyError messages in balancer module added

#14 Updated by Konstantin Shalygin about 4 years ago

  • Copied to Backport #44675: mimic: mgr/balancer: KeyError messages in balancer module added

#15 Updated by Nikola Ciprich almost 4 years ago

Hi, I just wanted to report that I've hit the same problem on 14.2.8 with the fix applied. I haven't studied the code much further, but maybe there's a similar problem later in the code:

Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: 2020-06-14 21:29:34.173 7fa227f9f700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'balancer' while running on mgr.vfnjazv1
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: 2020-06-14 21:29:34.173 7fa227f9f700 -1 balancer.serve:
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: 2020-06-14 21:29:34.173 7fa227f9f700 -1 Traceback (most recent call last):
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: File "/usr/share/ceph/mgr/balancer/module.py", line 654, in serve
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: r, detail = self.optimize(plan)
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: File "/usr/share/ceph/mgr/balancer/module.py", line 924, in optimize
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: return self.do_crush_compat(plan)
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: File "/usr/share/ceph/mgr/balancer/module.py", line 1085, in do_crush_compat
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: weight = best_ws[osd]
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: KeyError: (76,)

#16 Updated by Nikola Ciprich over 3 years ago

Got it, this other problem was caused by empty buckets:

-28 0 host vfnjazv1-ssd-test
-25 0 root ssdtest
-26 0 host vfnjazv1a-ssd-test
-27 0 host vfnjazv1b-ssd-test
-29 0 host vfnjazv1c-ssd-test

When I removed them, the error went away. So I guess this also needs fixing.

BR
nik
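
For anyone checking their own cluster for the same condition, here is a rough sketch of how buckets without any OSD beneath them could be listed. It assumes a tree shaped like the JSON printed by 'ceph osd tree -f json' (a "nodes" list whose entries carry "id", "name", "type" and, for buckets, "children"); that shape, the helper name, and the sample layout (including the "default" root, "node1" and "osd.0") are assumptions for illustration only.

```python
def find_empty_buckets(tree):
    """Return the names of buckets that have no OSD anywhere below them."""
    nodes = {node["id"]: node for node in tree.get("nodes", [])}

    def has_osd(node_id):
        node = nodes.get(node_id)
        if node is None:
            return False
        if node.get("type") == "osd":
            return True
        return any(has_osd(child) for child in node.get("children", []))

    return sorted(
        node["name"]
        for node in nodes.values()
        if node.get("type") != "osd" and not has_osd(node["id"])
    )


# Hypothetical tree: bucket names and IDs follow the listing above, the
# nesting and the populated "default" root are made up for the example.
sample_tree = {
    "nodes": [
        {"id": -1, "name": "default", "type": "root", "children": [-2]},
        {"id": -2, "name": "node1", "type": "host", "children": [0]},
        {"id": 0, "name": "osd.0", "type": "osd"},
        {"id": -28, "name": "vfnjazv1-ssd-test", "type": "host", "children": []},
        {"id": -25, "name": "ssdtest", "type": "root", "children": [-26, -27, -29]},
        {"id": -26, "name": "vfnjazv1a-ssd-test", "type": "host", "children": []},
        {"id": -27, "name": "vfnjazv1b-ssd-test", "type": "host", "children": []},
        {"id": -29, "name": "vfnjazv1c-ssd-test", "type": "host", "children": []},
    ]
}

empty = find_empty_buckets(sample_tree)
print(empty)
```

On a real cluster the input would presumably come from parsing the command's JSON output with json.loads; verify the field names against your Ceph version before relying on this.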

#17 Updated by Nathan Cutler about 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

#18 Updated by David Zafman about 3 years ago

  • Copied to Bug #49576: mgr/balancer: KeyError messages in balancer module added
