Bug #49576

closed

mgr/balancer: KeyError messages in balancer module

Added by David Zafman about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: balancer module
Target version:
% Done: 0%
Source:
Tags:
Backport: pacific, octopus, nautilus
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We've hit a problem with the balancer on two of our clusters.
ceph health suddenly reports:
MGR_MODULE_ERROR Module 'balancer' has failed: (40,)

The manager log then shows the following:

2019-11-01 14:57:44.112 7f497f642700 -1 balancer.serve:
2019-11-01 14:57:44.112 7f497f642700 -1 Traceback (most recent call last):
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 425, in serve
    r, detail = self.optimize(plan)
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 693, in optimize
    return self.do_crush_compat(plan)
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 839, in do_crush_compat
    weight = best_ws[osd]
KeyError: (40,)

We're using 13.2.6 on CentOS 7. We don't see this problem on multiple other clusters running the same version.

If I can provide further details, please let me know.
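
For readers unfamiliar with the failure mode: the traceback ends in a plain dictionary lookup, best_ws[osd], that raises KeyError when the OSD id has no entry in the weight set. The following is a minimal sketch of a defensive lookup, not the fix that was merged for this issue. The function name collect_weights, the logger name, and the sample data are hypothetical and exist only for illustration.

```python
import logging

log = logging.getLogger("balancer-sketch")  # hypothetical logger name


def collect_weights(osds, best_ws):
    """Map each OSD id to its weight, skipping ids absent from best_ws."""
    weights = {}
    for osd in osds:
        if osd not in best_ws:
            # A bare best_ws[osd] would raise KeyError here,
            # which is what the traceback above shows.
            log.warning("osd.%d has no entry in the weight set, skipping", osd)
            continue
        weights[osd] = best_ws[osd]
    return weights


# Hypothetical weight set in which osd 40 is missing.
best_ws = {0: 1.0, 1: 0.95}
result = collect_weights([0, 1, 40], best_ws)
print(result)
```

Skipping the missing id keeps the loop running, but it also hides the inconsistency between the OSD list and the weight set, so a real fix would more likely address why the entry is absent in the first place.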


Files

map.gz (3.88 KB), Nikola Ciprich, 11/14/2019 07:49 PM

Related issues 5 (0 open, 5 closed)

Has duplicate mgr - Bug #49535: nautilus: mgr/balancer: KeyError messages in balancer module (Duplicate)
Copied from mgr - Bug #42721: mgr/balancer: KeyError messages in balancer module (Resolved, Sage Weil)
Copied to mgr - Backport #49759: nautilus: mgr/balancer: KeyError messages in balancer module (Resolved, Neha Ojha)
Copied to mgr - Backport #49760: pacific: mgr/balancer: KeyError messages in balancer module (Resolved, Neha Ojha)
Copied to mgr - Backport #49761: octopus: mgr/balancer: KeyError messages in balancer module (Resolved, Neha Ojha)
Actions #1

Updated by David Zafman about 3 years ago

  • Copied from Bug #42721: mgr/balancer: KeyError messages in balancer module added
Actions #2

Updated by David Zafman about 3 years ago

Updated by Nikola Ciprich 9 months ago

Hi, I just wanted to report that I've hit the same problem on 14.2.8 with the fix applied. I haven't studied the code in much detail, but maybe there's a similar problem further on in the code:

Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: 2020-06-14 21:29:34.173 7fa227f9f700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'balancer' while running on mgr.vfnjazv1
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: 2020-06-14 21:29:34.173 7fa227f9f700 -1 balancer.serve:
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: 2020-06-14 21:29:34.173 7fa227f9f700 -1 Traceback (most recent call last):
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: File "/usr/share/ceph/mgr/balancer/module.py", line 654, in serve
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: r, detail = self.optimize(plan)
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: File "/usr/share/ceph/mgr/balancer/module.py", line 924, in optimize
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: return self.do_crush_compat(plan)
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: File "/usr/share/ceph/mgr/balancer/module.py", line 1085, in do_crush_compat
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: weight = best_ws[osd]
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: KeyError: (76,)

Actions #3

Updated by David Zafman about 3 years ago

Updated by Nikola Ciprich 7 months ago

Got it, this other problem was caused by empty buckets:

-28 0 host vfnjazv1-ssd-test
-25 0 root ssdtest
-26 0 host vfnjazv1a-ssd-test
-27 0 host vfnjazv1b-ssd-test
-29 0 host vfnjazv1c-ssd-test

When I removed them, the error went away. So I guess this also needs fixing.

BR
nik

Actions #4

Updated by Neha Ojha about 3 years ago

  • Status changed from Resolved to New
  • Backport changed from nautilus, mimic to pacific, octopus, nautilus

It seems like this could happen when a bucket is present but the choose_args section of the crushmap doesn't include it. This is roughly the opposite of https://tracker.ceph.com/issues/24167.
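
The condition described here can be checked outside the balancer. The sketch below lists buckets that have no entry in a choose_args set. It assumes a dictionary laid out like the JSON printed by `ceph osd crush dump`, with a "buckets" list and a "choose_args" mapping whose entries carry a "bucket_id"; that layout, the function name, and the sample ids are assumptions for illustration, not taken from the balancer code.

```python
def buckets_missing_from_choose_args(crush_dump, choose_args_name="-1"):
    """Return sorted ids of buckets with no entry in the given choose_args set."""
    bucket_ids = {b["id"] for b in crush_dump.get("buckets", [])}
    entries = crush_dump.get("choose_args", {}).get(choose_args_name, [])
    covered = {e["bucket_id"] for e in entries}
    return sorted(bucket_ids - covered)


# Hypothetical, heavily trimmed crush dump: two buckets are not covered.
sample_dump = {
    "buckets": [
        {"id": -1, "name": "default"},
        {"id": -25, "name": "ssdtest"},
        {"id": -28, "name": "vfnjazv1-ssd-test"},
    ],
    "choose_args": {
        "-1": [{"bucket_id": -1, "weight_set": [[1.0, 1.0]]}],
    },
}

missing = buckets_missing_from_choose_args(sample_dump)
print(missing)
```

A non-empty result means at least one bucket is present in the map without a matching choose_args entry, which is the situation this comment suspects.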

Actions #5

Updated by Neha Ojha about 3 years ago

  • Has duplicate Bug #49535: nautilus: mgr/balancer: KeyError messages in balancer module added
Actions #6

Updated by Neha Ojha about 3 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID changed from 34014 to 40007
Actions #7

Updated by Neha Ojha about 3 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #8

Updated by Backport Bot about 3 years ago

  • Copied to Backport #49759: nautilus: mgr/balancer: KeyError messages in balancer module added
Actions #9

Updated by Backport Bot about 3 years ago

  • Copied to Backport #49760: pacific: mgr/balancer: KeyError messages in balancer module added
Actions #10

Updated by Backport Bot about 3 years ago

  • Copied to Backport #49761: octopus: mgr/balancer: KeyError messages in balancer module added
Actions #11

Updated by Loïc Dachary about 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
