Bug #49576

mgr/balancer: KeyError messages in balancer module

Added by David Zafman about 2 months ago. Updated 12 days ago.

Status: Resolved
Priority: High
Assignee:
Category: balancer module
Target version:
% Done: 0%
Source:
Tags:
Backport: pacific, octopus, nautilus
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We've hit a problem with the balancer on two of our clusters.
ceph health suddenly reports:
MGR_MODULE_ERROR Module 'balancer' has failed: (40,)

The manager log then shows the following:

2019-11-01 14:57:44.112 7f497f642700 -1 balancer.serve:
2019-11-01 14:57:44.112 7f497f642700 -1 Traceback (most recent call last):
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 425, in serve
    r, detail = self.optimize(plan)
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 693, in optimize
    return self.do_crush_compat(plan)
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 839, in do_crush_compat
    weight = best_ws[osd]
KeyError: (40,)

We're using 13.2.6 on CentOS 7. We don't see this problem on multiple other clusters running the same version.

If I can provide further details, please let me know.

map.gz (3.88 KB) Nikola Ciprich, 11/14/2019 07:49 PM


Related issues

Duplicated by mgr - Bug #49535: nautilus: mgr/balancer: KeyError messages in balancer module Duplicate
Copied from mgr - Bug #42721: mgr/balancer: KeyError messages in balancer module Resolved
Copied to mgr - Backport #49759: nautilus: mgr/balancer: KeyError messages in balancer module Resolved
Copied to mgr - Backport #49760: pacific: mgr/balancer: KeyError messages in balancer module Resolved
Copied to mgr - Backport #49761: octopus: mgr/balancer: KeyError messages in balancer module Resolved

History

#1 Updated by David Zafman about 2 months ago

  • Copied from Bug #42721: mgr/balancer: KeyError messages in balancer module added

#2 Updated by David Zafman about 2 months ago

Updated by Nikola Ciprich 9 months ago

Hi, I just wanted to report that I've hit the same problem on 14.2.8 with the fix applied. I haven't studied the code much further, but maybe there's a similar problem later in the code:

Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: 2020-06-14 21:29:34.173 7fa227f9f700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'balancer' while running on mgr.vfnjazv1
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: 2020-06-14 21:29:34.173 7fa227f9f700 -1 balancer.serve:
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: 2020-06-14 21:29:34.173 7fa227f9f700 -1 Traceback (most recent call last):
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: File "/usr/share/ceph/mgr/balancer/module.py", line 654, in serve
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: r, detail = self.optimize(plan)
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: File "/usr/share/ceph/mgr/balancer/module.py", line 924, in optimize
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: return self.do_crush_compat(plan)
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: File "/usr/share/ceph/mgr/balancer/module.py", line 1085, in do_crush_compat
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: weight = best_ws[osd]
Jun 14 21:29:34 vfnjazv1a ceph-mgr2027947: KeyError: (76,)

#3 Updated by David Zafman about 2 months ago

Updated by Nikola Ciprich 7 months ago

Got it, this other problem was caused by empty buckets:

-28 0 host vfnjazv1-ssd-test
-25 0 root ssdtest
-26 0 host vfnjazv1a-ssd-test
-27 0 host vfnjazv1b-ssd-test
-29 0 host vfnjazv1c-ssd-test

When I removed them, the error went away. So I guess this also needs fixing.

BR
nik
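
For anyone wanting to check their own cluster for the empty buckets described above, here is a minimal sketch. It assumes the CRUSH tree is available as a dict shaped like the JSON printed by `ceph osd tree -f json` (a `nodes` list whose entries carry `id`, `name`, and optionally `children`); that shape, the helper name `find_empty_buckets`, and the sample data are assumptions for illustration, not something taken from this report. Buckets are identified by their negative ids, as in the listing above.

```python
def find_empty_buckets(tree):
    """Return the names of buckets that contain no OSD anywhere below them.

    `tree` is assumed to look like the output of `ceph osd tree -f json`:
    {"nodes": [{"id": ..., "name": ..., "children": [...]}, ...]}.
    """
    nodes = {n["id"]: n for n in tree.get("nodes", [])}

    def has_osd(node_id):
        # OSDs have non-negative ids; buckets have negative ids.
        if node_id >= 0:
            return True
        node = nodes.get(node_id)
        if node is None:
            return False
        return any(has_osd(child) for child in node.get("children", []))

    return sorted(
        n["name"] for n in nodes.values() if n["id"] < 0 and not has_osd(n["id"])
    )


# Hypothetical sample modelled on the listing above (trimmed to two hosts).
sample = {
    "nodes": [
        {"id": -1, "name": "default", "type": "root", "children": [-2]},
        {"id": -2, "name": "host-a", "type": "host", "children": [0]},
        {"id": 0, "name": "osd.0", "type": "osd"},
        {"id": -25, "name": "ssdtest", "type": "root", "children": [-28, -26]},
        {"id": -28, "name": "vfnjazv1-ssd-test", "type": "host", "children": []},
        {"id": -26, "name": "vfnjazv1a-ssd-test", "type": "host", "children": []},
    ]
}

print(find_empty_buckets(sample))
```

A root such as `ssdtest` is reported too, because every host under it is empty.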

#4 Updated by Neha Ojha about 2 months ago

  • Status changed from Resolved to New
  • Backport changed from nautilus, mimic to pacific, octopus, nautilus

It seems like this could happen when a bucket is present but the choose_args section of the crushmap doesn't include it. This is roughly the opposite of https://tracker.ceph.com/issues/24167.
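
The mismatch described here can be illustrated with a small stand-alone sketch. This is not the balancer code and not the fix from the pull request; `weight_set`, `crush_items`, and `lookup_weights` are hypothetical names. The idea is only that a plain dict lookup fails for any id the weight data does not cover, while a membership check lets the caller see which ids are missing.

```python
def lookup_weights(item_ids, weight_set):
    """Split ids into those that have a weight entry and those that do not."""
    found, missing = {}, []
    for item in item_ids:
        if item in weight_set:
            found[item] = weight_set[item]
        else:
            missing.append(item)
    return found, missing


# Hypothetical data: the weight set covers ids 0 and 1 only,
# while the CRUSH map also contains id 40.
weight_set = {0: 1.0, 1: 1.0}
crush_items = [0, 1, 40]

# A direct lookup raises KeyError for the uncovered id.
try:
    weights = {i: weight_set[i] for i in crush_items}
except KeyError as exc:
    print("KeyError:", exc.args)

# The defensive variant reports the gap instead of raising.
found, missing = lookup_weights(crush_items, weight_set)
print(found, missing)
```

Whether skipping, defaulting, or aborting is the right reaction to a missing id depends on the caller; the sketch only separates the two cases.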

#5 Updated by Neha Ojha about 2 months ago

  • Duplicated by Bug #49535: nautilus: mgr/balancer: KeyError messages in balancer module added

#6 Updated by Neha Ojha about 1 month ago

  • Status changed from New to Fix Under Review
  • Pull request ID changed from 34014 to 40007

#7 Updated by Neha Ojha about 1 month ago

  • Status changed from Fix Under Review to Pending Backport

#8 Updated by Backport Bot about 1 month ago

  • Copied to Backport #49759: nautilus: mgr/balancer: KeyError messages in balancer module added

#9 Updated by Backport Bot about 1 month ago

  • Copied to Backport #49760: pacific: mgr/balancer: KeyError messages in balancer module added

#10 Updated by Backport Bot about 1 month ago

  • Copied to Backport #49761: octopus: mgr/balancer: KeyError messages in balancer module added

#11 Updated by Loïc Dachary 12 days ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".