Bug #25183

closed

The ceph-mgr balancer hangs when attempting to balance the cluster

Added by Bryan Stillwell over 5 years ago. Updated almost 3 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
balancer module
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Problem:
When using the ceph-mgr balancer in 13.2.1 (and 12.2.5 previously), trying to create an optimized plan results in the ceph-mgr hanging.

I'm using the upmap mode for the balancer:

# ceph balancer status
{
    "active": false,
    "plans": [],
    "mode": "upmap"
}

Log messages look like this:
2018-07-30 15:45:14.979 7fe096cca700 1 mgr[balancer] Handling command: '{'prefix': 'balancer optimize', 'plan': 'run20180730', 'target': ['mgr', '']}'
2018-07-30 15:45:15.063 7fe096cca700 4 mgr[balancer] Optimize plan run20180730
2018-07-30 15:45:15.063 7fe096cca700 4 mgr get_config get_config key: mgr/balancer/mode
2018-07-30 15:45:15.063 7fe096cca700 4 mgr get_config get_config key: mgr/balancer/max_misplaced
2018-07-30 15:45:15.063 7fe096cca700 4 mgr[balancer] Mode upmap, max misplaced 0.010000
2018-07-30 15:45:15.063 7fe096cca700 4 mgr[balancer] do_upmap
2018-07-30 15:45:15.063 7fe096cca700 4 mgr get_config get_config key: mgr/balancer/upmap_max_iterations
2018-07-30 15:45:15.063 7fe096cca700 4 ceph_config_get upmap_max_iterations not found
2018-07-30 15:45:15.067 7fe096cca700 4 mgr get_config get_config key: mgr/balancer/upmap_max_deviation
2018-07-30 15:45:15.067 7fe096cca700 4 ceph_config_get upmap_max_deviation not found
2018-07-30 15:45:15.067 7fe096cca700 4 mgr[balancer] pools ['rbd', 'cephfs_data_ec42', 'cephfs_data', 'cephfs_metadata']

Nothing else related to balancing is seen after that.

Expected result:
Another pass is done by the balancer to bring the cluster a step closer to being balanced.

Additional notes:
Trying to manually optimize the cluster results in a segfault:

# ceph osd getmap -o osdmap-20180730.bin
got osdmap epoch 101015
# osdmaptool osdmap-20180730.bin --upmap upmaps-20180730.txt
osdmaptool: osdmap file 'osdmap-20180730.bin'
writing upmap command output to: upmaps-20180730.txt
checking for upmap cleanups
upmap, max-count 100, max deviation 0.01
*** Caught signal (Segmentation fault) **
 in thread 7fe69b6da8c0 thread_name:osdmaptool
Segmentation fault (core dumped)
Actions #1

Updated by Bryan Stillwell over 5 years ago

Just saw the note in the docs on how to enable debugging with osdmaptool:

# osdmaptool osdmap-20180730.bin --debug-osd 10 --upmap upmaps-20180730.txt
osdmaptool: osdmap file 'osdmap-20180730.bin'
writing upmap command output to: upmaps-20180730.txt
checking for upmap cleanups
2018-07-30 16:07:39.434 7f7a4c1618c0 10 clean_pg_upmaps
upmap, max-count 100, max deviation 0.01
2018-07-30 16:07:39.442 7f7a4c1618c0 10 osd_weight_total 4
2018-07-30 16:07:39.442 7f7a4c1618c0 10 pgs_per_weight 672
2018-07-30 16:07:39.442 7f7a4c1618c0 10 total_deviation 156.119 overfull 1,2,5,6,9,10,11,12 underfull [17,18,4,0,8]
2018-07-30 16:07:39.442 7f7a4c1618c0 10 osd.5 move 19
2018-07-30 16:07:39.442 7f7a4c1618c0 10 dropping pg_upmap_items 2.3 [17,5,5,11,4,5]
2018-07-30 16:07:39.442 7f7a4c1618c0 10 dropping pg_upmap_items 2.3 [961017152,22060,5,11,4,5]
*** Caught signal (Segmentation fault) **
 in thread 7f7a4c1618c0 thread_name:osdmaptool
Segmentation fault (core dumped)
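
The garbage OSD ids in the last `dropping pg_upmap_items` line (961017152 and 22060) point at corrupt upmap entries in the osdmap. As a hedged sketch, not part of the original report, one way to look for such entries is to parse the output of `ceph osd dump --format json` and flag any `pg_upmap_items` mapping whose OSD id falls outside `[0, max_osd)`; this assumes the JSON layout of that command (a top-level `max_osd` field and `pg_upmap_items` entries with `from`/`to` mappings):

```python
import json

def find_bad_upmap_items(osd_dump_json):
    """Return pgids whose pg_upmap_items reference OSD ids that
    cannot exist (negative or >= max_osd)."""
    dump = json.loads(osd_dump_json)
    max_osd = dump["max_osd"]
    bad = []
    for item in dump.get("pg_upmap_items", []):
        osds = [osd for m in item["mappings"]
                for osd in (m["from"], m["to"])]
        if any(o < 0 or o >= max_osd for o in osds):
            bad.append(item["pgid"])
    return bad

# Minimal example mirroring the log line above: the entry for pg 2.3
# carries OSD ids far beyond max_osd, so it is flagged as invalid.
sample = json.dumps({
    "max_osd": 20,
    "pg_upmap_items": [
        {"pgid": "2.3", "mappings": [
            {"from": 961017152, "to": 22060},
            {"from": 5, "to": 11},
            {"from": 4, "to": 5},
        ]},
        {"pgid": "1.0", "mappings": [{"from": 17, "to": 5}]},
    ],
})
print(find_bad_upmap_items(sample))  # -> ['2.3']
```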
Actions #2

Updated by John Spray over 5 years ago

  • Project changed from Ceph to mgr

To clarify, when you say "Nothing else related to balancing is seen after that.", you mean that the cluster (including ceph-mgr) is otherwise functioning normally, you're just not getting the expected balancing activity?

Actions #3

Updated by Bryan Stillwell over 5 years ago

What I meant was that there were no more logs reported after that which appeared to be related to balancing the cluster. The ceph-mgr process was still running, but it didn't seem healthy. I would describe it in more detail, but it would be from memory and I might not be completely accurate.

Based on the log message "dropping pg_upmap_items 2.3 [961017152,22060,5,11,4,5]", I decided to try removing a bunch of the upmap entries since those OSD numbers definitely weren't valid. After doing that it seems to have cleared up my problem! :)
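
For reference, a hedged sketch of the cleanup step described above: emit one `ceph osd rm-pg-upmap-items <pgid>` command (available since Luminous) per affected PG. The helper only builds the command lines; it does not talk to the cluster, and the pgid list is whatever you identified as invalid:

```python
def rm_upmap_commands(bad_pgids):
    """Build cleanup commands for PGs carrying invalid upmap entries."""
    return ["ceph osd rm-pg-upmap-items {}".format(pgid)
            for pgid in bad_pgids]

for cmd in rm_upmap_commands(["2.3"]):
    print(cmd)  # -> ceph osd rm-pg-upmap-items 2.3
```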

Actions #4

Updated by Sebastian Wagner about 5 years ago

  • Category set to balancer module
Actions #5

Updated by Bryan Stillwell about 5 years ago

I haven't had a problem again since removing the erroneous pg_upmap_items entries. I'm also now on Nautilus 14.2.0.

Actions #6

Updated by Konstantin Shalygin almost 3 years ago

  • Status changed from New to Can't reproduce