Bug #23921

pg-upmap cannot balance in some case

Added by huang jun about 1 year ago. Updated 12 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
Start date:
04/28/2018
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous, mimic
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

I have a cluster with 21 OSDs; the cluster topology is:

ID  CLASS WEIGHT   TYPE NAME               STATUS REWEIGHT PRI-AFF 
 -5       21.00000 root test                                       
 -7       11.00000     datacenter dc-1                             
 -9       11.00000         rack rack-1                             
-11        5.00000             host host-1                         
  5   hdd  1.00000                 osd.5       up  0.50000 1.00000 
  6   hdd  1.00000                 osd.6       up  1.00000 1.00000 
  7   hdd  1.00000                 osd.7       up  1.00000 1.00000 
  8   hdd  1.00000                 osd.8       up  1.00000 1.00000 
  9   hdd  1.00000                 osd.9       up  1.00000 1.00000 
-12        2.00000             host host-2                         
 16   hdd  1.00000                 osd.16      up  1.00000 1.00000 
 17   hdd  1.00000                 osd.17      up  1.00000 1.00000 
-13        2.00000             host host-3                         
 15   hdd  1.00000                 osd.15      up  1.00000 1.00000 
 18   hdd  1.00000                 osd.18      up  1.00000 1.00000 
-14        2.00000             host host-4                         
 19   hdd  1.00000                 osd.19      up  1.00000 1.00000 
 20   hdd  1.00000                 osd.20      up  1.00000 1.00000 
 -8       10.00000     datacenter dc-2                             
-10       10.00000         rack rack-2                             
-15        5.00000             host host-5                         
 10   hdd  1.00000                 osd.10      up  1.00000 1.00000 
 11   hdd  1.00000                 osd.11      up  1.00000 1.00000 
 12   hdd  1.00000                 osd.12      up  1.00000 1.00000 
 13   hdd  1.00000                 osd.13      up  1.00000 1.00000 
 14   hdd  1.00000                 osd.14      up  1.00000 1.00000 
-16        5.00000             host host-6                         
  0   hdd  1.00000                 osd.0       up  1.00000 1.00000 
  1   hdd  1.00000                 osd.1       up  1.00000 1.00000 
  2   hdd  1.00000                 osd.2       up  1.00000 1.00000 
  3   hdd  1.00000                 osd.3       up  1.00000 1.00000
  4   hdd  1.00000                 osd.4       up  1.00000 1.00000
 -1       21.00000 root default                                    
 -2       21.00000     host huangjun                               
  0   hdd  1.00000         osd.0               up  1.00000 1.00000 
  1   hdd  1.00000         osd.1               up  1.00000 1.00000 
  2   hdd  1.00000         osd.2               up  1.00000 1.00000 
  3   hdd  1.00000         osd.3               up  1.00000 1.00000 
  4   hdd  1.00000         osd.4               up  1.00000 1.00000 
  5   hdd  1.00000         osd.5               up  0.50000 1.00000 
  6   hdd  1.00000         osd.6               up  1.00000 1.00000 
  7   hdd  1.00000         osd.7               up  1.00000 1.00000 
  8   hdd  1.00000         osd.8               up  1.00000 1.00000 
  9   hdd  1.00000         osd.9               up  1.00000 1.00000 
 10   hdd  1.00000         osd.10              up  1.00000 1.00000 
 11   hdd  1.00000         osd.11              up  1.00000 1.00000 
 12   hdd  1.00000         osd.12              up  1.00000 1.00000 
 13   hdd  1.00000         osd.13              up  1.00000 1.00000 
 14   hdd  1.00000         osd.14              up  1.00000 1.00000 
 15   hdd  1.00000         osd.15              up  1.00000 1.00000 
 16   hdd  1.00000         osd.16              up  1.00000 1.00000 
 17   hdd  1.00000         osd.17              up  1.00000 1.00000 
 18   hdd  1.00000         osd.18              up  1.00000 1.00000 
 19   hdd  1.00000         osd.19              up  1.00000 1.00000 
 20   hdd  1.00000         osd.20              up  1.00000 1.00000  

I created a pool with 1024 PGs and replicated size 2.
After the remap, ceph osd df shows no change:

ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE VAR  PGS 
 5   hdd 1.00000  0.50000   981M 34176k   948M 3.40 1.00  40 
 6   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00  99 
 7   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00 109 
 8   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00 121 
 9   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00  95 
16   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00  82 
17   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00  91 
15   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00  95 
18   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00  93 
19   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00 100 
20   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00  99 
10   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00  85 
11   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00  94 
12   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00  81 
13   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00 118 
14   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00 102 
 0   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00 107 
 1   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00 113 
 2   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00 106 
 3   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00 110 
 4   hdd 1.00000  1.00000   981M 34176k   948M 3.40 1.00 108 

I checked the log:

2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.0 weight 0.1 pgs 107
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.1 weight 0.1 pgs 113
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.2 weight 0.1 pgs 106
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.3 weight 0.1 pgs 110
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.4 weight 0.1 pgs 108
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.5 weight 0.0454545 pgs 40
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.6 weight 0.0909091 pgs 99
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.7 weight 0.0909091 pgs 109
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.8 weight 0.0909091 pgs 121
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.9 weight 0.0909091 pgs 95
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.10 weight 0.1 pgs 85
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.11 weight 0.1 pgs 94
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.12 weight 0.1 pgs 81
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.13 weight 0.1 pgs 118
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.14 weight 0.1 pgs 102
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.15 weight 0.0909091 pgs 95
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.16 weight 0.0909091 pgs 82
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.17 weight 0.0909091 pgs 91
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.18 weight 0.0909091 pgs 93
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.19 weight 0.0909091 pgs 100
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.20 weight 0.0909091 pgs 99
2018-04-28 11:50:39.661 7f87a8cfd700 10  osd_weight_total 1.95455
2018-04-28 11:50:39.661 7f87a8cfd700 10  pgs_per_weight 1047.81
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.0  pgs 107 target 104.781  deviation 2.21863
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.1  pgs 113 target 104.781  deviation 8.21863
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.2  pgs 106 target 104.781  deviation 1.21863
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.3  pgs 110 target 104.781  deviation 5.21863
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.4  pgs 108 target 104.781  deviation 3.21863
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.5  pgs 40  target 47.6279  deviation -7.6279
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.6  pgs 99  target 95.2558  deviation 3.7442
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.7  pgs 109 target 95.2558  deviation 13.7442
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.8  pgs 121 target 95.2558  deviation 25.7442
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.9  pgs 95  target 95.2558  deviation -0.255798
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.10 pgs 85  target 104.781  deviation -19.7814
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.11 pgs 94  target 104.781  deviation -10.7814
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.12 pgs 81  target 104.781  deviation -23.7814
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.13 pgs 118 target 104.781  deviation 13.2186
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.14 pgs 102 target 104.781  deviation -2.78137
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.15 pgs 95  target 95.2558  deviation -0.255798
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.16 pgs 82  target 95.2558  deviation -13.2558
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.17 pgs 91  target 95.2558  deviation -4.2558
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.18 pgs 93  target 95.2558  deviation -2.2558
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.19 pgs 100 target 95.2558  deviation 4.7442
2018-04-28 11:50:39.661 7f87a8cfd700 20  osd.20 pgs 99  target 95.2558  deviation 3.7442
2018-04-28 11:50:39.661 7f87a8cfd700 10  total_deviation 170.065 overfull 0,1,2,3,4,6,7,8,13,19,20 underfull [12,10,16,11,5,17,14,18]
2018-04-28 11:50:39.661 7f87a8cfd700 10  osd.8 move 25
2018-04-28 11:50:39.661 7f87a8cfd700 10   trying 1.0
2018-04-28 11:50:39.661 7f87a8cfd700 10 try_pg_upmap
2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule ruleno 1 numrep 2 overfull 0,1,2,3,4,6,7,8,13,19,20 underfull [12,10,16,11,5,17,14,18] orig [8,13]
2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule step 0 w []
2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule take [-9]
2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule step 1 w [-9]
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack stack [1,1,0,1] orig [8,13] at 8 pw [-9]
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack cumulative_fanout [1,1]
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 12 type 1 is -2
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 10 type 1 is -15
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 16 type 1 is -2
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 11 type 1 is -2
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 5 type 1 is -2
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 17 type 1 is -2
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 14 type 1 is -2
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 18 type 1 is -2
2018-04-28 11:50:39.661 7f87a8cfd700 20 _choose_type_stack underfull_buckets [-15,-2]
2018-04-28 11:50:39.661 7f87a8cfd700 10  level 0: type 1 fanout 1 cumulative 1 w [-9]
2018-04-28 11:50:39.661 7f87a8cfd700 10  from -9
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack   from 13 got -2 of type 1 over leaves 8
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack  w <- [-2] was [-9]
2018-04-28 11:50:39.661 7f87a8cfd700 10  level 1: type 0 fanout 1 cumulative 1 w [-2]
2018-04-28 11:50:39.661 7f87a8cfd700 10  from -2
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack pos 0 was 8 considering 12
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack pos 0 replace 8 -> 12
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack  w <- [12] was [-2]
2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule step 2 w [12]
2018-04-28 11:50:39.661 7f87a8cfd700 10  emit [12]
2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule step 3 w []
2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule take [-10]
2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule step 4 w [-10]
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack stack [1,1,0,1] orig [8,13] at 13 pw [-10]
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack cumulative_fanout [1,1]
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 12 type 1 is -2
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 10 type 1 is -15
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 16 type 1 is -2
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 11 type 1 is -2
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 5 type 1 is -2
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 17 type 1 is -2
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 14 type 1 is -2
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 18 type 1 is -2
2018-04-28 11:50:39.661 7f87a8cfd700 20 _choose_type_stack underfull_buckets [-15,-2]
2018-04-28 11:50:39.661 7f87a8cfd700 10  level 0: type 1 fanout 1 cumulative 1 w [-10]
2018-04-28 11:50:39.661 7f87a8cfd700 10  from -10
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack   from -1142358840 got -2 of type 1 over leaves 13
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack  w <- [-2] was [-10]
2018-04-28 11:50:39.661 7f87a8cfd700 10  level 1: type 0 fanout 1 cumulative 1 w [-2]
2018-04-28 11:50:39.661 7f87a8cfd700 10  from -2
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack pos 0 was 13 considering 12
2018-04-28 11:50:39.661 7f87a8cfd700 20 _choose_type_stack   in used 12
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack pos 0 was 13 considering 10
2018-04-28 11:50:39.661 7f87a8cfd700 20 _choose_type_stack   not in subtree -2
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack pos 0 was 13 considering 16
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack pos 0 replace 13 -> 16
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack end of orig, break 1
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack end of orig, break 2
2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack  w <- [16] was [-2]
2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule step 5 w [16]
2018-04-28 11:50:39.661 7f87a8cfd700 10  emit [16]
2018-04-28 11:50:39.661 7f87a8cfd700 10 try_pg_upmap orig [8,13], out [12,16]
2018-04-28 11:50:39.661 7f87a8cfd700 10   1.0 [8,13] -> [12,16]
2018-04-28 11:50:39.661 7f87a8cfd700 10   1.0 pg_upmap_items [8,12,13,16]
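For reference, the target and deviation figures in the balancer log above follow directly from the weights: pgs_per_weight = total_pgs / osd_weight_total, target = weight * pgs_per_weight, deviation = pgs - target. A small Python sketch reproducing the arithmetic (weights and PG counts copied from the log; the variable names mirror the log fields, not the actual OSDMap code):

```python
# osd id -> (weight as seen by the balancer, current PG count), from the log
osds = {
    0: (0.1, 107), 1: (0.1, 113), 2: (0.1, 106), 3: (0.1, 110), 4: (0.1, 108),
    5: (0.0454545, 40), 6: (0.0909091, 99), 7: (0.0909091, 109),
    8: (0.0909091, 121), 9: (0.0909091, 95),
    10: (0.1, 85), 11: (0.1, 94), 12: (0.1, 81), 13: (0.1, 118), 14: (0.1, 102),
    15: (0.0909091, 95), 16: (0.0909091, 82), 17: (0.0909091, 91),
    18: (0.0909091, 93), 19: (0.0909091, 100), 20: (0.0909091, 99),
}

osd_weight_total = sum(w for w, _ in osds.values())   # ~1.95455, as logged
total_pgs = sum(p for _, p in osds.values())          # 1024 PGs x size 2 = 2048
pgs_per_weight = total_pgs / osd_weight_total         # ~1047.81, as logged

for osd, (weight, pgs) in sorted(osds.items()):
    target = weight * pgs_per_weight
    deviation = pgs - target
    print(f"osd.{osd} pgs {pgs} target {target:.4f} deviation {deviation:.5f}")
```

This reproduces, for example, osd.8's target 95.2558 and deviation 25.7442, which is why osd.8 is picked first for a move.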

2018-04-28 11:50:39.716 7f87ab502700 10 maybe_remove_pg_upmaps
2018-04-28 11:50:39.716 7f87ab502700 10 maybe_remove_pg_upmaps pg 1.0 crush-rule-id 1 weight_map {0=0.1,1=0.1,2=0.1,3=0.1,4=0.1,5=0.0909091,6=0.0909091,7=0.0909091,8=0.0909091,9=0.0909091,10=0.1,11=0.1,12=0.1,13=0.1,14=0.1,15=0.0909091,16=0.0909091,17=0.0909091,18=0.0909091,19=0.0909091,20=0.0909091} failure-domain-type 1
2018-04-28 11:50:39.716 7f87ab502700 10 maybe_remove_pg_upmaps pg 1.0 osd 12 parent -2
2018-04-28 11:50:39.716 7f87ab502700 10 maybe_remove_pg_upmaps pg 1.0 osd 16 parent -2
2018-04-28 11:50:39.717 7f87ab502700 10 maybe_remove_pg_upmaps cancel invalid pending pg_upmap_items entry 1.0->[8,12,13,16]

PG 1.0 is remapped from [8,13] to [12,16].
Under root bucket 'test', osd.12 and osd.16 are not in the same host,
yet they resolve to the same parent -2, which is weird, so the pending upmap items are cleared.
That is because osd.12 and osd.16 are both in host 'huangjun' under root 'default', which is not used by pool 'test'.
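A minimal model of the mistake, assuming the parent lookup in maybe_remove_pg_upmaps resolves an OSD's failure-domain parent by the first match anywhere in the crush map instead of restricting itself to the subtree the rule actually takes (the helper names below are hypothetical; the real check lives in C++ in OSDMap):

```python
# Each OSD is linked in two trees: root 'test' (used by the pool's crush
# rule) and root 'default' (not used). Parent ids follow `ceph osd tree`.
parents_in_test = {12: -15, 16: -12}     # host-5 (-15), host-2 (-12)
parents_in_default = {12: -2, 16: -2}    # both under host 'huangjun' (-2)

def parent_first_match(osd):
    # Buggy behavior (hypothetical model): whichever tree answers first
    # wins -- here 'default' does, so both OSDs appear to share parent -2.
    return parents_in_default.get(osd, parents_in_test.get(osd))

def parent_in_rule_subtree(osd):
    # Intended behavior: only consult the subtree the rule's 'take' covers.
    return parents_in_test[osd]

# The buggy lookup sees a failure-domain collision and cancels the upmap...
assert parent_first_match(12) == parent_first_match(16) == -2
# ...while within root 'test' the two OSDs sit in distinct hosts.
assert parent_in_rule_subtree(12) != parent_in_rule_subtree(16)
```

So a perfectly valid remap, [8,13] -> [12,16], is rejected purely because of an unrelated second crush hierarchy.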

pool 1 'test' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 104 lfor 0/102 flags hashpspool stripe_width 0 async_recovery_max_updates 200 osd_full_ratio 0.9

The crush rule dump is:
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default" 
            },
            {
                "op": "choose_firstn",
                "num": 0,
                "type": "osd" 
            },
            {
                "op": "emit" 
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "test",
        "ruleset": 1,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -9,
                "item_name": "rack-1" 
            },
            {
                "op": "chooseleaf_firstn",
                "num": 1,
                "type": "host" 
            },
            {
                "op": "emit" 
            },
            {
                "op": "take",
                "item": -10,
                "item_name": "rack-2" 
            },
            {
                "op": "chooseleaf_firstn",
                "num": 1,
                "type": "host" 
            },
            {
                "op": "emit" 
            }
        ]
    }
]
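Rule 1 places one replica per rack: take rack-1, chooseleaf one host, emit; then take rack-2, chooseleaf one host, emit. A rough Python sketch of that placement (random selection stands in for the real CRUSH hashing; bucket contents copied from the `ceph osd tree` output above):

```python
import random

# hosts per rack under root 'test', from the tree above
rack1_hosts = {"host-1": [5, 6, 7, 8, 9], "host-2": [16, 17],
               "host-3": [15, 18], "host-4": [19, 20]}
rack2_hosts = {"host-5": [10, 11, 12, 13, 14], "host-6": [0, 1, 2, 3, 4]}

def place_pg(rng):
    # For each 'take <rack>' step: chooseleaf_firstn num 1 type host,
    # i.e. pick one host in the rack, then one leaf OSD inside it; emit.
    acting = []
    for rack in (rack1_hosts, rack2_hosts):
        host = rng.choice(sorted(rack))
        acting.append(rng.choice(rack[host]))
    return acting

pg = place_pg(random.Random(0))
print(pg)  # one OSD from rack-1 followed by one OSD from rack-2
```

This is why the original acting set [8,13] is one OSD from each rack, and why any replacement the balancer proposes must also keep the two replicas in distinct hosts of the two racks.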


Related issues

Copied to RADOS - Backport #24026: mimic: pg-upmap cannot balance in some case Resolved
Copied to RADOS - Backport #24048: luminous: pg-upmap cannot balance in some case Resolved

History

#1 Updated by huang jun about 1 year ago

But if I unlink all OSDs from host 'huangjun' under root 'default', everything works fine.

for i in `seq 0 20`; do ./bin/ceph osd crush unlink osd.$i huangjun; done

#2 Updated by xie xingguo about 1 year ago

  • Project changed from mgr to RADOS
  • Category set to Correctness/Safety
  • Assignee set to xie xingguo
  • Severity changed from 3 - minor to 2 - major

#4 Updated by xie xingguo about 1 year ago

  • Status changed from New to Need Review

#5 Updated by Kefu Chai about 1 year ago

  • Copied to Backport #24026: mimic: pg-upmap cannot balance in some case added

#6 Updated by xie xingguo about 1 year ago

  • Status changed from Need Review to Pending Backport
  • Backport set to luminous

#7 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #24048: luminous: pg-upmap cannot balance in some case added

#8 Updated by Nathan Cutler about 1 year ago

  • Backport changed from luminous to luminous, mimic

#9 Updated by Nathan Cutler 12 months ago

  • Status changed from Pending Backport to Resolved
