Backport #24026

Updated by Kefu Chai almost 6 years ago

https://github.com/ceph/ceph/pull/21835

I have a cluster with 21 OSDs. The cluster topology is:
 <pre> 
 ID    CLASS WEIGHT     TYPE NAME                 STATUS REWEIGHT PRI-AFF  
  -5         21.00000 root test                                        
  -7         11.00000       datacenter dc-1                              
  -9         11.00000           rack rack-1                              
 -11          5.00000               host host-1                          
   5     hdd    1.00000                   osd.5         up    0.50000 1.00000  
   6     hdd    1.00000                   osd.6         up    1.00000 1.00000  
   7     hdd    1.00000                   osd.7         up    1.00000 1.00000  
   8     hdd    1.00000                   osd.8         up    1.00000 1.00000  
   9     hdd    1.00000                   osd.9         up    1.00000 1.00000  
 -12          2.00000               host host-2                          
  16     hdd    1.00000                   osd.16        up    1.00000 1.00000  
  17     hdd    1.00000                   osd.17        up    1.00000 1.00000  
 -13          2.00000               host host-3                          
  15     hdd    1.00000                   osd.15        up    1.00000 1.00000  
  18     hdd    1.00000                   osd.18        up    1.00000 1.00000  
 -14          2.00000               host host-4                          
  19     hdd    1.00000                   osd.19        up    1.00000 1.00000  
  20     hdd    1.00000                   osd.20        up    1.00000 1.00000  
  -8         10.00000       datacenter dc-2                              
 -10         10.00000           rack rack-2                              
 -15          5.00000               host host-5                          
  10     hdd    1.00000                   osd.10        up    1.00000 1.00000  
  11     hdd    1.00000                   osd.11        up    1.00000 1.00000  
  12     hdd    1.00000                   osd.12        up    1.00000 1.00000  
  13     hdd    1.00000                   osd.13        up    1.00000 1.00000  
  14     hdd    1.00000                   osd.14        up    1.00000 1.00000  
 -16          5.00000               host host-6                          
   0     hdd    1.00000                   osd.0         up    1.00000 1.00000  
   1     hdd    1.00000                   osd.1         up    1.00000 1.00000  
   2     hdd    1.00000                   osd.2         up    1.00000 1.00000  
   3     hdd    1.00000                   osd.3         up    1.00000 1.00000 
   4     hdd    1.00000                   osd.4         up    1.00000 1.00000 
  -1         21.00000 root default                                     
  -2         21.00000       host huangjun                                
   0     hdd    1.00000           osd.0                 up    1.00000 1.00000  
   1     hdd    1.00000           osd.1                 up    1.00000 1.00000  
   2     hdd    1.00000           osd.2                 up    1.00000 1.00000  
   3     hdd    1.00000           osd.3                 up    1.00000 1.00000  
   4     hdd    1.00000           osd.4                 up    1.00000 1.00000  
   5     hdd    1.00000           osd.5                 up    0.50000 1.00000  
   6     hdd    1.00000           osd.6                 up    1.00000 1.00000  
   7     hdd    1.00000           osd.7                 up    1.00000 1.00000  
   8     hdd    1.00000           osd.8                 up    1.00000 1.00000  
   9     hdd    1.00000           osd.9                 up    1.00000 1.00000  
  10     hdd    1.00000           osd.10                up    1.00000 1.00000  
  11     hdd    1.00000           osd.11                up    1.00000 1.00000  
  12     hdd    1.00000           osd.12                up    1.00000 1.00000  
  13     hdd    1.00000           osd.13                up    1.00000 1.00000  
  14     hdd    1.00000           osd.14                up    1.00000 1.00000  
  15     hdd    1.00000           osd.15                up    1.00000 1.00000  
  16     hdd    1.00000           osd.16                up    1.00000 1.00000  
  17     hdd    1.00000           osd.17                up    1.00000 1.00000  
  18     hdd    1.00000           osd.18                up    1.00000 1.00000  
  19     hdd    1.00000           osd.19                up    1.00000 1.00000  
  20     hdd    1.00000           osd.20                up    1.00000 1.00000   
 </pre> 

 I created a pool with 1024 PGs and a replicated size of 2. 
 After the remap, the output of ceph osd df shows no change: 
 <pre> 
 ID CLASS WEIGHT  REWEIGHT SIZE USE    AVAIL %USE VAR  PGS 
  5   hdd 1.00000  0.50000 981M 34176k  948M 3.40 1.00  40 
  6   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00  99 
  7   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00 109 
  8   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00 121 
  9   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00  95 
 16   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00  82 
 17   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00  91 
 15   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00  95 
 18   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00  93 
 19   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00 100 
 20   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00  99 
 10   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00  85 
 11   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00  94 
 12   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00  81 
 13   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00 118 
 14   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00 102 
  0   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00 107 
  1   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00 113 
  2   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00 106 
  3   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00 110 
  4   hdd 1.00000  1.00000 981M 34176k  948M 3.40 1.00 108 
 </pre> 

 I checked the log: 
 <pre> 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.0 weight 0.1 pgs 107 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.1 weight 0.1 pgs 113 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.2 weight 0.1 pgs 106 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.3 weight 0.1 pgs 110 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.4 weight 0.1 pgs 108 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.5 weight 0.0454545 pgs 40 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.6 weight 0.0909091 pgs 99 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.7 weight 0.0909091 pgs 109 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.8 weight 0.0909091 pgs 121 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.9 weight 0.0909091 pgs 95 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.10 weight 0.1 pgs 85 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.11 weight 0.1 pgs 94 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.12 weight 0.1 pgs 81 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.13 weight 0.1 pgs 118 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.14 weight 0.1 pgs 102 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.15 weight 0.0909091 pgs 95 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.16 weight 0.0909091 pgs 82 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.17 weight 0.0909091 pgs 91 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.18 weight 0.0909091 pgs 93 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.19 weight 0.0909091 pgs 100 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.20 weight 0.0909091 pgs 99 
 2018-04-28 11:50:39.661 7f87a8cfd700 10    osd_weight_total 1.95455 
 2018-04-28 11:50:39.661 7f87a8cfd700 10    pgs_per_weight 1047.81 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.0    pgs 107 target 104.781    deviation 2.21863 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.1    pgs 113 target 104.781    deviation 8.21863 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.2    pgs 106 target 104.781    deviation 1.21863 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.3    pgs 110 target 104.781    deviation 5.21863 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.4    pgs 108 target 104.781    deviation 3.21863 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.5    pgs 40    target 47.6279    deviation -7.6279 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.6    pgs 99    target 95.2558    deviation 3.7442 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.7    pgs 109 target 95.2558    deviation 13.7442 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.8    pgs 121 target 95.2558    deviation 25.7442 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.9    pgs 95    target 95.2558    deviation -0.255798 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.10 pgs 85    target 104.781    deviation -19.7814 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.11 pgs 94    target 104.781    deviation -10.7814 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.12 pgs 81    target 104.781    deviation -23.7814 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.13 pgs 118 target 104.781    deviation 13.2186 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.14 pgs 102 target 104.781    deviation -2.78137 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.15 pgs 95    target 95.2558    deviation -0.255798 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.16 pgs 82    target 95.2558    deviation -13.2558 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.17 pgs 91    target 95.2558    deviation -4.2558 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.18 pgs 93    target 95.2558    deviation -2.2558 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.19 pgs 100 target 95.2558    deviation 4.7442 
 2018-04-28 11:50:39.661 7f87a8cfd700 20    osd.20 pgs 99    target 95.2558    deviation 3.7442 
 2018-04-28 11:50:39.661 7f87a8cfd700 10    total_deviation 170.065 overfull 0,1,2,3,4,6,7,8,13,19,20 underfull [12,10,16,11,5,17,14,18] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10    osd.8 move 25 
 2018-04-28 11:50:39.661 7f87a8cfd700 10     trying 1.0 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 try_pg_upmap 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule ruleno 1 numrep 2 overfull 0,1,2,3,4,6,7,8,13,19,20 underfull [12,10,16,11,5,17,14,18] orig [8,13] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule step 0 w [] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule take [-9] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule step 1 w [-9] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack stack [1,1,0,1] orig [8,13] at 8 pw [-9] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack cumulative_fanout [1,1] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 12 type 1 is -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 10 type 1 is -15 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 16 type 1 is -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 11 type 1 is -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 5 type 1 is -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 17 type 1 is -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 14 type 1 is -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 18 type 1 is -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 20 _choose_type_stack underfull_buckets [-15,-2] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10    level 0: type 1 fanout 1 cumulative 1 w [-9] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10    from -9 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack     from 13 got -2 of type 1 over leaves 8 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack    w <- [-2] was [-9] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10    level 1: type 0 fanout 1 cumulative 1 w [-2] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10    from -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack pos 0 was 8 considering 12 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack pos 0 replace 8 -> 12 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack    w <- [12] was [-2] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule step 2 w [12] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10    emit [12] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule step 3 w [] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule take [-10] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule step 4 w [-10] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack stack [1,1,0,1] orig [8,13] at 13 pw [-10] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack cumulative_fanout [1,1] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 12 type 1 is -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 10 type 1 is -15 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 16 type 1 is -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 11 type 1 is -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 5 type 1 is -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 17 type 1 is -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 14 type 1 is -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack underfull 18 type 1 is -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 20 _choose_type_stack underfull_buckets [-15,-2] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10    level 0: type 1 fanout 1 cumulative 1 w [-10] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10    from -10 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack     from -1142358840 got -2 of type 1 over leaves 13 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack    w <- [-2] was [-10] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10    level 1: type 0 fanout 1 cumulative 1 w [-2] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10    from -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack pos 0 was 13 considering 12 
 2018-04-28 11:50:39.661 7f87a8cfd700 20 _choose_type_stack     in used 12 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack pos 0 was 13 considering 10 
 2018-04-28 11:50:39.661 7f87a8cfd700 20 _choose_type_stack     not in subtree -2 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack pos 0 was 13 considering 16 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack pos 0 replace 13 -> 16 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack end of orig, break 1 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack end of orig, break 2 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 _choose_type_stack    w <- [16] was [-2] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 try_remap_rule step 5 w [16] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10    emit [16] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10 try_pg_upmap orig [8,13], out [12,16] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10     1.0 [8,13] -> [12,16] 
 2018-04-28 11:50:39.661 7f87a8cfd700 10     1.0 pg_upmap_items [8,12,13,16] 

 2018-04-28 11:50:39.716 7f87ab502700 10 maybe_remove_pg_upmaps 
 2018-04-28 11:50:39.716 7f87ab502700 10 maybe_remove_pg_upmaps pg 1.0 crush-rule-id 1 weight_map {0=0.1,1=0.1,2=0.1,3=0.1,4=0.1,5=0.0909091,6=0.0909091,7=0.0909091,8=0.0909091,9=0.0909091,10=0.1,11=0.1,12=0.1,13=0.1,14=0.1,15=0.0909091,16=0.0909091,17=0.0909091,18=0.0909091,19=0.0909091,20=0.0909091} failure-domain-type 1 
 2018-04-28 11:50:39.716 7f87ab502700 10 maybe_remove_pg_upmaps pg 1.0 osd 12 parent -2 
 2018-04-28 11:50:39.716 7f87ab502700 10 maybe_remove_pg_upmaps pg 1.0 osd 16 parent -2 
 2018-04-28 11:50:39.717 7f87ab502700 10 maybe_remove_pg_upmaps cancel invalid pending pg_upmap_items entry 1.0->[8,12,13,16] 
 </pre> 
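
 The balancing arithmetic at the start of the log can be checked by hand. The following is a minimal Python sketch, not Ceph's code: it copies the per-OSD weights and PG counts from the log, and assumes that the target is weight times pgs_per_weight, that the deviation is the PG count minus the target, and that an OSD counts as overfull or underfull when its deviation exceeds 1 PG in either direction. The function name balance_report and the threshold parameter are hypothetical. Under those assumptions it reproduces the logged pgs_per_weight, total_deviation, and both OSD lists.

```python
# Weights and PG counts as printed in the log above (osd id -> value).
# Writing the weights as fractions (1/10, 1/11, 0.5/11) is an assumption
# that matches the logged 0.1, 0.0909091 and 0.0454545.
weights = {osd: 1 / 10 for osd in list(range(0, 5)) + list(range(10, 15))}
weights.update({osd: 1 / 11 for osd in list(range(6, 10)) + list(range(15, 21))})
weights[5] = 0.5 / 11

pg_counts = [107, 113, 106, 110, 108, 40, 99, 109, 121, 95, 85,
             94, 81, 118, 102, 95, 82, 91, 93, 100, 99]
pgs = dict(enumerate(pg_counts))  # osd.0 .. osd.20


def balance_report(weights, pgs, threshold=1.0):
    """Recompute the figures the log prints (threshold is an assumed cutoff)."""
    pgs_per_weight = sum(pgs.values()) / sum(weights.values())
    # deviation = actual PG count minus the weight-proportional target
    deviation = {osd: pgs[osd] - weights[osd] * pgs_per_weight for osd in pgs}
    total_deviation = sum(abs(d) for d in deviation.values())
    overfull = sorted(osd for osd, d in deviation.items() if d > threshold)
    # most underfull OSD first, as in the log's underfull list
    underfull = sorted((osd for osd, d in deviation.items() if d < -threshold),
                       key=deviation.get)
    return pgs_per_weight, deviation, total_deviation, overfull, underfull


ppw, dev, total, over, under = balance_report(weights, pgs)
print(f"pgs_per_weight {ppw:.2f}")
print(f"osd.8 deviation {dev[8]:.4f}")
print(f"total_deviation {total:.3f}")
print("overfull", over, "underfull", under)
```

 The most overfull OSD is osd.8, which is why the balancer starts with PG 1.0 on [8,13].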

 PG 1.0 is remapped from [8,13] to [12,16]. 
 Under root bucket test, osd.12 and osd.16 are not in the same host, 
 but the parent lookup returns -2 for both of them, which is weird. Because the two OSDs appear to share a failure domain, the pending pg_upmap_items entry is cancelled. 
 The parent -2 is host huangjun under root default, which contains both osd.12 and osd.16 but is not used by pool 'test': 
 <pre> 
 pool 1 'test' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 104 lfor 0/102 flags hashpspool stripe_width 0 async_recovery_max_updates 200 osd_full_ratio 0.9 
 </pre> 
 The crush rule dump is: 
 <pre> 
 [ 
     { 
         "rule_id": 0, 
         "rule_name": "replicated_rule", 
         "ruleset": 0, 
         "type": 1, 
         "min_size": 1, 
         "max_size": 10, 
         "steps": [ 
             { 
                 "op": "take", 
                 "item": -1, 
                 "item_name": "default" 
             }, 
             { 
                 "op": "choose_firstn", 
                 "num": 0, 
                 "type": "osd" 
             }, 
             { 
                 "op": "emit" 
             } 
         ] 
     }, 
     { 
         "rule_id": 1, 
         "rule_name": "test", 
         "ruleset": 1, 
         "type": 1, 
         "min_size": 1, 
         "max_size": 10, 
         "steps": [ 
             { 
                 "op": "take", 
                 "item": -9, 
                 "item_name": "rack-1" 
             }, 
             { 
                 "op": "chooseleaf_firstn", 
                 "num": 1, 
                 "type": "host" 
             }, 
             { 
                 "op": "emit" 
             }, 
             { 
                 "op": "take", 
                 "item": -10, 
                 "item_name": "rack-2" 
             }, 
             { 
                 "op": "chooseleaf_firstn", 
                 "num": 1, 
                 "type": "host" 
             }, 
             { 
                 "op": "emit" 
             } 
         ] 
     } 
 ] 

 </pre>
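
 To illustrate the collision, here is a small Python sketch of the two lookup strategies. It is a simplified model built from the topology and rule dump above, not Ceph's implementation, and it does not claim to be the change made in the linked pull request. The names BUCKETS, naive_parent, hosts_under, and parent_in_subtree are hypothetical. The naive lookup returns the first host bucket that contains the OSD regardless of root, while the restricted lookup only searches hosts below the buckets named in the rule's take steps (-9 and -10).

```python
# bucket id -> (type, children); ids and membership taken from the topology above.
# Host huangjun (-2) is listed first to model a lookup that reaches it first.
BUCKETS = {
    -2: ("host", list(range(21))),       # host huangjun, root default
    -11: ("host", [5, 6, 7, 8, 9]),      # host-1
    -12: ("host", [16, 17]),             # host-2
    -13: ("host", [15, 18]),             # host-3
    -14: ("host", [19, 20]),             # host-4
    -15: ("host", [10, 11, 12, 13, 14]), # host-5
    -16: ("host", [0, 1, 2, 3, 4]),      # host-6
    -9: ("rack", [-11, -12, -13, -14]),  # rack-1
    -10: ("rack", [-15, -16]),           # rack-2
}


def naive_parent(osd):
    """Return the first host bucket containing osd, ignoring which root it is in."""
    for bucket_id, (bucket_type, children) in BUCKETS.items():
        if bucket_type == "host" and osd in children:
            return bucket_id
    return None


def hosts_under(bucket_id):
    """Collect every host bucket reachable from bucket_id."""
    bucket_type, children = BUCKETS[bucket_id]
    if bucket_type == "host":
        return [bucket_id]
    found = []
    for child in children:
        found.extend(hosts_under(child))
    return found


def parent_in_subtree(osd, take_roots):
    """Return the host containing osd, searching only below the rule's take buckets."""
    for root in take_roots:
        for host in hosts_under(root):
            if osd in BUCKETS[host][1]:
                return host
    return None


rule_takes = [-9, -10]  # the two take steps of rule 'test'
for osd in (12, 16):
    print(osd, naive_parent(osd), parent_in_subtree(osd, rule_takes))
```

 In this model the naive lookup places osd.12 and osd.16 under the same host (-2), so the upmap looks invalid, whereas the lookup restricted to the rule's subtrees places them under host-5 (-15) and host-2 (-12).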
