Tasks #53622

User + Dev Monthly Meeting: Collect osdmaps for balancer testing

Added by Laura Flores 12 months ago. Updated 4 months ago.

Status: New
Priority: Normal
Assignee: -
Category: balancer module
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

Forthcoming balancer improvements could use more osdmaps for testing purposes.
If you would like to contribute yours, you may

1. attach your osdmap binary files (1000 KB max) to this Tracker ticket

or

2. use https://docs.ceph.com/en/latest/man/8/ceph-post-file/
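For reference, a minimal sketch of how such a map could be produced and uploaded (assuming admin access to a cluster; the description string is illustrative — see the ceph-post-file man page linked above for details):

```shell
# Export the cluster's current osdmap to a binary file,
# then either attach it to the ticket or upload it with ceph-post-file.
ceph osd getmap -o osdmap.bin
ceph-post-file -d "osdmap for balancer testing (tracker #53622)" osdmap.bin
```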

nethub.osdmap (726 KB) Enrico Bocchi, 12/17/2021 04:23 PM

osd_loop.map - osd.map (372 KB) Konstantin Shalygin, 12/17/2021 07:25 PM

osdmap.7z - log (106MiB) (942 KB) Konstantin Shalygin, 12/17/2021 07:26 PM

osdmap-godaddy-20220721-pools.tar.xz - test-map-pgs-dump for each of the cephfs data pools (127 KB) Bryan Stillwell, 07/22/2022 05:31 PM

History

#1 Updated by Neha Ojha 12 months ago

  • Project changed from Ceph to mgr
  • Subject changed from User + Dev Monthly Meeting: Collect osdmaps for testing purposes to User + Dev Monthly Meeting: Collect osdmaps for balancer testing
  • Description updated (diff)
  • Category set to balancer module

#2 Updated by Neha Ojha 12 months ago

  • Description updated (diff)

#3 Updated by Enrico Bocchi 12 months ago

The attached osdmap is from a cluster that was unable to rebalance after the addition of one rack (4 hosts) of new hardware with a much higher weight than the existing racks. The cluster is used for S3 storage and has an erasure-coded (4+2) pool for bucket data with the failure domain set to rack. The new rack ended up with a CRUSH weight higher than 1/6 of the overall weight (0.278 vs 0.166), so its full capacity could not be used due to the 4+2 EC and CRUSH failure-domain constraints.

The osd tree of the cluster is as follows:

  -1       8886.22266 root default                                                          
 -15       8886.22266     room 0773-R-0402                                                  
 -77        699.35577         rack HA06                                                     
 -76        174.83894             host cephnethub-data-08e0542d9e                           
-142        174.83894             host cephnethub-data-32672f0985                           
-106        174.83894             host cephnethub-data-62ead38716                           
-151        174.83894             host cephnethub-data-8b97064fe7                           
 -29        698.62939         rack HA07                                                     
 -28        174.65735             host cephnethub-data-0509dffff2                           
-154        174.65735             host cephnethub-data-8033235189                           
-145        174.65735             host cephnethub-data-a1951b6acc                           
-115        174.65735             host cephnethub-data-e2aedd6c61                           
-131        120.00000         rack HA08                                                     
-130        120.00000             host cephnethub-data-f1b6566110                           
-158       2472.74707         rack HA09                                                     
-157        618.18677             host cephnethub-data21-0f56b71981                         
-163        618.18677             host cephnethub-data21-69101c98c8                         
  -3        618.18677             host cephnethub-data21-becafc51ac                         
  -6        618.18677             host cephnethub-data21-db46980a1c        
[...cut...]

with:
- HA06 being one of the racks with old hardware
- HA09 being the rack with new hardware and a much higher weight
- the other 7 racks of old hardware (not shown) having the same CRUSH weight as HA06

Under these circumstances, the balancer was struggling to build a plan and was not successful in moving data away from the most full osds.
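The arithmetic behind the constraint described above can be sketched as follows (not from the ticket; weights taken from the osd tree in this comment). With a 4+2 EC pool and failure domain = rack, each PG places its 6 shards in 6 different racks, so no single rack can hold more than 1/6 of the data:

```python
# Sketch: why a rack heavier than 1/6 of the cluster cannot be filled
# under a 4+2 EC profile with failure domain = rack.

ec_shards = 4 + 2                  # k + m shards per PG, one per rack
max_rack_fraction = 1 / ec_shards  # ~0.166: hard cap per rack

total_weight = 8886.22266          # root default, from the osd tree above
new_rack_weight = 2472.74707       # rack HA09 (new hardware)

fraction = new_rack_weight / total_weight
print(f"HA09 weight fraction: {fraction:.3f} vs limit {max_rack_fraction:.3f}")

# Weight in HA09 that cannot be used while the EC constraint holds:
stranded = new_rack_weight - max_rack_fraction * total_weight
print(f"effectively unusable weight in HA09: {stranded:.1f}")
```

This matches the 0.278 vs 0.166 figures in the comment, and explains why no upmap plan can move the overfull OSDs' data into the remaining headroom.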

#4 Updated by Andras Pataki 12 months ago

On our large cluster, the whole upmap balancing mechanism fails; it looks like a bug in the Ceph C++ code (not the mgr balancer module).
Here is the tracker (which includes an osdmap): https://tracker.ceph.com/issues/51729
As a consequence, I am using a completely different balancing mechanism that relies on CRUSH weights only (essentially running a gradient-descent optimizer that simulates CRUSH maps for proposed CRUSH weight changes).

#5 Updated by Konstantin Shalygin 12 months ago

With this map, osdmaptool (Nautilus 14.2.22) loops over the following:

2021-12-17 22:16:28.247 7f00afeccc40 10  skipping overfull
2021-12-17 22:16:28.247 7f00afeccc40 10  failed to find any changes for overfull osds
2021-12-17 22:16:28.259 7f00afeccc40 10  will try dropping existing remapping pair 567 -> 456 which remapped 3.1229 out from underfull osd.567
2021-12-17 22:16:28.259 7f00afeccc40 10  existing pg_upmap_items [567,456] remapped 3.1229 out from underfull osd.567, will try cancelling it entirely

The command is:

osdmaptool osd.map --upmap-deviation 1 --upmap-max 10000 --upmap upmap.sh --debug_osd=20

Map attached.
Thanks

#6 Updated by Jake Grimmett 10 months ago

Two OSD maps uploaded from the LMB in Cambridge:

archive cephfs cluster
20 nodes, 5.0 PiB, 450 x 8 TB HDD, 128 x 16 TB HDD, 4 Intel Optane for MDS
ceph-post-file: e8bbc7bb-2f84-44dd-bd9e-26b24f6a4b13

Primary cephfs cluster
38 nodes, 4.6 PiB, 432 x 12 TB HDD (+72 NVMe db/wal SSDs), 4 x Intel Optane OSDs for MDS
ceph-post-file: 85ffe74b-6976-4c01-a55c-307cf5fde558

Many thanks

Jake

#7 Updated by Bryan Stillwell 4 months ago

I'm seeing this problem on multiple clusters that each have 20 CephFS filesystems of varying sizes. Here is the data distribution on one of the clusters (ceph df columns: pool, ID, PGs, stored, objects, used, %USED, max avail):

cephfs01_data           3  512   24 TiB   99.94M   72 TiB  23.33     79 TiB
cephfs01_metadata       4   32   61 GiB   12.22M  182 GiB   0.07     79 TiB
cephfs02_data           5  512   17 TiB   94.32M   51 TiB  17.66     79 TiB
cephfs02_metadata       6   32   58 GiB   11.72M  175 GiB   0.07     79 TiB
cephfs03_data           7  512   13 TiB   88.21M   40 TiB  14.57     79 TiB
cephfs03_metadata       8   32   55 GiB   11.99M  166 GiB   0.07     79 TiB
cephfs04_data           9  512   17 TiB   94.21M   51 TiB  17.70     79 TiB
cephfs04_metadata      10   32   61 GiB   12.02M  182 GiB   0.07     79 TiB
cephfs05_data          11  512   15 TiB   88.47M   45 TiB  15.99     79 TiB
cephfs05_metadata      12   32   55 GiB   11.75M  165 GiB   0.07     79 TiB
cephfs06_data          13  512   12 TiB   97.00M   37 TiB  13.53     79 TiB
cephfs06_metadata      14   32   60 GiB   11.71M  179 GiB   0.07     79 TiB
cephfs07_data          15  512   16 TiB   96.39M   50 TiB  17.28     79 TiB
cephfs07_metadata      16   32   60 GiB   12.50M  179 GiB   0.07     79 TiB
cephfs08_data          17  512   13 TiB   89.43M   40 TiB  14.41     79 TiB
cephfs08_metadata      18   32   56 GiB   12.01M  169 GiB   0.07     79 TiB
cephfs09_data          19  512   16 TiB  109.49M   49 TiB  17.25     79 TiB
cephfs09_metadata      20   32   66 GiB   13.24M  199 GiB   0.08     79 TiB
cephfs10_data          21  512   13 TiB  100.06M   39 TiB  14.10     79 TiB
cephfs10_metadata      22   32   61 GiB   12.12M  183 GiB   0.08     79 TiB
cephfs11_data          23  512   11 TiB  107.09M   34 TiB  12.48     79 TiB
cephfs11_metadata      24   32   65 GiB   12.37M  194 GiB   0.08     79 TiB
cephfs12_data          25  512   14 TiB  105.16M   44 TiB  15.49     79 TiB
cephfs12_metadata      26   32   64 GiB   12.45M  192 GiB   0.08     79 TiB
cephfs13_data          27  512   18 TiB  107.92M   55 TiB  18.86     79 TiB
cephfs13_metadata      28   32   65 GiB   12.59M  194 GiB   0.08     79 TiB
cephfs14_data          29  512   13 TiB   93.04M   40 TiB  14.36     79 TiB
cephfs14_metadata      30   32   56 GiB   11.23M  169 GiB   0.07     79 TiB
cephfs15_data          31  512  9.7 TiB   98.24M   30 TiB  11.09     79 TiB
cephfs15_metadata      32   32   60 GiB   11.61M  179 GiB   0.07     79 TiB
cephfs16_data          33  512  7.0 TiB   68.55M   21 TiB   8.29     79 TiB
cephfs16_metadata      34   32   43 GiB    9.64M  129 GiB   0.05     79 TiB
cephfs17_data          35  512  5.6 TiB   65.69M   17 TiB   6.76     79 TiB
cephfs17_metadata      36   32   41 GiB    8.78M  123 GiB   0.05     79 TiB
cephfs18_data          37  512  6.7 TiB   76.94M   21 TiB   7.99     79 TiB
cephfs18_metadata      38   32   53 GiB   16.07M  159 GiB   0.07     79 TiB
cephfs19_data          39  512   12 TiB   65.40M   36 TiB  13.12     79 TiB
cephfs19_metadata      40   32   39 GiB    8.08M  118 GiB   0.05     79 TiB
cephfs20_data          41  512  7.6 TiB   65.21M   23 TiB   8.93     79 TiB
cephfs20_metadata      42   32   40 GiB    8.10M  120 GiB   0.05     79 TiB

This cluster has 176 x 7.68 TB OSDs, but even with upmap_max_deviation set to 3 we're seeing quite a spread in disk usage:

Least full OSD: 50.23% (156 PGs)
Most full OSD: 70.19% (214 PGs)

I've attached the output of 'osdmaptool <your osdmap> --test-map-pgs-dump --pool <pool id>' for each of the cephfs data pools.
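One plausible reading of the spread reported above (a sketch using only the numbers in this comment, not data from the attached dumps): the upmap balancer enforces the deviation per pool, so with 40+ pools on the same OSDs, small per-pool deviations can accumulate into a much larger per-OSD total.

```python
# Numbers reported in this comment, across the 176 OSDs.
least_pgs, most_pgs = 156, 214
least_full, most_full = 50.23, 70.19   # percent used

pg_spread = most_pgs - least_pgs
usage_spread = most_full - least_full

print(f"PG count spread: {pg_spread} PGs")
print(f"utilization spread: {usage_spread:.2f} percentage points")

# If upmap_max_deviation bounds the per-pool deviation, ~40 pools at a
# deviation of 3 each could in the worst case stack up far beyond the
# observed 58-PG spread, so the observation is consistent with that model.
num_pools, max_deviation = 40, 3
print(f"worst-case accumulated deviation: up to ~{num_pools * max_deviation} PGs")
```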
