Tasks #53622 (open)
User + Dev Monthly Meeting: Collect osdmaps for balancer testing
Description
Forthcoming balancer improvements could use more osdmaps for testing purposes.
If you would like to contribute yours, you may
1. attach your osdmap binary files (1000 KB max) to this Tracker ticket
or
2. use https://docs.ceph.com/en/latest/man/8/ceph-post-file/
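If it helps, one way to produce such a file is to extract the binary map from a live cluster and then attach or post it (the file name and description below are just placeholders):

    ceph osd getmap -o mycluster.osdmap
    ceph-post-file -d 'osdmap for balancer testing' mycluster.osdmap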
Updated by Neha Ojha over 2 years ago
- Project changed from Ceph to mgr
- Subject changed from User + Dev Monthly Meeting: Collect osdmaps for testing purposes to User + Dev Monthly Meeting: Collect osdmaps for balancer testing
- Description updated (diff)
- Category set to balancer module
Updated by Enrico Bocchi over 2 years ago
- File nethub.osdmap nethub.osdmap added
The attached osdmap is from a cluster unable to rebalance after the addition of one rack (4 hosts) of new hardware with a much higher weight than the existing racks. The cluster is used for S3 storage and uses an erasure-coded (4+2) pool for bucket data with the failure domain set to rack. The new rack ended up with a crush weight higher than 1/6 of the overall weight (0.278 vs 0.166), hence its full capacity could not be used due to the 4+2 EC and crush failure domain constraints.
The osd tree of the cluster is as follows:
  -1  8886.22266  root default
 -15  8886.22266      room 0773-R-0402
 -77   699.35577          rack HA06
 -76   174.83894              host cephnethub-data-08e0542d9e
-142   174.83894              host cephnethub-data-32672f0985
-106   174.83894              host cephnethub-data-62ead38716
-151   174.83894              host cephnethub-data-8b97064fe7
 -29   698.62939          rack HA07
 -28   174.65735              host cephnethub-data-0509dffff2
-154   174.65735              host cephnethub-data-8033235189
-145   174.65735              host cephnethub-data-a1951b6acc
-115   174.65735              host cephnethub-data-e2aedd6c61
-131   120.00000          rack HA08
-130   120.00000              host cephnethub-data-f1b6566110
-158  2472.74707          rack HA09
-157   618.18677              host cephnethub-data21-0f56b71981
-163   618.18677              host cephnethub-data21-69101c98c8
  -3   618.18677              host cephnethub-data21-becafc51ac
  -6   618.18677              host cephnethub-data21-db46980a1c
[...cut...]
with:
- HA06 being one of the racks with old hardware
- HA09 being the rack with new hardware and much higher weight
- the other 7 racks of old hardware are not shown; their crush weight is the same as HA06's
Under these circumstances, the balancer was struggling to build a plan and was not successful in moving data away from the most full osds.
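A back-of-the-envelope check of the capacity limit described above (numbers taken from this comment): with an EC k+m pool and failure domain rack, each PG places at most one of its k+m shards in any rack, so no rack can hold more than 1/(k+m) of the cluster's data regardless of its weight:

    awk 'BEGIN {
        k = 4; m = 2              # EC profile (4+2)
        w = 0.278                 # HA09 share of the total crush weight
        cap = 1 / (k + m)         # per-rack data cap, ~0.167
        printf "HA09 usable fraction: %.0f%%\n", 100 * cap / w
    }'

This gives roughly 60%, i.e. about 40% of the new rack's raw capacity cannot be used until more weight is added to the other racks.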
Updated by Andras Pataki over 2 years ago
On our large cluster, the whole upmap balancing mechanism fails; it looks like a bug in the Ceph C++ code (not the mgr balancer module).
Here is the tracker (which includes an osd map): https://tracker.ceph.com/issues/51729
As a consequence, I am using a completely different balancing mechanism that relies on crush weights only (essentially running a gradient-descent optimizer that simulates crush mappings for proposed crush weight changes).
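(Not Andras's actual tool, but a minimal offline sketch of one step of that approach using standard tooling: simulate a proposed crush weight change before applying it. osd.123 and the weight 3.2 are made-up values:)

    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt                               # optional: inspect as text
    crushtool -i crush.bin --reweight-item osd.123 3.2 -o crush.test  # propose a change
    crushtool -i crush.test --test --num-rep 3 --show-utilization     # simulate placement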
Updated by Konstantin Shalygin over 2 years ago
- File osd_loop.map osd_loop.map added
- File osdmap.7z osdmap.7z added
With this map, osdmaptool (Nautilus 14.2.22) loops over the following:
2021-12-17 22:16:28.247 7f00afeccc40 10 skipping overfull
2021-12-17 22:16:28.247 7f00afeccc40 10 failed to find any changes for overfull osds
2021-12-17 22:16:28.259 7f00afeccc40 10 will try dropping existing remapping pair 567 -> 456 which remapped 3.1229 out from underfull osd.567
2021-12-17 22:16:28.259 7f00afeccc40 10 existing pg_upmap_items [567,456] remapped 3.1229 out from underfull osd.567, will try cancelling it entirely
The command is:
osdmaptool osd.map --upmap-deviation 1 --upmap-max 10000 --upmap upmap.sh --debug_osd=20
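For anyone reproducing this with the attached map, a sketch (assuming the debug output goes to stderr, as Ceph console logging normally does):

    osdmaptool osd_loop.map --upmap-deviation 1 --upmap-max 10000 \
        --upmap upmap.sh --debug_osd=20 2>debug.log

When a plan does converge, the file written by --upmap contains plain ceph osd pg-upmap-items commands, so it can be applied with 'source upmap.sh'.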
Map attached,
Thanks
Updated by Jake Grimmett over 2 years ago
Two OSD maps uploaded from the LMB in Cambridge:
Archive cephfs cluster
20 nodes, 5.0 PiB, 450 x 8TB HDD, 128 x 16TB HDD, 4 x Intel Optane for MDS
ceph-post-file: e8bbc7bb-2f84-44dd-bd9e-26b24f6a4b13
Primary cephfs cluster
38 nodes, 4.6 PiB, 432 x 12TB HDD (+72 NVMe db/wal SSDs), 4 x Intel Optane OSD for MDS
ceph-post-file: 85ffe74b-6976-4c01-a55c-307cf5fde558
Many thanks
Jake
Updated by Bryan Stillwell almost 2 years ago
I'm seeing this problem on multiple clusters that each have 20 CephFS filesystems of varying sizes. Here is the data distribution on one of the clusters:
POOL                ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
cephfs01_data        3  512   24 TiB   99.94M   72 TiB  23.33  79 TiB
cephfs01_metadata    4   32   61 GiB   12.22M  182 GiB   0.07  79 TiB
cephfs02_data        5  512   17 TiB   94.32M   51 TiB  17.66  79 TiB
cephfs02_metadata    6   32   58 GiB   11.72M  175 GiB   0.07  79 TiB
cephfs03_data        7  512   13 TiB   88.21M   40 TiB  14.57  79 TiB
cephfs03_metadata    8   32   55 GiB   11.99M  166 GiB   0.07  79 TiB
cephfs04_data        9  512   17 TiB   94.21M   51 TiB  17.70  79 TiB
cephfs04_metadata   10   32   61 GiB   12.02M  182 GiB   0.07  79 TiB
cephfs05_data       11  512   15 TiB   88.47M   45 TiB  15.99  79 TiB
cephfs05_metadata   12   32   55 GiB   11.75M  165 GiB   0.07  79 TiB
cephfs06_data       13  512   12 TiB   97.00M   37 TiB  13.53  79 TiB
cephfs06_metadata   14   32   60 GiB   11.71M  179 GiB   0.07  79 TiB
cephfs07_data       15  512   16 TiB   96.39M   50 TiB  17.28  79 TiB
cephfs07_metadata   16   32   60 GiB   12.50M  179 GiB   0.07  79 TiB
cephfs08_data       17  512   13 TiB   89.43M   40 TiB  14.41  79 TiB
cephfs08_metadata   18   32   56 GiB   12.01M  169 GiB   0.07  79 TiB
cephfs09_data       19  512   16 TiB  109.49M   49 TiB  17.25  79 TiB
cephfs09_metadata   20   32   66 GiB   13.24M  199 GiB   0.08  79 TiB
cephfs10_data       21  512   13 TiB  100.06M   39 TiB  14.10  79 TiB
cephfs10_metadata   22   32   61 GiB   12.12M  183 GiB   0.08  79 TiB
cephfs11_data       23  512   11 TiB  107.09M   34 TiB  12.48  79 TiB
cephfs11_metadata   24   32   65 GiB   12.37M  194 GiB   0.08  79 TiB
cephfs12_data       25  512   14 TiB  105.16M   44 TiB  15.49  79 TiB
cephfs12_metadata   26   32   64 GiB   12.45M  192 GiB   0.08  79 TiB
cephfs13_data       27  512   18 TiB  107.92M   55 TiB  18.86  79 TiB
cephfs13_metadata   28   32   65 GiB   12.59M  194 GiB   0.08  79 TiB
cephfs14_data       29  512   13 TiB   93.04M   40 TiB  14.36  79 TiB
cephfs14_metadata   30   32   56 GiB   11.23M  169 GiB   0.07  79 TiB
cephfs15_data       31  512  9.7 TiB   98.24M   30 TiB  11.09  79 TiB
cephfs15_metadata   32   32   60 GiB   11.61M  179 GiB   0.07  79 TiB
cephfs16_data       33  512  7.0 TiB   68.55M   21 TiB   8.29  79 TiB
cephfs16_metadata   34   32   43 GiB    9.64M  129 GiB   0.05  79 TiB
cephfs17_data       35  512  5.6 TiB   65.69M   17 TiB   6.76  79 TiB
cephfs17_metadata   36   32   41 GiB    8.78M  123 GiB   0.05  79 TiB
cephfs18_data       37  512  6.7 TiB   76.94M   21 TiB   7.99  79 TiB
cephfs18_metadata   38   32   53 GiB   16.07M  159 GiB   0.07  79 TiB
cephfs19_data       39  512   12 TiB   65.40M   36 TiB  13.12  79 TiB
cephfs19_metadata   40   32   39 GiB    8.08M  118 GiB   0.05  79 TiB
cephfs20_data       41  512  7.6 TiB   65.21M   23 TiB   8.93  79 TiB
cephfs20_metadata   42   32   40 GiB    8.10M  120 GiB   0.05  79 TiB
This cluster has 176 x 7.68TB OSDs, but even with upmap_max_deviation set to 3 we're seeing quite a spread in disk usage:
Least full OSD: 50.23% (156 PGs)
Most full OSD: 70.19% (214 PGs)
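(A sketch for pulling those two numbers from a running cluster, assuming the field names that ceph osd df emits in JSON mode:)

    ceph osd df -f json | jq -r '
      .nodes | (min_by(.utilization), max_by(.utilization))
             | "\(.name): \(.utilization)% (\(.pgs) PGs)"'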
I've attached the output of 'osdmaptool <your osdmap> --test-map-pgs-dump --pool <pool id>' for each of the cephfs data pools.
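Something along these lines reproduces that output (osdmap.bin stands in for the extracted map; the data pools are the odd IDs 3-41 in the table above):

    for pool in $(seq 3 2 41); do
        osdmaptool osdmap.bin --test-map-pgs-dump --pool $pool > pgs-pool-$pool.txt
    done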