Tasks #53622

User + Dev Monthly Meeting: Collect osdmaps for balancer testing

Added by Laura Flores 12 months ago. Updated 4 months ago.

Status: New
Priority: Normal
Assignee: -
Category: balancer module
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

Forthcoming balancer improvements could use more osdmaps for testing purposes.
If you would like to contribute yours, you may

1. attach your osdmap binary files (1000 KB max) to this Tracker ticket

or

2. use https://docs.ceph.com/en/latest/man/8/ceph-post-file/
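For reference, a minimal sketch of how such a map could be produced and uploaded (assuming admin access to a cluster; the description string is illustrative — see the ceph-post-file man page linked above for details):

```shell
# Export the cluster's current osdmap to a binary file,
# then either attach it to the ticket or upload it with ceph-post-file.
ceph osd getmap -o osdmap.bin
ceph-post-file -d "osdmap for balancer testing (tracker #53622)" osdmap.bin
```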

nethub.osdmap (726 KB) Enrico Bocchi, 12/17/2021 04:23 PM

osd_loop.map - osd.map (372 KB) Konstantin Shalygin, 12/17/2021 07:25 PM

osdmap.7z - log (106MiB) (942 KB) Konstantin Shalygin, 12/17/2021 07:26 PM

osdmap-godaddy-20220721-pools.tar.xz - test-map-pgs-dump for each of the cephfs data pools (127 KB) Bryan Stillwell, 07/22/2022 05:31 PM

History

#1 Updated by Neha Ojha 12 months ago

  • Project changed from Ceph to mgr
  • Subject changed from User + Dev Monthly Meeting: Collect osdmaps for testing purposes to User + Dev Monthly Meeting: Collect osdmaps for balancer testing
  • Description updated (diff)
  • Category set to balancer module

#2 Updated by Neha Ojha 12 months ago

  • Description updated (diff)

#3 Updated by Enrico Bocchi 12 months ago

The attached osdmap is from a cluster that was unable to rebalance after the addition of one rack (4 hosts) of new hardware with a much higher weight than the existing racks. The cluster is used for S3 storage and has an erasure-coded (4+2) pool for bucket data with the failure domain set to rack. The new rack ended up with a CRUSH weight higher than 1/6 of the overall weight (0.278 vs 0.166), so its full capacity could not be used due to the 4+2 EC and CRUSH failure-domain constraints.

The osd tree of the cluster is as follows:

  -1       8886.22266 root default                                                          
 -15       8886.22266     room 0773-R-0402                                                  
 -77        699.35577         rack HA06                                                     
 -76        174.83894             host cephnethub-data-08e0542d9e                           
-142        174.83894             host cephnethub-data-32672f0985                           
-106        174.83894             host cephnethub-data-62ead38716                           
-151        174.83894             host cephnethub-data-8b97064fe7                           
 -29        698.62939         rack HA07                                                     
 -28        174.65735             host cephnethub-data-0509dffff2                           
-154        174.65735             host cephnethub-data-8033235189                           
-145        174.65735             host cephnethub-data-a1951b6acc                           
-115        174.65735             host cephnethub-data-e2aedd6c61                           
-131        120.00000         rack HA08                                                     
-130        120.00000             host cephnethub-data-f1b6566110                           
-158       2472.74707         rack HA09                                                     
-157        618.18677             host cephnethub-data21-0f56b71981                         
-163        618.18677             host cephnethub-data21-69101c98c8                         
  -3        618.18677             host cephnethub-data21-becafc51ac                         
  -6        618.18677             host cephnethub-data21-db46980a1c        
[...cut...]

with:
- HA06 being one of the racks with old hardware
- HA09 being the rack with new hardware and a much higher weight
- the other 7 racks of old hardware (not shown) having the same CRUSH weight as HA06

Under these circumstances, the balancer was struggling to build a plan and was not successful in moving data away from the most full osds.
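The arithmetic behind the constraint described above can be sketched as follows (not from the ticket; weights taken from the osd tree in this comment). With a 4+2 EC pool and failure domain = rack, each PG places its 6 shards in 6 different racks, so no single rack can hold more than 1/6 of the data:

```python
# Sketch: why a rack heavier than 1/6 of the cluster cannot be filled
# under a 4+2 EC profile with failure domain = rack.

ec_shards = 4 + 2                  # k + m shards per PG, one per rack
max_rack_fraction = 1 / ec_shards  # ~0.166: hard cap per rack

total_weight = 8886.22266          # root default, from the osd tree above
new_rack_weight = 2472.74707       # rack HA09 (new hardware)

fraction = new_rack_weight / total_weight
print(f"HA09 weight fraction: {fraction:.3f} vs limit {max_rack_fraction:.3f}")

# Weight in HA09 that cannot be used while the EC constraint holds:
stranded = new_rack_weight - max_rack_fraction * total_weight
print(f"effectively unusable weight in HA09: {stranded:.1f}")
```

This matches the 0.278 vs 0.166 figures in the comment, and explains why no upmap plan can move the overfull OSDs' data into the remaining headroom.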

#4 Updated by Andras Pataki 12 months ago

On our large cluster, the whole upmap balancing mechanism fails; it looks like a bug in the Ceph C++ code (not the mgr balancer module).
Here is the tracker (which includes an osdmap): https://tracker.ceph.com/issues/51729
As a consequence, I am using a completely different balancing mechanism that relies on CRUSH weights only (essentially running a gradient-descent optimizer that simulates CRUSH maps for proposed CRUSH weight changes).

#5 Updated by Konstantin Shalygin 12 months ago

With this map, osdmaptool (Nautilus 14.2.22) loops over the following:

2021-12-17 22:16:28.247 7f00afeccc40 10  skipping overfull
2021-12-17 22:16:28.247 7f00afeccc40 10  failed to find any changes for overfull osds
2021-12-17 22:16:28.259 7f00afeccc40 10  will try dropping existing remapping pair 567 -> 456 which remapped 3.1229 out from underfull osd.567
2021-12-17 22:16:28.259 7f00afeccc40 10  existing pg_upmap_items [567,456] remapped 3.1229 out from underfull osd.567, will try cancelling it entirely

The command is:

osdmaptool osd.map --upmap-deviation 1 --upmap-max 10000 --upmap upmap.sh --debug_osd=20

Map attached.
Thanks

#6 Updated by Jake Grimmett 10 months ago

Two OSD maps uploaded from the LMB in Cambridge:

archive cephfs cluster
20 nodes, 5.0 PiB, 450 x 8 TB HDD, 128 x 16 TB HDD, 4 Intel Optane for MDS
ceph-post-file: e8bbc7bb-2f84-44dd-bd9e-26b24f6a4b13

Primary cephfs cluster
38 nodes, 4.6 PiB, 432 x 12 TB HDD (+72 NVMe db/wal SSDs), 4 x Intel Optane OSDs for MDS
ceph-post-file: 85ffe74b-6976-4c01-a55c-307cf5fde558

Many thanks

Jake

#7 Updated by Bryan Stillwell 4 months ago

I'm seeing this problem on multiple clusters that each have 20 CephFS filesystems of varying sizes. Here is the data distribution on one of the clusters (ceph df columns: pool, ID, PGs, stored, objects, used, %USED, max avail):

cephfs01_data           3  512   24 TiB   99.94M   72 TiB  23.33     79 TiB
cephfs01_metadata       4   32   61 GiB   12.22M  182 GiB   0.07     79 TiB
cephfs02_data           5  512   17 TiB   94.32M   51 TiB  17.66     79 TiB
cephfs02_metadata       6   32   58 GiB   11.72M  175 GiB   0.07     79 TiB
cephfs03_data           7  512   13 TiB   88.21M   40 TiB  14.57     79 TiB
cephfs03_metadata       8   32   55 GiB   11.99M  166 GiB   0.07     79 TiB
cephfs04_data           9  512   17 TiB   94.21M   51 TiB  17.70     79 TiB
cephfs04_metadata      10   32   61 GiB   12.02M  182 GiB   0.07     79 TiB
cephfs05_data          11  512   15 TiB   88.47M   45 TiB  15.99     79 TiB
cephfs05_metadata      12   32   55 GiB   11.75M  165 GiB   0.07     79 TiB
cephfs06_data          13  512   12 TiB   97.00M   37 TiB  13.53     79 TiB
cephfs06_metadata      14   32   60 GiB   11.71M  179 GiB   0.07     79 TiB
cephfs07_data          15  512   16 TiB   96.39M   50 TiB  17.28     79 TiB
cephfs07_metadata      16   32   60 GiB   12.50M  179 GiB   0.07     79 TiB
cephfs08_data          17  512   13 TiB   89.43M   40 TiB  14.41     79 TiB
cephfs08_metadata      18   32   56 GiB   12.01M  169 GiB   0.07     79 TiB
cephfs09_data          19  512   16 TiB  109.49M   49 TiB  17.25     79 TiB
cephfs09_metadata      20   32   66 GiB   13.24M  199 GiB   0.08     79 TiB
cephfs10_data          21  512   13 TiB  100.06M   39 TiB  14.10     79 TiB
cephfs10_metadata      22   32   61 GiB   12.12M  183 GiB   0.08     79 TiB
cephfs11_data          23  512   11 TiB  107.09M   34 TiB  12.48     79 TiB
cephfs11_metadata      24   32   65 GiB   12.37M  194 GiB   0.08     79 TiB
cephfs12_data          25  512   14 TiB  105.16M   44 TiB  15.49     79 TiB
cephfs12_metadata      26   32   64 GiB   12.45M  192 GiB   0.08     79 TiB
cephfs13_data          27  512   18 TiB  107.92M   55 TiB  18.86     79 TiB
cephfs13_metadata      28   32   65 GiB   12.59M  194 GiB   0.08     79 TiB
cephfs14_data          29  512   13 TiB   93.04M   40 TiB  14.36     79 TiB
cephfs14_metadata      30   32   56 GiB   11.23M  169 GiB   0.07     79 TiB
cephfs15_data          31  512  9.7 TiB   98.24M   30 TiB  11.09     79 TiB
cephfs15_metadata      32   32   60 GiB   11.61M  179 GiB   0.07     79 TiB
cephfs16_data          33  512  7.0 TiB   68.55M   21 TiB   8.29     79 TiB
cephfs16_metadata      34   32   43 GiB    9.64M  129 GiB   0.05     79 TiB
cephfs17_data          35  512  5.6 TiB   65.69M   17 TiB   6.76     79 TiB
cephfs17_metadata      36   32   41 GiB    8.78M  123 GiB   0.05     79 TiB
cephfs18_data          37  512  6.7 TiB   76.94M   21 TiB   7.99     79 TiB
cephfs18_metadata      38   32   53 GiB   16.07M  159 GiB   0.07     79 TiB
cephfs19_data          39  512   12 TiB   65.40M   36 TiB  13.12     79 TiB
cephfs19_metadata      40   32   39 GiB    8.08M  118 GiB   0.05     79 TiB
cephfs20_data          41  512  7.6 TiB   65.21M   23 TiB   8.93     79 TiB
cephfs20_metadata      42   32   40 GiB    8.10M  120 GiB   0.05     79 TiB

This cluster has 176 x 7.68 TB OSDs, but even with upmap_max_deviation set to 3 we're seeing quite a spread in disk usage:

Least full OSD: 50.23% (156 PGs)
Most full OSD: 70.19% (214 PGs)

I've attached the output of 'osdmaptool <your osdmap> --test-map-pgs-dump --pool <pool id>' for each of the cephfs data pools.
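One plausible reading of the spread reported above (a sketch using only the numbers in this comment, not data from the attached dumps): the upmap balancer enforces the deviation per pool, so with 40+ pools on the same OSDs, small per-pool deviations can accumulate into a much larger per-OSD total.

```python
# Numbers reported in this comment, across the 176 OSDs.
least_pgs, most_pgs = 156, 214
least_full, most_full = 50.23, 70.19   # percent used

pg_spread = most_pgs - least_pgs
usage_spread = most_full - least_full

print(f"PG count spread: {pg_spread} PGs")
print(f"utilization spread: {usage_spread:.2f} percentage points")

# If upmap_max_deviation bounds the per-pool deviation, ~40 pools at a
# deviation of 3 each could in the worst case stack up far beyond the
# observed 58-PG spread, so the observation is consistent with that model.
num_pools, max_deviation = 40, 3
print(f"worst-case accumulated deviation: up to ~{num_pools * max_deviation} PGs")
```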
