Bug #64715

osdmap offline optimization and balancer should take bluestore_min_alloc_size into account

Added by Alexander Patrakov about 2 months ago.

Status: New
Priority: Normal
Assignee: -
Category: OSDMap
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have access to a cluster and tried to perform offline OSDmap optimization as described here: https://docs.ceph.com/en/latest/rados/operations/upmap/#offline-optimization

Well, in retrospect, I could also have let the balancer do its job, with the same result: some HDDs end up 78% full while others are 63-65% full, with nothing in between, which is obviously not good.

Today this was root-caused. The cluster is quite old: it was initially deployed when the default bluestore_min_alloc_size for HDDs was 65536, and it was expanded later. The newer OSDs have bluestore_min_alloc_size=4096.

It turns out that there is a significant overhead from the large value of bluestore_min_alloc_size, and the balancer does not take it into account. For example, an OSD that has a total of 56613739 objects in all PGs would have 1.7 TB of overhead with bluestore_min_alloc_size=65536, but only 100 GB of overhead with bluestore_min_alloc_size=4096.
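\
For reference, the back-of-the-envelope arithmetic behind those numbers, assuming that on average half of an object's last allocation unit is wasted (the object count is the one quoted above):

```python
# Rough per-OSD allocation overhead: each object wastes, on average, about
# half an allocation unit in its last extent, so
#   overhead ~= num_objects * bluestore_min_alloc_size / 2
num_objects = 56_613_739

overhead_64k = num_objects * 65536 // 2   # ~1.86e12 bytes, i.e. ~1.7 TiB
overhead_4k  = num_objects * 4096 // 2    # ~1.16e11 bytes, i.e. ~108 GiB

print(f"64K min_alloc_size: {overhead_64k / 2**40:.2f} TiB of overhead")
print(f" 4K min_alloc_size: {overhead_4k / 2**30:.2f} GiB of overhead")
```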

I would expect the balancer to take bluestore_min_alloc_size into account as follows (a rough sketch in code follows the list):

1. Figure out the number of objects on each OSD (total in all pools), either in the current situation or assuming perfect PG balance in all pools - I am not sure which way is better.
2. Calculate the overhead on every OSD by multiplying the number of objects by bluestore_min_alloc_size/2.
3. Effectively subtract the overheads from the OSD CRUSH weights. <-- This is the missing step
4. Do whatever is currently done to improve the balance, but with effectively adjusted CRUSH weights.
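\
A rough Python sketch of what steps 2-4 could look like. None of this is an existing Ceph or balancer API: osd_objects, osd_weight_tib, and osd_min_alloc are hypothetical inputs that the balancer would have to gather (object counts from PG stats, min_alloc_size from per-OSD metadata), and the example numbers are only illustrative, loosely based on the attached ceph osd df output:

```python
TIB = 2 ** 40

def effective_weights(osd_objects, osd_weight_tib, osd_min_alloc):
    """Return CRUSH weights adjusted for bluestore_min_alloc_size overhead.

    osd_objects:    osd id -> number of objects (current, or assuming perfect
                    PG balance; step 1)
    osd_weight_tib: osd id -> raw CRUSH weight in TiB
    osd_min_alloc:  osd id -> bluestore_min_alloc_size in bytes
    """
    adjusted = {}
    for osd, weight in osd_weight_tib.items():
        # Step 2: expected overhead in TiB.
        overhead = osd_objects[osd] * osd_min_alloc[osd] / 2 / TIB
        # Step 3: subtract the overhead from the weight; the existing
        # optimization (step 4) would then run against these values.
        adjusted[osd] = max(weight - overhead, 0.0)
    return adjusted

# Illustrative example: two 14 TiB HDDs with the same object count but
# different allocation unit sizes.
print(effective_weights(
    osd_objects={221: 56_613_739, 223: 56_613_739},
    osd_weight_tib={221: 14.0, 223: 14.0},
    osd_min_alloc={221: 4096, 223: 65536},
))
# {221: ~13.9, 223: ~12.3} -> osd.223 should end up with fewer PGs
```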

I have not checked that this will actually work, however. I did check that the overhead calculated this way approximately matches the difference in the fill levels of the old and new OSDs.

I will attach a few files that demonstrate the issue, but please note that I am not allowed to publicly post the full osdmap - or in fact anything that contains pool names, non-generic server names, or UUIDs and thus can be used to identify the cluster.

  • ceph osd df (please ignore the first bunch of OSDs with only 0.75% utilization - they are outside of the CRUSH root, waiting for an "ok" to be placed in the proper hierarchy)
  • ceph pg ls-by-osd 221
  • ceph pg ls-by-osd 223

osd.221 was recently redeployed with 4K bluestore_min_alloc_size, while osd.223 still has 64K. Note that, despite virtually identical PG composition (just one PG differs), the used space differs by 1.9 TB out of 14.

My point here is that, because of the overhead, the ideal number of PGs on osd.221 and osd.223 should be different.
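\
As a back-of-the-envelope illustration, treating the observed 1.9 TB of extra overhead as capacity that osd.223 effectively does not have (the 14 TB figure is the approximate drive size mentioned above):

```python
# If ~1.9 TB of a ~14 TB HDD is eaten by allocation overhead, a
# capacity-proportional balancer should aim for roughly 14% fewer PGs on
# osd.223 than on osd.221 to reach the same fill level.
ratio = (14.0 - 1.9) / 14.0
print(f"ideal PG count on osd.223 ~= {ratio:.0%} of osd.221's")  # ~86%
```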


Files

ceph-pg-ls-by-osd-221.txt (22.9 KB) - ceph pg ls-by-osd 221 - Alexander Patrakov, 03/05/2024 12:15 PM
ceph-pg-ls-by-osd-223.txt (23.5 KB) - ceph pg ls-by-osd 223 - Alexander Patrakov, 03/05/2024 12:15 PM
ceph-osd-df.txt (64.7 KB) - ceph osd df - Alexander Patrakov, 03/05/2024 12:15 PM
