Bug #40791

high variance in pg size

Added by Jan Fajerski almost 5 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We're seeing a cluster that has a history of being very unbalanced in terms of OSD utilisation. The balancer in upmap mode was turned on, which got the pg count perfectly balanced; however, the utilisation still varies considerably.

Looking at some OSDs, it seems that some PGs have very close to twice as many objects as others.

A slice of a processed PG dump from one OSD ordered by number of objects:

33705 ["2.82es0" 
33660 ["2.f34s3" 
33574 ["2.c63s9" 
33559 ["2.fe7s7" 
33558 ["2.f6fs2" 
33499 ["2.ebcs10" 
17245 ["2.3ds11" 
17227 ["2.1076s3" 
17217 ["2.68bs6" 
17183 ["2.34s8" 
17178 ["2.6f3s2" 
17167 ["2.211s2" 

The workload is VMs on rbd images backed by an erasure coded pool. The VM images are cloned from one master rbd image snapshot.

Initially the pg count of the (EC) data pool was deemed much too low and was increased (doubled iirc).

I don't have direct access to the cluster, but I'm trying to get an object listing for two PGs, say 2.ebcs10 and 2.3ds11 (cf. above).
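
For reference, a listing like the one above can be reproduced from "ceph pg dump". The sketch below is only a rough illustration, not a tested tool: it assumes the JSON form of "ceph pg dump --format json" (newer releases nest the per-PG stats under a "pg_map" key, older ones keep them at the top level) and prints the PGs mapped to a given OSD, ordered by object count.

    #!/usr/bin/env python3
    # Rough sketch: list PGs mapped to one OSD, ordered by object count.
    # Assumes the JSON layout of "ceph pg dump --format json"; newer
    # releases nest the per-PG stats under "pg_map", older ones keep
    # them at the top level.
    import json
    import subprocess
    import sys

    osd = int(sys.argv[1])  # OSD id to inspect, e.g. 12

    dump = json.loads(subprocess.check_output(
        ['ceph', 'pg', 'dump', '--format', 'json']))
    pg_stats = dump.get('pg_stats') or dump.get('pg_map', {}).get('pg_stats', [])

    rows = [(pg['stat_sum']['num_objects'], pg['pgid'])
            for pg in pg_stats if osd in pg.get('acting', [])]

    for num_objects, pgid in sorted(rows, reverse=True):
        print(num_objects, pgid)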

Actions #1

Updated by Neha Ojha almost 5 years ago

  • Status changed from New to Need More Info

Which ceph version are you using?

Actions #2

Updated by Greg Farnum almost 5 years ago

  • Assignee set to Jan Fajerski

It sure looks like the PG count isn't a power of two, so some of them are simply half size compared to the others. (Since they cluster around ~17200 and 33500 objects — although I'm surprised it's not exactly doubled).

Depending on the major release you may be able to switch the balancer to balance based on bytes rather than PG count; it's configurable, and the defaults and implementations have changed a bit.

Actions #3

Updated by Lars Marowsky-Brée almost 5 years ago

This is Luminous, 12.2.12 by now.

Balancing on bytes (reweight-by-utilization) was unable to resolve the issue previously. Once the clients were confirmed to be recent enough, we went for the upmap balancer in the hope that it would help. We got perfectly balanced pg counts; unfortunately, the PGs themselves are really quite uneven.

Actions #4

Updated by Jan Fajerski almost 5 years ago

Greg Farnum wrote:

It sure looks like the PG count isn't a power of two, so some of them are simply half size compared to the others. (Since they cluster around ~17200 and 33500 objects — although I'm surprised it's not exactly doubled).

My last info is that the pool in question (id 2) has 4096 PGs.

Actions #5

Updated by Jan Fajerski almost 5 years ago

Jan Fajerski wrote:

Greg Farnum wrote:

It sure looks like the PG count isn't a power of two, so some of them are simply half size compared to the others. (Since they cluster around ~17200 and 33500 objects — although I'm surprised it's not exactly doubled).

My last info is that the pool in question (id 2) has 4096 PGs.

Correction: The pool in question has 6144 PGs.

I'm surprised, however, at how large an imbalance this cluster shows. Again, the PG counts are perfectly balanced by the upmap balancer. Regardless, the least used OSD holds 1.61 TiB or 45.96% of its capacity, whereas the most heavily used OSD holds 2.60 TiB or 74.31% of its capacity.
Is the impact of not using a power of two for the PG count really that high?

Actions #6

Updated by Greg Farnum almost 5 years ago

Lars Marowsky-Brée wrote:

This is Luminous, 12.2.12 by now.

Balancing on bytes (reweight-by-utilization) was unable to resolve the issue previously. Once the clients were confirmed to be recent enough, we went for the upmap balancer in the hope that it would help. We got perfectly balanced pg counts; unfortunately, the PGs themselves are really quite uneven.

The reweight-by-utilization command has always been pretty weak and is not what I meant: the upmap balancer (in some releases) has a mode that balances for bytes rather than PG counts.

Jan Fajerski wrote:

Jan Fajerski wrote:

Greg Farnum wrote:

It sure looks like the PG count isn't a power of two, so some of them are simply half size compared to the others. (Since they cluster around ~17200 and 33500 objects — although I'm surprised it's not exactly doubled).

My last info is that the pool in question (id 2) has 4096 PGs.

Correction: The pool in question has 6144 PGs.

I'm surprised, however, at how large an imbalance this cluster shows. Again, the PG counts are perfectly balanced by the upmap balancer. Regardless, the least used OSD holds 1.61 TiB or 45.96% of its capacity, whereas the most heavily used OSD holds 2.60 TiB or 74.31% of its capacity.
Is the impact of not using a power of two for the PG count really that high?

PGs split by dividing their hash range in half. So if you have a PG count that is not a power of two, some of the PGs are twice the size of the others. Our expectation in the past has been that this shouldn't be an issue in normal use, and it mostly isn't, but it does make some difference. It's possible the balancer has some behavior that exacerbates things, though.
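
To illustrate the split, here is a small sketch based on the stable-mod placement function (ceph_stable_mod); the mask computation is an assumption meant to match how Ceph derives the pg_num mask, i.e. 2^k - 1 for the smallest k with 2^k >= pg_num. It shows which PGs end up covering two hash slices when pg_num is 6144:

    # Sketch: why a non-power-of-two pg_num yields two PG sizes.
    # ceph_stable_mod mirrors Ceph's placement helper; bmask is assumed
    # to be 2**k - 1 for the smallest k with 2**k >= pg_num.

    def ceph_stable_mod(x, b, bmask):
        # Map an object's hash x onto one of b PGs.
        if (x & bmask) < b:
            return x & bmask
        return x & (bmask >> 1)

    pg_num = 6144                                  # this cluster's EC data pool
    bmask = (1 << (pg_num - 1).bit_length()) - 1   # 8191 for 6144 PGs

    # Count how many hash slices (out of bmask + 1) land in each PG.
    slices = [0] * pg_num
    for h in range(bmask + 1):
        slices[ceph_stable_mod(h, pg_num, bmask)] += 1

    print(sum(1 for s in slices if s == 1))  # 4096 PGs cover one hash slice
    print(sum(1 for s in slices if s == 2))  # 2048 PGs cover two hash slices

With 6144 PGs, a third of them (ids 2048 through 4095 in this sketch) cover twice the hash range of the rest, which lines up with the roughly 2:1 spread of object counts in the dump above.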

Actions #7

Updated by Jan Fajerski over 4 years ago

Greg Farnum wrote:

PGs split by dividing their hash range in half. So if you have a PG count that is not a power of two, some of the PGs are twice the size of the others. Our expectation in the past has been that this shouldn't be an issue in normal use, and it mostly isn't, but it does make some difference. It's possible the balancer has some behavior that exacerbates things, though.

I was wondering if this could also be exacerbated by the erasure coding setup. They run 10+2 with failure domain host and have 17 hosts. Could this create a situation where the larger and smaller PGs are not (almost) randomly distributed across all OSDs, but the distribution ends up being somewhat regular? Basically, does a ruleset like this reduce the randomness of CRUSH's distribution?
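
One way to check would be to count, per OSD, how many double-sized and single-sized PGs it carries. Below is a rough sketch, assuming the same "ceph pg dump --format json" layout as the sketch in the description; the 25000-object threshold and the pool id 2 filter are assumptions taken from the numbers in this ticket.

    # Sketch: count large vs. small PGs per OSD for the EC data pool.
    # Assumes the "ceph pg dump --format json" layout; the 25000-object
    # threshold and the pool id 2 filter come from this ticket's numbers.
    import json
    import subprocess
    from collections import Counter

    dump = json.loads(subprocess.check_output(
        ['ceph', 'pg', 'dump', '--format', 'json']))
    pg_stats = dump.get('pg_stats') or dump.get('pg_map', {}).get('pg_stats', [])

    THRESHOLD = 25000           # sits between the ~17k and ~33k object counts
    large, small = Counter(), Counter()
    for pg in pg_stats:
        if not pg['pgid'].startswith('2.'):     # EC data pool has id 2
            continue
        bucket = large if pg['stat_sum']['num_objects'] > THRESHOLD else small
        for osd in pg.get('acting', []):
            bucket[osd] += 1

    for osd in sorted(set(large) | set(small)):
        print("osd.%d: %d large / %d small" % (osd, large[osd], small[osd]))

If the ratio of large to small PGs varies a lot between OSDs, that alone would explain much of the utilisation spread even with perfectly balanced PG counts.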

In any case, given the significant imbalance this can produce, should we warn more explicitly in the docs against using pg counts that are not a power of two?

Actions #8

Updated by Greg Farnum over 4 years ago

Jan Fajerski wrote:

Greg Farnum wrote:

PGs split by dividing their hash range in half. So if you have a PG count that is not a power of two, some of the PGs are twice the size of the others. Our expectation in the past has been that this shouldn't be an issue in normal use, and it mostly isn't, but it does make some difference. It's possible the balancer has some behavior that exacerbates things, though.

I was wondering if this could also be exacerbated by the erasure coding setup. They run 10+2 with failure domain host and have 17 hosts. Could this create a situation where the larger and smaller PGs are not (almost) randomly distributed across all OSDs, but the distribution ends up being somewhat regular? Basically, does a ruleset like this reduce the randomness of CRUSH's distribution?

I don't think anybody's looked at this in depth. A lot of things look very different when the "branch factor" is not small compared to the number of items in a CRUSH bucket, so it's definitely possible.

In any case, given the significant imbalance this can produce, should we warn more explicitly in the docs against using pg counts that are not a power of two?

I haven't looked at it in a while but I was comfortable with the statements we made back when last looking at this. Mark Nelson has looked at this a lot and is probably your ally if you want to look at it more though. :)

Actions #9

Updated by Jan Fajerski over 4 years ago

  • Status changed from Need More Info to Closed