Fix #59272


Warning "1 pools have many more objects per pg.." should not be triggered on nearly empty clusters

Added by Christian Huebner about 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
common
Target version:
-
% Done:

0%

Source:
Tags:
pgpmap placement openstack
Backport:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have had multiple customer complaints about the warning "1 pools have many more objects per pg than average" and the cluster being in HEALTH_WARN on newly deployed clusters.

This warning has no value unless the cluster is populated well enough for the autoscaler to work properly. In this case the pool referred to in the warning has 20 objects per PG while the others have 2. This gives customers a poor impression of Ceph; even an empty cluster should be able to become healthy.

This problem has existed for many versions now and should be addressed. I searched for similar tickets, but could not find anything.

I had a look at the source code and found the warning is triggered in only one place, in PGMap.cc.
Here are the conditions under which the warning is triggered:

        if (mon_pg_warn_max_object_skew > 0 &&
            ratio > mon_pg_warn_max_object_skew) {
          ostringstream ss;
          if (pi->pg_autoscale_mode != pg_pool_t::pg_autoscale_mode_t::ON) {

If I understand this correctly, the warning is supposed to fire when the skew parameter is greater than zero, the actual ratio exceeds the skew threshold, AND the pool's autoscale mode is not pg_pool_t::pg_autoscale_mode_t::ON (i.e. autoscale is in warn mode or off). However, this does not seem to work as intended: autoscale was on for all pools in several cases I investigated.

But even if this mechanism worked as designed, it still would only cover part of the cases. Some customers don't want the autoscaler on, and they should not get the nuisance warning on an almost empty cluster either.

My proposal is to trigger this warning only above a threshold of objects per PG (to be determined; maybe 1000 objects per PG), or at least not to raise HEALTH_WARN when the warning fires below this threshold.

Additional info: This happens regularly on Ceph clusters used for OpenStack. On a new cluster, the customer starts uploading Glance images and the Glance pool grows with more objects per PG, while the Nova and Cinder pools are still empty because the cloud has only a few test VMs. The autoscaler is configured correctly but does not help in this case. Raising mon_pg_warn_max_object_skew is a band-aid and prone to being forgotten.

