Bug #23167

mgr: prometheus: ceph_pg metrics reported by prometheus plugin inconsistent with "ceph -s" output

Added by Subhachandra Chandra about 6 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: prometheus module
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This was observed in a cluster running 12.2.4.

When a host went down, "ceph -s" reported the PG counts shown below; around 28939 PGs were reported as "active+clean". The corresponding numbers reported by the Prometheus plugin were:

ceph_pg_active 326.0
ceph_pg_clean 30379.0

Note that there was a slight delay between the two data points, so some of the numbers did change in between. Still, Prometheus reports only 326 PGs as active, while the clean count is close to what "ceph -s" displays, so the active count appears to be wrong (a quick cross-check follows the metric dump below). When the cluster went back to its normal state, the active and clean counts matched.

  data:
    pools:   2 pools, 33280 pgs
    objects: 4907 objects, 1265 GB
    usage:   3096 GB used, 3926 TB / 3929 TB avail
    pgs:     1.388% pgs not active
             574/44163 objects degraded (1.300%)
             28939 active+clean
             3344 active+undersized
             512 active+undersized+degraded
             376 peering
             85 activating
             23 active+recovering+degraded
             1 activating+degraded

ceph_pg_incomplete 0.0
ceph_pg_degraded 326.0
ceph_pg_forced_backfill 0.0
ceph_pg_stale 0.0
ceph_pg_undersized 326.0
ceph_pg_peering 168.0
ceph_pg_inconsistent 0.0
ceph_pg_forced_recovery 0.0
ceph_pg_creating 0.0
ceph_pg_wait_backfill 0.0
ceph_pg_active 326.0
ceph_pg_deep 0.0
ceph_pg_scrubbing 0.0
ceph_pg_recovering 22.0
ceph_pg_repair 0.0
ceph_pg_down 0.0
ceph_pg_peered 0.0
ceph_pg_backfill 0.0
ceph_pg_clean 30379.0
ceph_pg_remapped 0.0
ceph_pg_backfill_toofull 0.0
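
As a quick cross-check (a rough sketch only; the bucket counts are copied from the "ceph -s" output above, and the timing gap means they will not line up exactly with the Prometheus scrape), splitting each "ceph -s" bucket into its individual states and summing gives roughly 32818 active and 28939 clean PGs, nowhere near the 326 active PGs reported above:

    # Cross-check: derive per-state PG totals from the "ceph -s" buckets above.
    buckets = {
        "active+clean": 28939,
        "active+undersized": 3344,
        "active+undersized+degraded": 512,
        "peering": 376,
        "activating": 85,
        "active+recovering+degraded": 23,
        "activating+degraded": 1,
    }

    per_state = {}
    for state_str, count in buckets.items():
        for state in state_str.split("+"):
            per_state[state] = per_state.get(state, 0) + count

    print(per_state)
    # {'active': 32818, 'clean': 28939, 'undersized': 3856, 'degraded': 536,
    #  'peering': 376, 'activating': 86, 'recovering': 23}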

Normal state
- - - - - -

  data:
    pools:   2 pools, 33280 pgs
    objects: 5831 objects, 1503 GB
    usage:   3451 GB used, 3926 TB / 3929 TB avail
    pgs:     33280 active+clean

ceph_pg_incomplete 0.0
ceph_pg_degraded 0.0
ceph_pg_forced_backfill 0.0
ceph_pg_stale 0.0
ceph_pg_undersized 0.0
ceph_pg_peering 0.0
ceph_pg_inconsistent 0.0
ceph_pg_forced_recovery 0.0
ceph_pg_creating 0.0
ceph_pg_wait_backfill 0.0
ceph_pg_active 33280.0
ceph_pg_deep 0.0
ceph_pg_scrubbing 0.0
ceph_pg_recovering 0.0
ceph_pg_repair 0.0
ceph_pg_down 0.0
ceph_pg_peered 0.0
ceph_pg_backfill 0.0
ceph_pg_clean 33280.0
ceph_pg_remapped 0.0
ceph_pg_backfill_toofull 0.0

History

#1 Updated by John Spray about 6 years ago

  • Assignee set to Boris Ranto

This is probably the same thing that was fixed in master in this commit:

commit 6cefd4832f59b6196f27769a1ec4934329547da9
Author: Boris Ranto <branto@redhat.com>
Date:   Fri Feb 16 18:45:58 2018 +0100

    mgr/prometheus: Fix pg_* counts

    Currently, the pg_* counts are not computed properly. We split the
    current state by '+' sign but do not add the pg count to the already
    found pg count. Instead, we overwrite any existing pg count with the new
    count. This patch fixes it by adding all the pg counts together for all
    the states.

    It also introduces a new pg_total metric that shows the total count of
    PGs.

    Signed-off-by: Boris Ranto <branto@redhat.com>

Boris, please could you look at this and backport if necessary?
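
For illustration, the accumulation the commit describes can be sketched roughly like this (a minimal Python sketch, not the actual mgr/prometheus code; the function name and input shape are made up for the example):

    # Sketch of the fix described above: when a combined PG state such as
    # "active+undersized+degraded" is split on '+', each state's running total
    # is incremented rather than overwritten, and a pg_total is kept as well.
    def pg_counts(buckets):
        """buckets maps a combined state string to the number of PGs in it."""
        counts = {"total": 0}
        for state_str, num in buckets.items():
            counts["total"] += num
            for state in state_str.split("+"):
                # the buggy version effectively did: counts[state] = num
                counts[state] = counts.get(state, 0) + num
        return counts

    # e.g. pg_counts({"active+clean": 28939, "active+undersized": 3344})
    # -> {'total': 32283, 'active': 32283, 'clean': 28939, 'undersized': 3344}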

#2 Updated by Boris Ranto about 6 years ago

Yes, this tracker covers what I have been seeing with the pg metrics. I have included the commit in this prometheus exporter backport PR:

https://github.com/ceph/ceph/pull/20642

#3 Updated by John Spray about 6 years ago

  • Status changed from New to Fix Under Review

#5 Updated by Nathan Cutler almost 6 years ago

  • Status changed from Fix Under Review to Resolved
