Bug #61824


mgr/prometheus: Prometheus metrics type counter decreasing

Added by Jonas Nemeikšis 10 months ago. Updated 9 months ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

We've added a few OSD nodes to the cluster, and after the backfills finished we noticed abnormal metrics. The metrics are of type counter, but their values are decreasing.

The affected metrics are:

ceph_pool_rd
ceph_pool_rd_bytes
ceph_pool_wr
ceph_pool_wr_bytes



Actions #1

Updated by Jonas Nemeikšis 10 months ago

It seems like rados df reports RD_OPS and WR_OPS incorrectly:

[root@mon1 ~]# rados df -p default.rgw.buckets.data
POOL_NAME                      USED    OBJECTS  CLONES      COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED       RD_OPS       RD      WR_OPS       WR  USED COMPR  UNDER COMPR
default.rgw.buckets.data    1.0 PiB  221738020       0  1108690100                   0        0         0   3766935322   79 TiB  2722274766  1.1 PiB         0 B          0 B

A few minutes later, WR_OPS had decreased:

[root@mon1 ~]# rados df -p default.rgw.buckets.data
POOL_NAME                      USED    OBJECTS  CLONES      COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED       RD_OPS       RD      WR_OPS       WR  USED COMPR  UNDER COMPR
default.rgw.buckets.data    1.0 PiB  221571530       0  1107857650                   0        0         0   3763107878   79 TiB  2714576718  1.1 PiB         0 B          0 B

Is this a known bug?

Actions #2

Updated by Radoslaw Zarzynski 10 months ago

  • Project changed from mgr to RADOS
  • Category deleted (prometheus module)

Looks like a RADOS bug. Reassigning.

Actions #3

Updated by Radoslaw Zarzynski 9 months ago

  • Status changed from New to Need More Info

I was looking for num_rd_kb in the OSD code. It looks to me like it almost never goes down, but there is this logic in split():

  void split(std::vector<object_stat_sum_t> &out) const {
#define SPLIT(PARAM)                            \
    for (unsigned i = 0; i < out.size(); ++i) { \
      out[i].PARAM = PARAM / out.size();        \
      if (i < (PARAM % out.size())) {           \
        out[i].PARAM++;                         \
      }                                         \
    }
#define SPLIT_PRESERVE_NONZERO(PARAM)           \
    for (unsigned i = 0; i < out.size(); ++i) { \
      if (PARAM)                                \
        out[i].PARAM = 1 + PARAM / out.size();  \
      else                                      \
        out[i].PARAM = 0;                       \
    }
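
For reference, here is a minimal standalone sketch (not Ceph code; the values are made up) of the arithmetic the SPLIT macro above performs: the parent PG's counter is divided across the child PGs and the remainder is handed out one increment at a time, so the children sum back exactly to the parent value. SPLIT_PRESERVE_NONZERO, by contrast, rounds every nonzero child up and can slightly inflate the total.

// Standalone illustration of the SPLIT arithmetic above (hypothetical values).
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  const uint64_t parent_wr_ops = 10;   // hypothetical parent-PG write-op counter
  std::vector<uint64_t> children(3);   // hypothetical split into 3 child PGs

  for (unsigned i = 0; i < children.size(); ++i) {
    children[i] = parent_wr_ops / children.size();  // 10 / 3 = 3 each
    if (i < (parent_wr_ops % children.size()))      // first (10 % 3) = 1 child gets +1
      children[i]++;
  }

  // children == {4, 3, 3}; the sum is still 10, so SPLIT preserves the total.
  uint64_t sum = 0;
  for (auto v : children) sum += v;
  std::cout << "sum of children = " << sum << '\n';  // prints 10
  return 0;
}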

Do you have the autoscaler turned on? Do the freshly added OSDs map into the affected pool? Is the issue reproducible?

Actions #4

Updated by Jonas Nemeikšis 9 months ago

We've turned off the autoscaler in all clusters. Yes, the freshly added OSDs map into the affected pool. Maybe it is related to the fact that we added OSDs and then increased pg_num on the pool?

The issue seems reproducible when increasing pg_num. For now, one cluster is backfilling without an increased pg_num; I will come back with more information after the backfills finish.

I've added screenshots: the backfills are done and the metrics are wrong afterwards, but after a few days the metrics return to normal (the values decrease little by little).

Actions #5

Updated by Jonas Nemeikšis 9 months ago

Yes, I can confirm that the metrics only go wrong in this scenario:

1. Added new OSD nodes to the pool
2. Increased pg_num
3. Backfills finished
4. Metrics wrong for ~2 days (attached screenshot)

It is not critical, but why are the metrics abnormal for two days?
