Bug #61824


mgr/prometheus: Prometheus metrics type counter decreasing

Added by Jonas Nemeikšis 10 months ago. Updated 9 months ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

We've added a few OSD nodes to the cluster, and after the backfills finished we noticed abnormal metrics. The metrics are of type counter, but their values are decreasing.

The affected metrics are:

ceph_pool_rd
ceph_pool_rd_bytes
ceph_pool_wr
ceph_pool_wr_bytes



Actions #1

Updated by Jonas Nemeikšis 10 months ago

It seems like rados df reports RD_OPS and WR_OPS incorrectly:

[root@mon1 ~]# rados df -p default.rgw.buckets.data
POOL_NAME                      USED    OBJECTS  CLONES      COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED       RD_OPS       RD      WR_OPS       WR  USED COMPR  UNDER COMPR
default.rgw.buckets.data    1.0 PiB  221738020       0  1108690100                   0        0         0   3766935322   79 TiB  2722274766  1.1 PiB         0 B          0 B

A few minutes later, WR_OPS had decreased:

[root@mon1 ~]# rados df -p default.rgw.buckets.data
POOL_NAME                      USED    OBJECTS  CLONES      COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED       RD_OPS       RD      WR_OPS       WR  USED COMPR  UNDER COMPR
default.rgw.buckets.data    1.0 PiB  221571530       0  1107857650                   0        0         0   3763107878   79 TiB  2714576718  1.1 PiB         0 B          0 B

Is this a known bug?

Actions #2

Updated by Radoslaw Zarzynski 10 months ago

  • Project changed from mgr to RADOS
  • Category deleted (prometheus module)

Looks like a RADOS bug. Reassigning.

Actions #3

Updated by Radoslaw Zarzynski 9 months ago

  • Status changed from New to Need More Info

I was looking for num_rd_kb in the OSD code. It looks to me like it almost never goes down, but there is this logic in split():

  void split(std::vector<object_stat_sum_t> &out) const {
#define SPLIT(PARAM)                            \
    for (unsigned i = 0; i < out.size(); ++i) { \
      out[i].PARAM = PARAM / out.size();        \
      if (i < (PARAM % out.size())) {           \
        out[i].PARAM++;                         \
      }                                         \
    }
#define SPLIT_PRESERVE_NONZERO(PARAM)           \
    for (unsigned i = 0; i < out.size(); ++i) { \
      if (PARAM)                                \
        out[i].PARAM = 1 + PARAM / out.size();  \
      else                                      \
        out[i].PARAM = 0;                       \
    }
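
For reference, here is a minimal standalone sketch (not Ceph code; the values are made up) of the arithmetic the SPLIT macro above performs: the parent PG's counter is divided across the child PGs and the remainder is handed out one increment at a time, so the children sum back exactly to the parent value. SPLIT_PRESERVE_NONZERO, by contrast, rounds every nonzero child up and can slightly inflate the total.

// Standalone illustration of the SPLIT arithmetic above (hypothetical values).
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  const uint64_t parent_wr_ops = 10;   // hypothetical parent-PG write-op counter
  std::vector<uint64_t> children(3);   // hypothetical split into 3 child PGs

  for (unsigned i = 0; i < children.size(); ++i) {
    children[i] = parent_wr_ops / children.size();  // 10 / 3 = 3 each
    if (i < (parent_wr_ops % children.size()))      // first (10 % 3) = 1 child gets +1
      children[i]++;
  }

  // children == {4, 3, 3}; the sum is still 10, so SPLIT preserves the total.
  uint64_t sum = 0;
  for (auto v : children) sum += v;
  std::cout << "sum of children = " << sum << '\n';  // prints 10
  return 0;
}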

Do you have the autoscaler turned on? Do the freshly added OSDs map into the affected pool? Is the issue reproducible?

Actions #4

Updated by Jonas Nemeikšis 9 months ago

We've turned off the autoscaler in all clusters. Yes, the freshly added OSDs map into the affected pool. Maybe it is related to the fact that we added OSDs and then increased pg_num on the pool?

The issue seems reproducible when increasing pg_num. For now, one cluster is backfilling without an increased pg_num; I will come back with more information after the backfills finish.

I've added screenshots: the backfills are done and the metrics are wrong afterwards, but after a few days the metrics return to normal (the values decrease little by little).

Actions #5

Updated by Jonas Nemeikšis 9 months ago

Yes, I can confirm that the metrics only go wrong in this scenario:

1. Added new OSD nodes to the pool
2. Increased pg_num
3. Backfills finished
4. Metrics wrong for ~2 days (attached screenshot)

It is not critical, but why are the metrics abnormal for two days?
