Bug #64010


Errors in Prometheus Queries due to Duplicate Series in Monitoring > Ceph Mixin

Added by soham bharambe 4 months ago. Updated 4 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Monitoring/Alerting
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are encountering errors in several Prometheus queries in our Ceph cluster. These queries come directly from `monitoring/ceph-mixin` in the Ceph repository. The errors appear to be caused by duplicate series within certain match groups, which trigger 'many-to-many matching not allowed' failures.

Affected Queries and Errors:

  1. Query:
     ceph_pool_metadata * on (pool_id, instance) group_left () (ceph_pg_total - ceph_pg_active) > 0
        

    Error:
    Error executing query: found duplicate series for the match group {instance="172.20.200.229:9283", pool_id="1"} on the right hand-side of the operation: [{container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="rook-ceph-mgr", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="rook-ceph-mgr"}, {container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="ceph", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="ceph"}];many-to-many matching not allowed: matching labels must be unique on one side
        
  2. Query:
    ceph_pool_metadata * on (pool_id, instance) group_left () (ceph_pg_total - ceph_pg_clean) > 0
        

    Error:
    Error executing query: found duplicate series for the match group {instance="172.20.200.229:9283", pool_id="1"} on the right hand-side of the operation: [{container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="rook-ceph-mgr", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="rook-ceph-mgr"}, {container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="ceph", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="ceph"}];many-to-many matching not allowed: matching labels must be unique on one side
        
  3. Query:
    (rate(ceph_osd_up[5m]) * on (ceph_daemon) group_left (hostname) ceph_osd_metadata) * 60 > 1
        

    Error:
    Error executing query: found duplicate series for the match group {ceph_daemon="osd.0"} on the right hand-side of the operation: [{__name__="ceph_osd_metadata", ceph_daemon="osd.0", ceph_version="ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)", cluster_addr="172.20.157.180", container="mgr", device_class="hdd", endpoint="http-metrics", hostname="htzhel1-ax41b.enableit.dk", instance="172.20.200.229:9283", job="rook-ceph-mgr", namespace="rook-ceph", objectstore="bluestore", pod="rook-ceph-mgr-a-9666d9c49-srvcl", public_addr="172.20.157.180", service="rook-ceph-mgr"}, {__name__="ceph_osd_metadata", ceph_daemon="osd.0", ceph_version="ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)", cluster_addr="172.20.157.180", container="mgr", device_class="hdd", endpoint="http-metrics", hostname="htzhel1-ax41b.enableit.dk", instance="172.20.200.229:9283", job="ceph", namespace="rook-ceph", objectstore="bluestore", pod="rook-ceph-mgr-a-9666d9c49-srvcl", public_addr="172.20.157.180", service="ceph"}];many-to-many matching not allowed: matching labels must be unique on one side
        
  4. Query:
    abs(((ceph_osd_numpg > 0) - on (job) group_left () avg by (job) (ceph_osd_numpg > 0)) / on (job) group_left () avg by (job) (ceph_osd_numpg > 0)) * on (ceph_daemon) group_left (hostname) ceph_osd_metadata > 0.3
        

    Error:
    Error executing query: found duplicate series for the match group {ceph_daemon="osd.0"} on the right hand-side of the operation: [{__name__="ceph_osd_metadata", ceph_daemon="osd.0", ceph_version="ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)", cluster_addr="172.20.157.180", container="mgr", device_class="hdd", endpoint="http-metrics", hostname="htzhel1-ax41b.enableit.dk", instance="172.20.200.229:9283", job="rook-ceph-mgr", namespace="rook-ceph", objectstore="bluestore", pod="rook-ceph-mgr-a-9666d9c49-srvcl", public_addr="172.20.157.180", service="rook-ceph-mgr"}, {__name__="ceph_osd_metadata", ceph_daemon="osd.0", ceph_version="ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)", cluster_addr="172.20.157.180", container="mgr", device_class="hdd", endpoint="http-metrics", hostname="htzhel1-ax41b.enableit.dk", instance="172.20.200.229:9283", job="ceph", namespace="rook-ceph", objectstore="bluestore", pod="rook-ceph-mgr-a-9666d9c49-srvcl", public_addr="172.20.157.180", service="ceph"}];many-to-many matching not allowed: matching labels must be unique on one side
        
  5. Query:
    (predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5) * on (pool_id, instance) group_right () ceph_pool_metadata) >= 95
        

    Error:
    Error executing query: found duplicate series for the match group {instance="172.20.200.229:9283", pool_id="1"} on the left hand-side of the operation: [{container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="rook-ceph-mgr", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="rook-ceph-mgr"}, {container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="ceph", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="ceph"}];many-to-many matching not allowed: matching labels must be unique on one side
        

Investigation and Findings:
Upon further investigation, I found that `count(ceph_pool_metadata) by (pool_id, instance)` returns a count of 2 for every entry, indicating that each `(pool_id, instance)` combination is exported by two distinct series.
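The same check can be run for each metadata metric involved in the failing joins; any result greater than 1 means the one-to-many match will fail (a diagnostic sketch, assuming the metric names from the queries above):

    # Count series per match group used by queries 1, 2 and 5
    count by (pool_id, instance) (ceph_pool_metadata) > 1

    # Same check for the OSD metadata joins in queries 3 and 4
    count by (ceph_daemon) (ceph_osd_metadata) > 1

Inspecting the raw series (e.g. `ceph_pool_metadata{pool_id="1"}`) then shows which labels differ between the duplicates.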

Potential Solution:
Including `job` in the query appears to resolve this issue. For example:

ceph_pool_metadata * on (pool_id, instance, job) group_left () (ceph_pg_total - ceph_pg_clean) > 0

This modification appears to address the problem: the same metrics are scraped under two `job` labels (`ceph` and `rook-ceph-mgr`) with all other labels identical, so each `(pool_id, instance)` match group contains one series per `job` instead of a single series, and the one-to-many join fails.
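Applying the same change to the other affected queries would presumably look as follows (a sketch only, untested, assuming the duplicate series differ solely in their `job`/`service` labels as shown in the errors above):

    # Query 1: add `job` to the matching labels
    ceph_pool_metadata * on (pool_id, instance, job) group_left () (ceph_pg_total - ceph_pg_active) > 0

    # Queries 3 and 4: add `job` to the OSD metadata join
    (rate(ceph_osd_up[5m]) * on (ceph_daemon, job) group_left (hostname) ceph_osd_metadata) * 60 > 1

    # Query 5: same change on the group_right join
    (predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5) * on (pool_id, instance, job) group_right () ceph_pool_metadata) >= 95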

Request:
I would appreciate guidance on how to resolve these query errors permanently. It may be necessary to adjust the queries in the mixin to account for the `job` or `service` label.
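If changing the matching labels in the mixin is undesirable, another option might be to collapse the duplicates with an aggregation before the join. Since the affected queries use an empty `group_left ()`/`group_right ()` (the metadata carries no extra labels into the result), aggregating both sides should preserve the result (a sketch under that assumption, shown for query 2):

    max by (pool_id, instance) (ceph_pool_metadata)
      * on (pool_id, instance)
        max by (pool_id, instance) (ceph_pg_total - ceph_pg_clean)
      > 0

This avoids depending on the scrape configuration, at the cost of discarding the non-matching labels from the output.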
