Bug #64010
Errors in Prometheus Queries due to Duplicate Series in Monitoring > Ceph Mixin
Description:
We are encountering errors in several Prometheus queries in our Ceph cluster. These queries come directly from the `monitoring/ceph-mixin` in the Ceph repository. The errors appear to be caused by duplicate series for certain match groups, which produce 'many-to-many matching not allowed' failures.
Affected Queries and Errors:
- Query:
ceph_pool_metadata * on (pool_id, instance) group_left () (ceph_pg_total - ceph_pg_active) > 0
Error:
Error executing query: found duplicate series for the match group {instance="172.20.200.229:9283", pool_id="1"} on the right hand-side of the operation: [{container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="rook-ceph-mgr", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="rook-ceph-mgr"}, {container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="ceph", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="ceph"}];many-to-many matching not allowed: matching labels must be unique on one side
- Query:
ceph_pool_metadata * on (pool_id, instance) group_left () (ceph_pg_total - ceph_pg_clean) > 0
Error:
Error executing query: found duplicate series for the match group {instance="172.20.200.229:9283", pool_id="1"} on the right hand-side of the operation: [{container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="rook-ceph-mgr", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="rook-ceph-mgr"}, {container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="ceph", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="ceph"}];many-to-many matching not allowed: matching labels must be unique on one side
- Query:
(rate(ceph_osd_up[5m]) * on (ceph_daemon) group_left (hostname) ceph_osd_metadata) * 60 > 1
Error:
Error executing query: found duplicate series for the match group {ceph_daemon="osd.0"} on the right hand-side of the operation: [{__name__="ceph_osd_metadata", ceph_daemon="osd.0", ceph_version="ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)", cluster_addr="172.20.157.180", container="mgr", device_class="hdd", endpoint="http-metrics", hostname="htzhel1-ax41b.enableit.dk", instance="172.20.200.229:9283", job="rook-ceph-mgr", namespace="rook-ceph", objectstore="bluestore", pod="rook-ceph-mgr-a-9666d9c49-srvcl", public_addr="172.20.157.180", service="rook-ceph-mgr"}, {__name__="ceph_osd_metadata", ceph_daemon="osd.0", ceph_version="ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)", cluster_addr="172.20.157.180", container="mgr", device_class="hdd", endpoint="http-metrics", hostname="htzhel1-ax41b.enableit.dk", instance="172.20.200.229:9283", job="ceph", namespace="rook-ceph", objectstore="bluestore", pod="rook-ceph-mgr-a-9666d9c49-srvcl", public_addr="172.20.157.180", service="ceph"}];many-to-many matching not allowed: matching labels must be unique on one side
- Query:
abs(((ceph_osd_numpg > 0) - on (job) group_left () avg by (job) (ceph_osd_numpg > 0)) / on (job) group_left () avg by (job) (ceph_osd_numpg > 0)) * on (ceph_daemon) group_left (hostname) ceph_osd_metadata > 0.3
Error:
Error executing query: found duplicate series for the match group {ceph_daemon="osd.0"} on the right hand-side of the operation: [{__name__="ceph_osd_metadata", ceph_daemon="osd.0", ceph_version="ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)", cluster_addr="172.20.157.180", container="mgr", device_class="hdd", endpoint="http-metrics", hostname="htzhel1-ax41b.enableit.dk", instance="172.20.200.229:9283", job="rook-ceph-mgr", namespace="rook-ceph", objectstore="bluestore", pod="rook-ceph-mgr-a-9666d9c49-srvcl", public_addr="172.20.157.180", service="rook-ceph-mgr"}, {__name__="ceph_osd_metadata", ceph_daemon="osd.0", ceph_version="ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)", cluster_addr="172.20.157.180", container="mgr", device_class="hdd", endpoint="http-metrics", hostname="htzhel1-ax41b.enableit.dk", instance="172.20.200.229:9283", job="ceph", namespace="rook-ceph", objectstore="bluestore", pod="rook-ceph-mgr-a-9666d9c49-srvcl", public_addr="172.20.157.180", service="ceph"}];many-to-many matching not allowed: matching labels must be unique on one side
- Query:
(predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5) * on (pool_id, instance) group_right () ceph_pool_metadata) >= 95
Error:
Error executing query: found duplicate series for the match group {instance="172.20.200.229:9283", pool_id="1"} on the left hand-side of the operation: [{container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="rook-ceph-mgr", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="rook-ceph-mgr"}, {container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="ceph", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="ceph"}];many-to-many matching not allowed: matching labels must be unique on one side
Investigation and Findings:
Upon further investigation, I found that `count(ceph_pool_metadata) by (pool_id, instance)` returns a count of 2 for every entry, indicating that each `(pool_id, instance)` pair carries two series rather than one.
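To confirm which labels actually distinguish the duplicates, the same count can be grouped by the suspected labels (a diagnostic sketch; `job` and `service` are the labels that differ in the error output above):

```promql
# If job/service fully explain the duplication, every group's count drops to 1:
count by (pool_id, instance, job, service) (ceph_pool_metadata)
```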
Potential Solution:
Including `job` in the query appears to resolve this issue. For example:
ceph_pool_metadata * on (pool_id, instance, job) group_left () (ceph_pg_total - ceph_pg_clean) > 0
This modification seems to address the problem: the same mgr exporter endpoint is scraped under two `job` labels (`ceph` and `rook-ceph-mgr`) with all other labels identical, so each match group contains two copies of every series instead of a single result.
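An alternative that does not require adding `job` to every matcher (a sketch, assuming the duplicate series differ only in their `job` and `service` labels) is to collapse the duplicates with an aggregation before matching:

```promql
# Aggregate away the labels that vary between the duplicate scrape jobs,
# leaving exactly one series per (pool_id, instance) on each side:
max without (job, service) (ceph_pool_metadata)
  * on (pool_id, instance) group_left ()
  (max without (job, service) (ceph_pg_total - ceph_pg_active)) > 0
```

This keeps the query working even if additional scrape jobs appear later, at the cost of discarding the `job` label from the result.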
Request:
I would appreciate guidance on how to permanently resolve these query errors. Adjusting the queries to account for the `job` or `service` label might be necessary.
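If adjusting the queries is the chosen route, the same deduplication pattern applies to the OSD queries as well; for example (a sketch under the same assumption that the duplicates differ only in `job` and `service`):

```promql
# Deduplicate both sides of the match so the right-hand side is unique
# per ceph_daemon; group_left (hostname) still pulls hostname from metadata:
(max without (job, service) (rate(ceph_osd_up[5m]))
  * on (ceph_daemon) group_left (hostname)
  max without (job, service) (ceph_osd_metadata)) * 60 > 1
```

That said, the root cause appears to be that the same mgr endpoint is scraped by two jobs (`ceph` and `rook-ceph-mgr`); removing the redundant scrape configuration (for example, a duplicate ServiceMonitor in a Rook deployment) would resolve all of these queries at once without modifying the mixin.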
Updated by soham bharambe 4 months ago
The affected queries come from the prometheus_alerts.yml file in the ceph-mixin: https://github.com/ceph/ceph/blob/49c27499af4ee9a90f69fcc6bf3597999d6efc7b/monitoring/ceph-mixin/prometheus_alerts.yml in the repo https://github.com/ceph/ceph