Bug #64010
Errors in Prometheus Queries due to Duplicate Series in Monitoring > Ceph Mixin
Description:
We are encountering errors in several Prometheus queries in our Ceph cluster. These queries come directly from the `monitoring/ceph-mixin` in the Ceph repository. The errors appear to be caused by duplicate series for certain match groups, which produce 'many-to-many matching not allowed' failures.
Affected Queries and Errors:
- Query:
ceph_pool_metadata * on (pool_id, instance) group_left () (ceph_pg_total - ceph_pg_active) > 0
Error:
Error executing query: found duplicate series for the match group {instance="172.20.200.229:9283", pool_id="1"} on the right hand-side of the operation: [{container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="rook-ceph-mgr", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="rook-ceph-mgr"}, {container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="ceph", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="ceph"}];many-to-many matching not allowed: matching labels must be unique on one side
- Query:
ceph_pool_metadata * on (pool_id, instance) group_left () (ceph_pg_total - ceph_pg_clean) > 0
Error:
Error executing query: found duplicate series for the match group {instance="172.20.200.229:9283", pool_id="1"} on the right hand-side of the operation: [{container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="rook-ceph-mgr", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="rook-ceph-mgr"}, {container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="ceph", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="ceph"}];many-to-many matching not allowed: matching labels must be unique on one side
- Query:
(rate(ceph_osd_up[5m]) * on (ceph_daemon) group_left (hostname) ceph_osd_metadata) * 60 > 1
Error:
Error executing query: found duplicate series for the match group {ceph_daemon="osd.0"} on the right hand-side of the operation: [{__name__="ceph_osd_metadata", ceph_daemon="osd.0", ceph_version="ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)", cluster_addr="172.20.157.180", container="mgr", device_class="hdd", endpoint="http-metrics", hostname="htzhel1-ax41b.enableit.dk", instance="172.20.200.229:9283", job="rook-ceph-mgr", namespace="rook-ceph", objectstore="bluestore", pod="rook-ceph-mgr-a-9666d9c49-srvcl", public_addr="172.20.157.180", service="rook-ceph-mgr"}, {__name__="ceph_osd_metadata", ceph_daemon="osd.0", ceph_version="ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)", cluster_addr="172.20.157.180", container="mgr", device_class="hdd", endpoint="http-metrics", hostname="htzhel1-ax41b.enableit.dk", instance="172.20.200.229:9283", job="ceph", namespace="rook-ceph", objectstore="bluestore", pod="rook-ceph-mgr-a-9666d9c49-srvcl", public_addr="172.20.157.180", service="ceph"}];many-to-many matching not allowed: matching labels must be unique on one side
- Query:
abs(((ceph_osd_numpg > 0) - on (job) group_left () avg by (job) (ceph_osd_numpg > 0)) / on (job) group_left () avg by (job) (ceph_osd_numpg > 0)) * on (ceph_daemon) group_left (hostname) ceph_osd_metadata > 0.3
Error:
Error executing query: found duplicate series for the match group {ceph_daemon="osd.0"} on the right hand-side of the operation: [{__name__="ceph_osd_metadata", ceph_daemon="osd.0", ceph_version="ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)", cluster_addr="172.20.157.180", container="mgr", device_class="hdd", endpoint="http-metrics", hostname="htzhel1-ax41b.enableit.dk", instance="172.20.200.229:9283", job="rook-ceph-mgr", namespace="rook-ceph", objectstore="bluestore", pod="rook-ceph-mgr-a-9666d9c49-srvcl", public_addr="172.20.157.180", service="rook-ceph-mgr"}, {__name__="ceph_osd_metadata", ceph_daemon="osd.0", ceph_version="ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)", cluster_addr="172.20.157.180", container="mgr", device_class="hdd", endpoint="http-metrics", hostname="htzhel1-ax41b.enableit.dk", instance="172.20.200.229:9283", job="ceph", namespace="rook-ceph", objectstore="bluestore", pod="rook-ceph-mgr-a-9666d9c49-srvcl", public_addr="172.20.157.180", service="ceph"}];many-to-many matching not allowed: matching labels must be unique on one side
- Query:
(predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5) * on (pool_id, instance) group_right () ceph_pool_metadata) >= 95
Error:
Error executing query: found duplicate series for the match group {instance="172.20.200.229:9283", pool_id="1"} on the left hand-side of the operation: [{container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="rook-ceph-mgr", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="rook-ceph-mgr"}, {container="mgr", endpoint="http-metrics", instance="172.20.200.229:9283", job="ceph", namespace="rook-ceph", pod="rook-ceph-mgr-a-9666d9c49-srvcl", pool_id="1", service="ceph"}];many-to-many matching not allowed: matching labels must be unique on one side
Investigation and Findings:
Upon further investigation, I found that `count(ceph_pool_metadata) by (pool_id, instance)` returns a count of 2 for every entry, indicating that each `(pool_id, instance)` pair carries two series rather than one.
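To confirm which labels actually distinguish the duplicates, the same count can be grouped by the suspected labels (a diagnostic sketch; `job` and `service` are the labels that differ in the error output above):

```promql
# If job/service fully explain the duplication, every group's count drops to 1:
count by (pool_id, instance, job, service) (ceph_pool_metadata)
```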
Potential Solution:
Including `job` in the query appears to resolve this issue. For example:
ceph_pool_metadata * on (pool_id, instance, job) group_left () (ceph_pg_total - ceph_pg_clean) > 0
This modification seems to address the problem: the same mgr exporter endpoint is scraped under two `job` labels (`ceph` and `rook-ceph-mgr`) with all other labels identical, so each match group contains two copies of every series instead of a single result.
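An alternative that does not require adding `job` to every matcher (a sketch, assuming the duplicate series differ only in their `job` and `service` labels) is to collapse the duplicates with an aggregation before matching:

```promql
# Aggregate away the labels that vary between the duplicate scrape jobs,
# leaving exactly one series per (pool_id, instance) on each side:
max without (job, service) (ceph_pool_metadata)
  * on (pool_id, instance) group_left ()
  (max without (job, service) (ceph_pg_total - ceph_pg_active)) > 0
```

This keeps the query working even if additional scrape jobs appear later, at the cost of discarding the `job` label from the result.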
Request:
I would appreciate guidance on how to permanently resolve these query errors. Adjusting the queries to account for the `job` or `service` label might be necessary.
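If adjusting the queries is the chosen route, the same deduplication pattern applies to the OSD queries as well; for example (a sketch under the same assumption that the duplicates differ only in `job` and `service`):

```promql
# Deduplicate both sides of the match so the right-hand side is unique
# per ceph_daemon; group_left (hostname) still pulls hostname from metadata:
(max without (job, service) (rate(ceph_osd_up[5m]))
  * on (ceph_daemon) group_left (hostname)
  max without (job, service) (ceph_osd_metadata)) * 60 > 1
```

That said, the root cause appears to be that the same mgr endpoint is scraped by two jobs (`ceph` and `rook-ceph-mgr`); removing the redundant scrape configuration (for example, a duplicate ServiceMonitor in a Rook deployment) would resolve all of these queries at once without modifying the mixin.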
Updated by soham bharambe 4 months ago
The affected queries come from the prometheus_alerts.yml file in the ceph-mixin: https://github.com/ceph/ceph/blob/49c27499af4ee9a90f69fcc6bf3597999d6efc7b/monitoring/ceph-mixin/prometheus_alerts.yml in the repo https://github.com/ceph/ceph