Bug #55029
mgr/prometheus: ceph_mon_metadata is not consistently populating the ceph_version
Status: open
Description
Some users rely on the ceph_mon_metadata metric, exported by mgr/prometheus, to detect a Ceph version mismatch within the cluster that needs to be resolved.
The issue they hit is that the ceph_version label is sometimes not populated, which causes the alert to fire erroneously. In the sample below, mon.a and mon.b report the same ceph_version, while mon.c is missing both the ceph_version and hostname labels:
ceph_mon_metadata{ceph_daemon="mon.a", ceph_version="ceph version 14.2.11-199.el8cp (f5470cbfb5a4dac5925284cef1215f3e4e191a38) nautilus (stable)", container="mgr", endpoint="http-metrics", hostname="ip-10-163-144-114.eu-west-1.compute.internal", instance="10.129.4.23:9283", job="rook-ceph-mgr", namespace="openshift-storage", pod="rook-ceph-mgr-a-6cbdc85c66-x97xj", public_addr="172.30.78.135", rank="0", service="rook-ceph-mgr"}
ceph_mon_metadata{ceph_daemon="mon.b", ceph_version="ceph version 14.2.11-199.el8cp (f5470cbfb5a4dac5925284cef1215f3e4e191a38) nautilus (stable)", container="mgr", endpoint="http-metrics", hostname="ip-10-163-144-182.eu-west-1.compute.internal", instance="10.129.4.23:9283", job="rook-ceph-mgr", namespace="openshift-storage", pod="rook-ceph-mgr-a-6cbdc85c66-x97xj", public_addr="172.30.71.229", rank="1", service="rook-ceph-mgr"}
ceph_mon_metadata{ceph_daemon="mon.c", container="mgr", endpoint="http-metrics", instance="10.129.4.23:9283", job="rook-ceph-mgr", namespace="openshift-storage", pod="rook-ceph-mgr-a-6cbdc85c66-x97xj", public_addr="172.30.100.211", rank="2", service="rook-ceph-mgr"}
This is the alert definition:
count(count by(ceph_version) (ceph_mon_metadata{job="rook-ceph-mgr"})) > 1
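Because PromQL treats a missing label the same as a label with an empty value, the mon.c series forms a second group under count by(ceph_version), so the outer count is 2 even though every reported version is identical. A workaround is to exclude series with an empty ceph_version from the query: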
count(count(ceph_mon_metadata{job="rook-ceph-mgr", ceph_version!=""}) by (ceph_version)) > 1
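The difference between the two queries can be illustrated with a minimal Python sketch. This is not the mgr/prometheus code or a PromQL evaluator; `series` and `distinct_versions` are hypothetical names used only for illustration, and the label sets are trimmed copies of the three samples above. It assumes the PromQL rule that a missing label is equivalent to an empty one.

```python
# Label sets trimmed from the three ceph_mon_metadata samples in this report.
# mon.c carries no ceph_version label at all.
VERSION = (
    "ceph version 14.2.11-199.el8cp "
    "(f5470cbfb5a4dac5925284cef1215f3e4e191a38) nautilus (stable)"
)
series = [
    {"ceph_daemon": "mon.a", "ceph_version": VERSION},
    {"ceph_daemon": "mon.b", "ceph_version": VERSION},
    {"ceph_daemon": "mon.c"},
]


def distinct_versions(series, drop_empty=False):
    """Roughly mimic count(count by (ceph_version) (...)).

    Returns the number of distinct ceph_version groups. Unlike PromQL,
    which yields an empty result for an empty input, this returns 0.
    """
    # A missing label is treated like an empty string, as in PromQL.
    values = [s.get("ceph_version", "") for s in series]
    if drop_empty:
        # Equivalent of the ceph_version!="" matcher in the workaround.
        values = [v for v in values if v != ""]
    return len(set(values))


original = distinct_versions(series)                    # alert query
workaround = distinct_versions(series, drop_empty=True)  # workaround query

print("original fires:", original > 1)
print("workaround fires:", workaround > 1)
```

With the sample data, the original query sees two groups (the real version and the empty one) and fires, while the workaround sees one group and stays silent. Note that the workaround only hides the symptom: a monitor whose version is genuinely unknown is simply ignored by the alert.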