Bug #52714
mgr/prometheus: Update ceph_healthcheck_* metric value to 1 when triggered
Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
We want to add more health check conditions here to make the cluster easier to monitor (https://github.com/ceph/ceph/blob/master/src/pybind/mgr/prometheus/module.py#L115):
HEALTH_CHECKS = [
    alert_metric('SLOW_OPS', 'OSD or Monitor requests taking a long time to process'),
]
The default value is 0 (https://github.com/ceph/ceph/blob/master/src/pybind/mgr/prometheus/module.py#L558).
When a health warning is triggered, the metric should be set to 1 (or some other value that differs from the default 0), but it is not, because of the code here (https://github.com/ceph/ceph/blob/master/src/pybind/mgr/prometheus/module.py#L540).
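The intended behaviour can be illustrated with a minimal, standalone sketch. This is not the code in module.py; TRACKED_CHECKS and healthcheck_values are hypothetical names used only for illustration, and the metric naming (ceph_healthcheck_ plus the lowercased check name) is assumed from the metric shown in this ticket.

```python
from typing import Dict, Iterable

# Hypothetical stand-in for the list of health checks the exporter tracks.
TRACKED_CHECKS = ["SLOW_OPS", "OSDMAP_FLAGS"]


def healthcheck_values(active_checks: Iterable[str]) -> Dict[str, float]:
    """Return one gauge value per tracked check.

    A check that is currently active maps to 1.0; every other tracked
    check keeps the default value 0.0.
    """
    active = set(active_checks)
    return {
        f"ceph_healthcheck_{name.lower()}": 1.0 if name in active else 0.0
        for name in TRACKED_CHECKS
    }


if __name__ == "__main__":
    # Example: only OSDMAP_FLAGS is active (e.g. the noscrub flag is set).
    for metric, value in healthcheck_values(["OSDMAP_FLAGS"]).items():
        print(metric, value)
```

The point of the sketch is that every tracked check is always emitted, so a cleared warning goes back to 0 rather than disappearing.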
I fixed this and tested the fix on our private cluster.
This is example output from ceph_exporter.
After ceph osd set noscrub:
# HELP ceph_healthcheck_osdmap_flags OSD Flags (just for testing metric)
# TYPE ceph_healthcheck_osdmap_flags gauge
ceph_healthcheck_osdmap_flags 1.0
After ceph osd unset noscrub:
# HELP ceph_healthcheck_osdmap_flags OSD Flags (just for testing metric)
# TYPE ceph_healthcheck_osdmap_flags gauge
ceph_healthcheck_osdmap_flags 0.0
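To check output like the above from a script, a small helper along these lines could be used. It is only a sketch: read_gauge is a hypothetical helper, and it handles only unlabeled samples without timestamps, which is all that appears in this ticket.

```python
from typing import Optional


def read_gauge(exposition: str, metric: str) -> Optional[float]:
    """Return the value of an unlabeled metric from exposition text.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    Returns None if the metric is not present.
    """
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space: "<name> <value>".
        name, _, value = line.rpartition(" ")
        if name == metric:
            return float(value)
    return None


if __name__ == "__main__":
    sample = (
        "# HELP ceph_healthcheck_osdmap_flags OSD Flags (just for testing metric)\n"
        "# TYPE ceph_healthcheck_osdmap_flags gauge\n"
        "ceph_healthcheck_osdmap_flags 1.0\n"
    )
    print(read_gauge(sample, "ceph_healthcheck_osdmap_flags"))
```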