Bug #52714
mgr/prometheus: Update ceph_healthcheck_* metric value to 1 when triggered
Description
We want to add more health check conditions here to make it easier to monitor the cluster. (https://github.com/ceph/ceph/blob/master/src/pybind/mgr/prometheus/module.py#L115)
HEALTH_CHECKS = [
    alert_metric('SLOW_OPS', 'OSD or Monitor requests taking a long time to process'),
]
And the default value is 0 (https://github.com/ceph/ceph/blob/master/src/pybind/mgr/prometheus/module.py#L558)
When a health warning is triggered, the metric should be set to 1 (or some value other than the default 0), but it isn't because of this code (https://github.com/ceph/ceph/blob/master/src/pybind/mgr/prometheus/module.py#L540).
I fixed this and tested it on our private cluster.
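For reference, the change is roughly along these lines. This is only an illustrative sketch, not the actual module.py code: it assumes the health report format returned by ceph health --format json, where active checks appear under "checks", and the check names listed are just examples.

import json

# Illustrative sketch, not the real module.py change: report each exported
# healthcheck as 1.0 while the check appears in the cluster health report,
# and as 0.0 once it clears.
HEALTH_CHECK_NAMES = ['SLOW_OPS', 'OSDMAP_FLAGS']  # example subset only

def healthcheck_values(health_report):
    """Return {check_name: 0.0 or 1.0} for the checks we export."""
    active = set(health_report.get('checks', {}))
    return {name: 1.0 if name in active else 0.0 for name in HEALTH_CHECK_NAMES}

# After `ceph osd set noscrub`, OSDMAP_FLAGS shows up as an active check:
report = json.loads('{"status": "HEALTH_WARN", "checks": {"OSDMAP_FLAGS": {"severity": "HEALTH_WARN"}}}')
print(healthcheck_values(report))  # {'SLOW_OPS': 0.0, 'OSDMAP_FLAGS': 1.0}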
This is example output from the exporter.
After ceph osd set noscrub:
# HELP ceph_healthcheck_osdmap_flags OSD Flags (just for testing metric)
# TYPE ceph_healthcheck_osdmap_flags gauge
ceph_healthcheck_osdmap_flags 1.0
After ceph osd unset noscrub:
# HELP ceph_healthcheck_osdmap_flags OSD Flags (just for testing metric)
# TYPE ceph_healthcheck_osdmap_flags gauge
ceph_healthcheck_osdmap_flags 0.0
Updated by Paul Cuzner over 2 years ago
Take a look at this PR - https://github.com/ceph/ceph/pull/43293
It adds tracking for all healthchecks and introduces additional commands so you can see the cluster's healthcheck history over time.
For example:

[ceph: root@c8-node1 /]# ceph healthcheck history ls
Healthcheck Name First Seen (UTC) Last seen (UTC) Count Active
MON_DISK_LOW 2021/10/06 20:37:36 2021/10/06 20:38:36 2 No
1 health check(s) listed

[ceph: root@c8-node1 /]# ceph osd set noout
noout is set

[ceph: root@c8-node1 /]# ceph healthcheck history ls
Healthcheck Name First Seen (UTC) Last seen (UTC) Count Active
MON_DISK_LOW 2021/10/06 20:37:36 2021/10/06 20:38:36 2 No
OSDMAP_FLAGS 2021/10/19 20:37:21 2021/10/19 20:37:21 1 Yes
2 health check(s) listed

[ceph: root@c8-node1 /]# curl -s http://192.168.122.201:9283/metrics | grep ceph_health_detail
# HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
# TYPE ceph_health_detail gauge
ceph_health_detail{name="MON_DISK_LOW",severity="HEALTH_WARN"} 0.0
ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 1.0

[ceph: root@c8-node1 /]# ceph osd unset noout
noout is unset

(after the next prometheus scrape interval)

[ceph: root@c8-node1 /]# ceph healthcheck history ls
Healthcheck Name First Seen (UTC) Last seen (UTC) Count Active
MON_DISK_LOW 2021/10/06 20:37:36 2021/10/06 20:38:36 2 No
OSDMAP_FLAGS 2021/10/19 20:37:21 2021/10/19 20:37:21 1 No
2 health check(s) listed

[ceph: root@c8-node1 /]# curl -s http://192.168.122.201:9283/metrics | grep ceph_health_detail
# HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
# TYPE ceph_health_detail gauge
ceph_health_detail{name="MON_DISK_LOW",severity="HEALTH_WARN"} 0.0
ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0
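If you want to consume the new metric from a script, a minimal consumer might look like this. It's only a sketch, assuming the same mgr/prometheus endpoint used in the curl calls above and the label order shown in that output:

import re
import urllib.request

# Sketch: list the healthchecks that ceph_health_detail currently reports as active.
# The endpoint matches the curl example above; adjust host/port for your cluster.
METRICS_URL = 'http://192.168.122.201:9283/metrics'
LINE_RE = re.compile(r'^ceph_health_detail\{name="([^"]+)",severity="([^"]+)"\} (\S+)$')

def active_healthchecks(url=METRICS_URL):
    body = urllib.request.urlopen(url).read().decode('utf-8')
    active = []
    for line in body.splitlines():
        m = LINE_RE.match(line)
        if m and float(m.group(3)) > 0:
            active.append((m.group(1), m.group(2)))  # (check name, severity)
    return active

print(active_healthchecks())  # e.g. [('OSDMAP_FLAGS', 'HEALTH_WARN')]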
Does that work for you?