Bug #52714

open

mgr/prometheus: Update ceph_healthcheck_* metric value to 1 when triggered

Added by Jinmyeong Lee over 2 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We want to add more health check conditions here so that the cluster is easier to monitor (https://github.com/ceph/ceph/blob/master/src/pybind/mgr/prometheus/module.py#L115):

HEALTH_CHECKS = [
    alert_metric('SLOW_OPS', 'OSD or Monitor requests taking a long time to process'),
]

The default value of each metric is 0 (https://github.com/ceph/ceph/blob/master/src/pybind/mgr/prometheus/module.py#L558).
When a health warning is triggered, the corresponding metric should be set to 1 (or to some other value that differs from the default 0), but it is not, because of the code here (https://github.com/ceph/ceph/blob/master/src/pybind/mgr/prometheus/module.py#L540).
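
The expected behaviour can be sketched roughly as follows. This is only an illustration, not the code in module.py: the dictionary layout, the helper names metric_name and healthcheck_values, and the OSDMAP_FLAGS entry (taken from the test metric shown in this report) are assumptions made for the example.

```python
from typing import Dict, Iterable

# Hypothetical registry of health-check codes and descriptions.
# SLOW_OPS comes from the report; OSDMAP_FLAGS mirrors the test metric.
HEALTH_CHECKS = {
    'SLOW_OPS': 'OSD or Monitor requests taking a long time to process',
    'OSDMAP_FLAGS': 'OSD Flags (just for testing metric)',
}


def metric_name(code: str) -> str:
    """Build the exported metric name for a health-check code."""
    return 'ceph_healthcheck_' + code.lower()


def healthcheck_values(active_checks: Iterable[str]) -> Dict[str, float]:
    """Return 1.0 for every registered check that is active, else 0.0."""
    active = set(active_checks)
    return {
        metric_name(code): (1.0 if code in active else 0.0)
        for code in HEALTH_CHECKS
    }


if __name__ == '__main__':
    # With only OSDMAP_FLAGS raised, that gauge is 1.0 and SLOW_OPS stays 0.0.
    print(healthcheck_values(['OSDMAP_FLAGS']))
```

Checks that are not registered in HEALTH_CHECKS are ignored in this sketch, so an unrelated warning would not create a new metric.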

I fixed this and tested it on our private cluster.
Below is example output from the exporter.

After ceph osd set noscrub:

# HELP ceph_healthcheck_osdmap_flags OSD Flags (just for testing metric)
# TYPE ceph_healthcheck_osdmap_flags gauge
ceph_healthcheck_osdmap_flags 1.0

After ceph osd unset noscrub:

# HELP ceph_healthcheck_osdmap_flags OSD Flags (just for testing metric)
# TYPE ceph_healthcheck_osdmap_flags gauge
ceph_healthcheck_osdmap_flags 0.0
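
For reference, the three-line output above follows the Prometheus text exposition layout for a gauge. A minimal sketch of producing it is shown here; render_gauge is a hypothetical helper written for this illustration and is not part of the mgr/prometheus module.

```python
def render_gauge(name: str, help_text: str, value: float) -> str:
    """Format one gauge sample as HELP, TYPE and value lines."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {float(value)}\n"
    )


if __name__ == '__main__':
    # Value 1 while the flag is set; pass 0 after it is unset.
    print(render_gauge(
        'ceph_healthcheck_osdmap_flags',
        'OSD Flags (just for testing metric)',
        1,
    ), end='')
```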