Bug #52714

mgr/prometheus: Update ceph_healthcheck_* metric value to 1 when triggered

Added by Jinmyeong Lee over 2 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We want to add more health check conditions here so that the cluster can be monitored more easily (https://github.com/ceph/ceph/blob/master/src/pybind/mgr/prometheus/module.py#L115):

HEALTH_CHECKS = [
    alert_metric('SLOW_OPS', 'OSD or Monitor requests taking a long time to process'),
]
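
For instance (purely illustrative; the entry name and description below are hypothetical, chosen to match the osdmap_flags example further down), another condition would be added like this:

HEALTH_CHECKS = [
    alert_metric('SLOW_OPS', 'OSD or Monitor requests taking a long time to process'),
    # hypothetical additional entry, matching the osdmap_flags example below
    alert_metric('OSDMAP_FLAGS', 'OSD map flags such as noscrub or noout are set'),
]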

The default value of each of these metrics is 0 (https://github.com/ceph/ceph/blob/master/src/pybind/mgr/prometheus/module.py#L558).
When the corresponding health warning is triggered, the metric should be set to 1 (or at least to some value other than the default 0), but it isn't, because of this code: https://github.com/ceph/ceph/blob/master/src/pybind/mgr/prometheus/module.py#L540
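
A simplified, self-contained sketch of the behaviour the fix aims for (update_healthcheck_metrics and the plain dict of values are stand-ins for illustration, not the actual module code):

import json

def update_healthcheck_metrics(health_json, metrics):
    """Set each healthcheck_* value to 1 while its check is active, and back to 0 otherwise."""
    active = json.loads(health_json).get('checks', {})
    for name in list(metrics):                            # e.g. 'healthcheck_osdmap_flags'
        check_name = name[len('healthcheck_'):].upper()   # e.g. 'OSDMAP_FLAGS'
        # Previously only SLOW_OPS derived a value from the check's summary message;
        # every other triggered check was left at its default of 0.
        metrics[name] = 1 if check_name in active else 0

# Example: 'ceph osd set noscrub' makes OSDMAP_FLAGS appear in the health checks.
metrics = {'healthcheck_osdmap_flags': 0}
update_healthcheck_metrics('{"checks": {"OSDMAP_FLAGS": {}}}', metrics)
print(metrics)  # {'healthcheck_osdmap_flags': 1}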

I fixed this and tested it on our private cluster.
This is an example of the exporter output.

ceph osd set noscrub

# HELP ceph_healthcheck_osdmap_flags OSD Flags (just for testing metric)
# TYPE ceph_healthcheck_osdmap_flags gauge
ceph_healthcheck_osdmap_flags 1.0

After ceph osd unset noscrub

# HELP ceph_healthcheck_osdmap_flags OSD Flags (just for testing metric)
# TYPE ceph_healthcheck_osdmap_flags gauge
ceph_healthcheck_osdmap_flags 0.0

#1

Updated by Paul Cuzner over 2 years ago

Take a look at this PR - https://github.com/ceph/ceph/pull/43293

It adds tracking for all healthchecks, and introduces additional commands so you can see the cluster's healthcheck history over time.

For example:

[ceph: root@c8-node1 /]# ceph healthcheck history ls 
Healthcheck Name          First Seen (UTC)      Last seen (UTC)       Count  Active
MON_DISK_LOW              2021/10/06 20:37:36   2021/10/06 20:38:36       2    No  
1 health check(s) listed
[ceph: root@c8-node1 /]# ceph osd set noout
noout is set
[ceph: root@c8-node1 /]# ceph healthcheck history ls 
Healthcheck Name          First Seen (UTC)      Last seen (UTC)       Count  Active
MON_DISK_LOW              2021/10/06 20:37:36   2021/10/06 20:38:36       2    No  
OSDMAP_FLAGS              2021/10/19 20:37:21   2021/10/19 20:37:21       1   Yes  
2 health check(s) listed

[ceph: root@c8-node1 /]# curl -s http://192.168.122.201:9283/metrics | grep ceph_health_detail 
# HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
# TYPE ceph_health_detail gauge
ceph_health_detail{name="MON_DISK_LOW",severity="HEALTH_WARN"} 0.0
ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 1.0

[ceph: root@c8-node1 /]# ceph osd unset noout 
noout is unset

(after the next Prometheus scrape interval)

[ceph: root@c8-node1 /]# ceph healthcheck history ls 
Healthcheck Name          First Seen (UTC)      Last seen (UTC)       Count  Active
MON_DISK_LOW              2021/10/06 20:37:36   2021/10/06 20:38:36       2    No  
OSDMAP_FLAGS              2021/10/19 20:37:21   2021/10/19 20:37:21       1    No  
2 health check(s) listed

[ceph: root@c8-node1 /]# curl -s http://192.168.122.201:9283/metrics | grep ceph_health_detail 
# HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
# TYPE ceph_health_detail gauge
ceph_health_detail{name="MON_DISK_LOW",severity="HEALTH_WARN"} 0.0
ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0
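
If you want to check this from a script rather than with curl, something along these lines works against the same endpoint (the address and port are taken from the example above; the parsing is a minimal sketch):

import urllib.request

# Address and port taken from the example above; adjust for your own mgr.
METRICS_URL = 'http://192.168.122.201:9283/metrics'

def active_healthchecks(url=METRICS_URL):
    """Return the names of ceph_health_detail series whose current value is non-zero."""
    text = urllib.request.urlopen(url).read().decode()
    names = []
    for line in text.splitlines():
        # e.g. ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 1.0
        if line.startswith('ceph_health_detail{'):
            labels, value = line.rsplit(' ', 1)
            if float(value) != 0:
                names.append(labels.split('name="', 1)[1].split('"', 1)[0])
    return names

print(active_healthchecks())  # e.g. ['OSDMAP_FLAGS'] while noout is set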

Does that work for you?

#2

Updated by Sebastian Wagner over 2 years ago

  • Project changed from Ceph to mgr