Feature #24977
closed
Provide a base set of Prometheus alert manager rules that notify the user about common Ceph error conditions
Added by Lenz Grimmer almost 6 years ago.
Updated almost 5 years ago.
Category:
prometheus module
Description
We should create a number of pre-defined Prometheus alert manager configuration files that trigger alerts on specific Ceph error conditions. At a minimum, the following conditions should trigger an alert:
- Change in Ceph Cluster health (state change)
- Disks near full
- OSDs that are down
- OSD hosts that are down
- OSD Host Loss Check
- Slow OSD response
- OSDs with High PG Count
- PGs stuck
- Network Packet Drops and Errors
- Pool capacity utilization
- MONs Down (state change)
- Cluster Capacity Utilization
- Capacity forecast warning (if capacity would be exhausted within 6 weeks)
These alert manager configuration files should be included in the upstream Ceph code base as a reference implementation.
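As a rough sketch of what such a reference rules file could look like (the metric names, `ceph_osd_up` and the `ceph_cluster_total_*` gauges, come from the mgr/prometheus module; alert names and thresholds here are placeholders, not a final proposal):

```yaml
# Hypothetical sketch of a reference rules file for the conditions above.
groups:
  - name: ceph.rules
    rules:
      - alert: OsdDown
        expr: count(ceph_osd_up == 0) > 0
        for: 5m
        annotations:
          description: One or more OSDs have been down for more than 5 minutes
      - alert: ClusterNearFull
        expr: ceph_cluster_total_used_bytes / ceph_cluster_total_bytes > 0.85
        for: 10m
        annotations:
          description: Cluster raw capacity utilization above 85%
```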
- Assignee set to Jan Fajerski
Happy to take a stab at it unless someone beats me to it; input is obviously very welcome. A few more candidate conditions:
- Node RAM utilization >95% (maybe some other number here?)
- MDS down
- RGW down
- Network utilization >95%
- downrev daemons
- hosts with nearly filled root or /var/log partitions?
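For the node-level items, node_exporter metrics would apply. A RAM utilization rule could look roughly like this (metric names assume node_exporter >= 0.16; the 95% threshold is just the placeholder from above):

```yaml
# Sketch only: alert name and threshold are suggestions.
- alert: NodeRamFull
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.95
  for: 10m
  annotations:
    description: Node RAM utilization above 95% for more than 10 minutes
```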
What's the preferred distribution format? Should there simply be a sample *.rules file included in some package?
Regarding collaborating:
- Should we just post some alert rules here?
- Should we use node_exporter for the disk utilization?
Example Alert: Ceph health not "OK":

alert: ceph_Health_status
expr: ceph_health_status != 0
for: 10m
annotations:
  description: Ceph unhealthy for > 10m
  summary: Ceph unhealthy
Tobias Florek wrote:
What's the preferred distribution format? Should there simply be a sample *.rules file included in some package?
For now I was thinking another subfolder in https://github.com/ceph/ceph/tree/master/monitoring would be good. Let's maybe start with a single rules file that can just be dropped into a Prometheus deployment.
Regarding collaborating:
- Should we just post some alert rules here?
Either that or attach a file with alerts. Or even just some general remarks...all is welcome.
- Should we use node_exporter for the disk utilization?
Yes
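Something along these lines, perhaps (node_exporter >= 0.16 metric names; the mountpoint regex and the 5% threshold are just examples):

```yaml
# Sketch: covers the root and /var/log partitions mentioned above.
- alert: FilesystemNearFull
  expr: >
    node_filesystem_avail_bytes{mountpoint=~"/|/var/log"}
    / node_filesystem_size_bytes{mountpoint=~"/|/var/log"} < 0.05
  for: 15m
  annotations:
    description: Root or /var/log filesystem has less than 5% space left
```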
Example Alert: Ceph health not "OK":
[...]
I was thinking it would make sense to group alerts by severity. We should maybe sort them into, say, 3 severity buckets (critical, warning, info or some such) so users can route the default alerts to different alert channels (nobody wants a text message about a HEALTH_WARN cluster at 3am on a Saturday, I assume). But again, I'm open to suggestions from experienced operators.
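To illustrate the bucket idea (`ceph_health_status` encodes HEALTH_OK/WARN/ERR as 0/1/2; the severity label values are just a suggestion):

```yaml
# Sketch: same metric, two severity buckets with different hold times.
- alert: CephHealthError
  expr: ceph_health_status == 2
  for: 5m
  labels:
    severity: critical
  annotations:
    description: Ceph in HEALTH_ERR state for more than 5 minutes
- alert: CephHealthWarning
  expr: ceph_health_status == 1
  for: 30m
  labels:
    severity: warning
  annotations:
    description: Ceph in HEALTH_WARN state for more than 30 minutes
```

Alertmanager routes could then match on the severity label, e.g. paging only for critical and sending warnings to mail or chat.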
- Related to Feature #36241: mgr/dashboard: Add support for managing Prometheus alerts added
- Target version set to v14.0.0
Just to keep track of them, there are a few existing rule sets that might serve for a future reference ruleset:
- DeepSea (by @Jan)
- Ceph-Ansible (by @Boris)
- Ceph-mixins
- Tags set to monitoring
- Status changed from New to Fix Under Review
- Target version changed from v14.0.0 to v15.0.0
- Backport set to nautilus
- Pull request ID set to 27596
- Status changed from Fix Under Review to Pending Backport
- Copied to Backport #39540: nautilus: Provide a base set of Prometheus alert manager rules that notify the user about common Ceph error conditions added
- Status changed from Pending Backport to Resolved