Feature #24977
Provide a base set of Prometheus alert manager rules that notify the user about common Ceph error conditions
Description
We should create a number of pre-defined Prometheus alert manager configuration files that trigger alerts on specific Ceph error conditions. At a minimum, the following conditions should trigger an alert (see the sketch after this list):
- Change in Ceph Cluster health (state change)
- Disks near full
- OSDs that are down
- OSD hosts that are down
- OSD Host Loss Check
- Slow OSD response
- OSDs with High PG Count
- PGs stuck
- Network Packet Drops and Errors
- Pool capacity utilization
- MONs Down (state change)
- Cluster Capacity Utilization
- Capacity forecast warning (if capacity would be exhausted within 6 weeks)
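As an illustrative sketch (not a final rule set), two of the conditions above expressed as Prometheus alerting rules. The metrics ceph_health_status, ceph_cluster_total_used_bytes and ceph_cluster_total_bytes come from the ceph-mgr prometheus module; the alert names, durations and the use of predict_linear for the forecast are assumptions for illustration:

    groups:
      - name: ceph-example-alerts
        rules:
          # Health state change: fires when the cluster leaves HEALTH_OK (0).
          - alert: CephHealthNotOk
            expr: ceph_health_status != 0
            for: 5m
            annotations:
              summary: Ceph cluster health is not OK
          # Capacity forecast: fires when a linear extrapolation of the last
          # day of usage predicts exhaustion within 6 weeks (6*7*24*3600 s).
          - alert: CephCapacityForecast
            expr: predict_linear(ceph_cluster_total_used_bytes[1d], 6 * 7 * 24 * 3600) >= ceph_cluster_total_bytes
            for: 1h
            annotations:
              summary: Cluster capacity predicted to be exhausted within 6 weeks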
These alert manager configuration files should be included in the upstream Ceph code base as a reference implementation.
Related issues
- Related to Feature #36241: mgr/dashboard: Add support for managing Prometheus alerts
History
#1 Updated by Jan Fajerski over 5 years ago
- Assignee set to Jan Fajerski
Happy to take a stab at it unless someone beats me to it; input is obviously very welcome.
#2 Updated by Anonymous over 5 years ago
- Node RAM utilization >95% (maybe some other number here?)
- MDS down
- RGW down
- Network utilization >95%
- downrev daemons
- hosts with nearly filled root or /var/log partitions?
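As a hedged sketch for the RAM item (node_memory_MemAvailable_bytes and node_memory_MemTotal_bytes are the Linux node_exporter metric names; the threshold simply mirrors the 95% suggested above):

    # Fires when a node has used more than 95% of its RAM for 10 minutes.
    - alert: HostHighMemoryUse
      expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.95
      for: 10m
      labels:
        severity: warning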
#3 Updated by Tobias Florek over 5 years ago
What's the preferred distribution format? Should there simply be a sample *.rules file included in some package?
Regarding collaborating:
- Should we just post some alert rules here?
- Should we use node_exporter for the disk utilization?
Example Alert: Ceph health not "OK":
    alert: ceph_Health_status
    expr: ceph_health_status != 0
    for: 10m
    annotations:
      description: Ceph unhealthy for > 10m
      summary: Ceph unhealthy
#4 Updated by Jan Fajerski over 5 years ago
Tobias Florek wrote:
What's the preferred distribution format? Should there simply be a sample *.rules file included in some package?
For now I was thinking another subfolder in https://github.com/ceph/ceph/tree/master/monitoring would be good. Let's maybe start with a single rules file that can just be dropped into a Prometheus deployment.
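As a sketch of the "dropped into" part (the path and file name here are illustrative assumptions): Prometheus loads such a file via the rule_files section of prometheus.yml.

    rule_files:
      - /etc/prometheus/ceph_default_alerts.yml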
Regarding collaborating:
- Should we just post some alert rules here?
Either that, or attach a file with alerts, or even just some general remarks... all is welcome.
- Should we use node_exporter for the disk utilization?
Yes
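For illustration, a hedged sketch of such a rule (node_filesystem_avail_bytes and node_filesystem_size_bytes are the metric names used by current node_exporter releases; the 10% threshold is an arbitrary assumption):

    # Fires when any mounted filesystem drops below 10% free space.
    - alert: HostDiskNearFull
      expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is nearly full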
Example Alert: Ceph health not "OK":
[...]
I was thinking it would make sense to group alerts by severity. We should maybe sort them into, say, three severity buckets (critical, warning, info or some such) so users can route the default alerts to different alert channels (nobody wants a text message about a HEALTH_WARN cluster at 3am on a Saturday, I assume). But again, I'm open to suggestions from experienced operators.
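To make the severity idea concrete, a hedged sketch (the receiver names and channels are made up; ceph_osd_up is the per-OSD up/down metric from the mgr prometheus module): each rule carries a severity label, and the Alertmanager routing tree dispatches on it.

    # In the rules file: tag each alert with a severity bucket.
    - alert: CephOsdDown
      expr: ceph_osd_up == 0
      for: 5m
      labels:
        severity: critical

    # In alertmanager.yml: route severities to different channels.
    route:
      receiver: mail
      routes:
        - match:
            severity: critical
          receiver: pager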
#5 Updated by Lenz Grimmer over 5 years ago
- Related to Feature #36241: mgr/dashboard: Add support for managing Prometheus alerts added
#6 Updated by Lenz Grimmer over 5 years ago
- Target version set to v14.0.0
#7 Updated by Ernesto Puerta almost 5 years ago
- DeepSea (by @Jan)
- Ceph-Ansible (by @Boris)
- Ceph-mixins
#8 Updated by Jan Fajerski almost 5 years ago
First attempt: https://github.com/ceph/ceph/pull/27596
#9 Updated by Lenz Grimmer almost 5 years ago
- Tags set to monitoring
- Status changed from New to Fix Under Review
- Target version changed from v14.0.0 to v15.0.0
- Backport set to nautilus
- Pull request ID set to 27596
#10 Updated by Lenz Grimmer almost 5 years ago
- Status changed from Fix Under Review to Pending Backport
#11 Updated by Nathan Cutler almost 5 years ago
- Copied to Backport #39540: nautilus: Provide a base set of Prometheus alert manager rules that notify the user about common Ceph error conditions added
#12 Updated by Nathan Cutler almost 5 years ago
- Status changed from Pending Backport to Resolved