Feature #24977

Provide a base set of Prometheus alert manager rules that notify the user about common Ceph error conditions

Added by Lenz Grimmer over 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: prometheus module
Target version:
% Done: 0%
Source:
Tags:
Backport: nautilus
Reviewed:
Affected Versions:
Pull request ID:

Description

We should create a number of pre-defined Prometheus alert manager configuration files that trigger alerts on specific Ceph error conditions. At a minimum, the following conditions should trigger an alert:

  • Change in Ceph Cluster health (state change)
  • Disks near full
  • OSDs that are down
  • OSD hosts that are down
  • OSD Host Loss Check
  • Slow OSD response
  • OSDs with High PG Count
  • PGs stuck
  • Network Packet Drops and Errors
  • Pool capacity utilization
  • MONs Down (state change)
  • Cluster Capacity Utilization
  • Capacity forecast warning (if capacity would be exhausted within 6 weeks)

These alert manager configuration files should be included in the upstream Ceph code base as a reference implementation.
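
A few of the conditions listed above could be expressed as Prometheus alerting rules along the following lines. This is only a minimal sketch, not the rule set that was eventually merged. The metric names (ceph_osd_up, ceph_cluster_total_bytes, ceph_cluster_total_used_bytes) and the ceph_daemon label are assumed to be exported by the Ceph mgr prometheus module in the deployed version. The alert names, thresholds, and "for" durations are illustrative.

```yaml
groups:
  - name: ceph-example            # hypothetical group name
    rules:
      # OSDs that are down: fires once an OSD has been reported down for 5 minutes.
      - alert: CephOsdDown
        expr: ceph_osd_up == 0
        for: 5m
        annotations:
          summary: OSD down
          description: "{{ $labels.ceph_daemon }} has been down for more than 5 minutes"

      # Cluster capacity utilization: raw usage above 85% (illustrative threshold).
      - alert: CephClusterNearFull
        expr: ceph_cluster_total_used_bytes / ceph_cluster_total_bytes > 0.85
        for: 10m
        annotations:
          summary: Cluster raw capacity above 85%

      # Capacity forecast: extrapolate the last 2 days of usage linearly and
      # alert if it would exceed total capacity within 6 weeks (3628800 s).
      - alert: CephCapacityForecast
        expr: predict_linear(ceph_cluster_total_used_bytes[2d], 3628800) > ceph_cluster_total_bytes
        for: 1h
        annotations:
          summary: Cluster capacity predicted to be exhausted within 6 weeks
```

The forecast rule relies on PromQL's predict_linear, which fits a simple linear regression, so a short burst of writes inside the 2-day window can skew the prediction. A longer window trades responsiveness for stability.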


Related issues

Related to Dashboard - Feature #36241: mgr/dashboard: Add support for managing Prometheus alerts Resolved
Copied to mgr - Backport #39540: nautilus: Provide a base set of Prometheus alert manager rules that notify the user about common Ceph error conditions Resolved

History

#1 Updated by Jan Fajerski over 5 years ago

  • Assignee set to Jan Fajerski

Happy to take a stab at it unless someone beats me to it. Input is obviously very welcome.

#2 Updated by Anonymous over 5 years ago

Node RAM utilization >95% (maybe some other number here?)
MDS down
RGW down
Network utilization >95%
downrev daemons
hosts with nearly filled root or /var/log partitions?
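
Two of these host-level suggestions could be sketched with node_exporter metrics as follows. The metric names assume node_exporter 0.16 or later (older releases use different names), and the alert names, the 95% and 10% thresholds, and the durations are placeholders rather than agreed values.

```yaml
groups:
  - name: node-example            # hypothetical group name
    rules:
      # Node RAM utilization above 95%, based on the kernel's MemAvailable estimate.
      - alert: NodeMemoryUtilizationHigh
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.95
        for: 10m
        annotations:
          summary: Node RAM utilization above 95%

      # Root filesystem with less than 10% space available.
      - alert: NodeRootFilesystemNearFull
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 10m
        annotations:
          summary: Root filesystem nearly full
```

A matching rule for /var/log would only produce results on hosts where /var/log is a separate mount point, since node_exporter reports filesystem metrics per mount.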

#3 Updated by Tobias Florek over 5 years ago

What's the preferred distribution format? Should there simply be a sample *.rules file included in some package?

Regarding collaborating:

  • Should we just post some alert rules here?
  • Should we use node_exporter for the disk utilization?

Example Alert: Ceph health not "OK":

alert: ceph_Health_status
expr: ceph_health_status != 0
for: 10m
annotations:
  description: Ceph unhealthy for > 10m
  summary: Ceph unhealthy

#4 Updated by Jan Fajerski over 5 years ago

Tobias Florek wrote:

What's the preferred distribution format? Should there simply be a sample *.rules file included in some package?

For now I was thinking another subfolder in https://github.com/ceph/ceph/tree/master/monitoring would be good. Let's maybe start with a single rules file that can just be dropped into a Prometheus deployment.
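
As a rough illustration of what "dropping in" a rules file involves, Prometheus loads rule files listed under rule_files in its main configuration. The path and file name below are hypothetical.

```yaml
# prometheus.yml (excerpt)
rule_files:
  # Hypothetical location of the Ceph rules file; globs such as
  # /etc/prometheus/rules/*.yml are also accepted.
  - /etc/prometheus/rules/ceph_alerts.yml
```

The file can be validated with "promtool check rules" before Prometheus is reloaded.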

Regarding collaborating:

  • Should we just post some alert rules here?

Either that or attach a file with alerts. Or even just some general remarks...all is welcome.

  • Should we use node_exporter for the disk utilization?

Yes

Example Alert: Ceph health not "OK":

[...]

I was thinking it would make sense to group alerts by severity. We should maybe sort them into, say, 3 severity buckets (critical, warning, info or some such) so users can route the default alerts to different alert channels (nobody wants a text message about a HEALTH_WARN cluster at 3am on a Saturday, I assume). But again, I'm open to suggestions from experienced operators.
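
One way such severity buckets could be used, assuming every rule carries a severity label with one of the three values, is to route on that label in Alertmanager. The receiver names below are hypothetical, and their notification settings (email, pager, and so on) are omitted. The matchers syntax assumes Alertmanager 0.22 or later; older versions use match instead.

```yaml
# alertmanager.yml (excerpt)
route:
  receiver: team-email            # default: warnings end up here
  routes:
    # Only critical alerts page the on-call person.
    - matchers:
        - severity="critical"
      receiver: oncall-pager
    # Informational alerts go to a receiver with no notification config,
    # so they are visible in the UI but notify nobody.
    - matchers:
        - severity="info"
      receiver: discard

receivers:
  - name: team-email
  - name: oncall-pager
  - name: discard
```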

#5 Updated by Lenz Grimmer over 5 years ago

  • Related to Feature #36241: mgr/dashboard: Add support for managing Prometheus alerts added

#6 Updated by Lenz Grimmer over 5 years ago

  • Target version set to v14.0.0

#7 Updated by Ernesto Puerta almost 5 years ago

Just to keep track of them, there are currently 3 existing rule sets that might serve as a basis for a future reference ruleset:
  1. DeepSea (by @Jan)
  2. Ceph-Ansible (by @Boris)
  3. Ceph-mixins

#9 Updated by Lenz Grimmer almost 5 years ago

  • Tags set to monitoring
  • Status changed from New to Fix Under Review
  • Target version changed from v14.0.0 to v15.0.0
  • Backport set to nautilus
  • Pull request ID set to 27596

#10 Updated by Lenz Grimmer almost 5 years ago

  • Status changed from Fix Under Review to Pending Backport

#11 Updated by Nathan Cutler almost 5 years ago

  • Copied to Backport #39540: nautilus: Provide a base set of Prometheus alert manager rules that notify the user about common Ceph error conditions added

#12 Updated by Nathan Cutler almost 5 years ago

  • Status changed from Pending Backport to Resolved
