Feature #24977

Provide a base set of Prometheus alert manager rules that notify the user about common Ceph error conditions

Added by Lenz Grimmer 5 months ago. Updated about 1 month ago.

Status: New
Priority: Normal
Assignee: Jan Fajerski
Category: prometheus module
Target version: v14.0.0
Start date: 07/18/2018
Due date:
% Done: 0%
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

We should create a number of pre-defined Prometheus alert manager configuration files that trigger alerts on specific Ceph error conditions. At a minimum, the following conditions should trigger an alert:

  • Change in Ceph Cluster health (state change)
  • Disks near full
  • OSDs that are down
  • OSD hosts that are down
  • OSD Host Loss Check
  • Slow OSD response
  • OSDs with High PG Count
  • PGs stuck
  • Network Packet Drops and Errors
  • Pool capacity utilization
  • MONs Down (state change)
  • Cluster Capacity Utilization
  • Capacity forecast warning (if capacity would be exhausted within 6 weeks)

These alert manager configuration files should be included in the upstream Ceph code base as a reference implementation.
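
As a rough illustration of how a couple of these conditions could be expressed as Prometheus alerting rules, a sketch (the metric names ceph_health_status and ceph_osd_up are assumed to come from the ceph-mgr prometheus module; the thresholds, durations and severity labels are placeholders):

groups:
  - name: ceph-default-alerts
    rules:
      # HEALTH_ERR: ceph_health_status is assumed to report 0=OK, 1=WARN, 2=ERR
      - alert: CephHealthError
        expr: ceph_health_status == 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Ceph cluster has been in HEALTH_ERR for more than 5 minutes
      # Fires once per OSD whose "up" flag has been 0 for 5 minutes
      - alert: CephOSDDown
        expr: ceph_osd_up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: One or more OSDs have been down for more than 5 minutes

The remaining conditions on the list would follow the same pattern, each with an expr over the corresponding ceph-mgr or node_exporter metric.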


Related issues

Related to mgr - Feature #36241: mgr/dashboard: Add support for managing Prometheus alerts (In Progress, 11/07/2018)

History

#1 Updated by Jan Fajerski 5 months ago

  • Assignee set to Jan Fajerski

Happy to take a stab at it unless someone beats me to it; input is obviously very welcome.

#2 Updated by David Byte 4 months ago

  • Node RAM utilization > 95% (maybe some other number here?)
  • MDS down
  • RGW down
  • Network utilization > 95%
  • Down-rev daemons
  • Hosts with nearly full root or /var/log partitions?
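
For the node-level checks above, a rough node_exporter-based sketch (metric names assume node_exporter >= 0.16; the 95% threshold is only a placeholder):

alert: HostMemoryUtilizationHigh
expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 95
for: 10m
labels:
  severity: warning
annotations:
  summary: RAM utilization on {{ $labels.instance }} is above 95%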

#3 Updated by Tobias Florek 4 months ago

What's the preferred distribution format? Should there simply be a sample *.rules file included in some package?

Regarding collaborating:

  • Should we just post some alert rules here?
  • Should we use node_exporter for the disk utilization?

Example Alert: Ceph health not "OK":

alert: ceph_Health_status
expr: ceph_health_status != 0
for: 10m
annotations:
  description: Ceph unhealthy for > 10m
  summary: Ceph unhealthy
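
For reference, for Prometheus 2.x to pick such a rule up it would need to sit inside a rule group in a YAML rules file referenced from prometheus.yml, roughly like this (a sketch; the file and group names are placeholders):

# prometheus.yml (fragment)
rule_files:
  - /etc/prometheus/ceph_default_alerts.yml

# /etc/prometheus/ceph_default_alerts.yml
groups:
  - name: ceph.rules
    rules:
      - alert: ceph_Health_status
        expr: ceph_health_status != 0
        for: 10m
        annotations:
          description: Ceph unhealthy for > 10m
          summary: Ceph unhealthy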

#4 Updated by Jan Fajerski 4 months ago

Tobias Florek wrote:

What's the preferred distribution format? Should there simply be a sample *.rules file included in some package?

For now I was thinking another subfolder in https://github.com/ceph/ceph/tree/master/monitoring would be good. Let's maybe start with a single rules file that can just be dropped into a Prometheus deployment.

Regarding collaborating:

  • Should we just post some alert rules here?

Either that, or attach a file with alerts. Or even just some general remarks; all input is welcome.

  • Should we use node_exporter for the disk utilization?

Yes
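
A node_exporter-based disk utilization rule could then look roughly like this (a sketch; metric names assume node_exporter >= 0.16 and the 10% free-space threshold is only a placeholder):

alert: DiskNearFull
expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100 < 10
for: 15m
labels:
  severity: warning
annotations:
  summary: '{{ $labels.mountpoint }} on {{ $labels.instance }} has less than 10% free space'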

Example Alert: Ceph health not "OK":

[...]

I was thinking it would make sense to group alerts by severity. We could sort them into, say, three severity buckets (critical, warning, info or some such) so users can route the default alerts to different alert channels (nobody wants a text message about a HEALTH_WARN cluster at 3 am on a Saturday, I assume). But again, I'm open to suggestions from experienced operators.
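
To illustrate the severity-bucket idea, each default rule could carry a severity label that an Alertmanager configuration then routes on, roughly like this (a sketch; the receiver names are placeholders and the actual notification settings, e.g. email_configs, are omitted):

# alertmanager.yml (fragment)
route:
  receiver: ops-mail            # default route for anything not matched below
  routes:
    - match:
        severity: critical
      receiver: oncall-pager
    - match:
        severity: warning
      receiver: ops-mail
receivers:
  - name: ops-mail
  - name: oncall-pager

That way critical alerts can go to a pager channel while warning/info alerts stay in mail or chat.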

#5 Updated by Lenz Grimmer 3 months ago

  • Related to Feature #36241: mgr/dashboard: Add support for managing Prometheus alerts added

#6 Updated by Lenz Grimmer about 1 month ago

  • Target version set to v14.0.0
