Feature #24977

closed

Provide a base set of Prometheus alert manager rules that notify the user about common Ceph error conditions

Added by Lenz Grimmer almost 6 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: prometheus module
Target version:
% Done: 0%
Source:
Tags:
Backport: nautilus
Reviewed:
Affected Versions:
Pull request ID:

Description

We should create a number of pre-defined Prometheus alert manager configuration files that trigger alerts on specific Ceph error conditions. At a minimum, the following conditions should trigger an alert:

  • Change in Ceph Cluster health (state change)
  • Disks near full
  • OSDs that are down
  • OSD hosts that are down
  • OSD Host Loss Check
  • Slow OSD response
  • OSDs with High PG Count
  • PGs stuck
  • Network Packet Drops and Errors
  • Pool capacity utilization
  • MONs Down (state change)
  • Cluster Capacity Utilization
  • Capacity forecast warning (if capacity would be exhausted within 6 weeks)

These alert manager configuration files should be included in the upstream Ceph code base as a reference implementation.
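
As an illustration of the capacity forecast condition listed above, here is a minimal Python sketch of the underlying arithmetic: fit a straight line to recent usage samples and check whether the projected usage reaches total capacity within six weeks. It is not part of the proposed rule files. The function names, the sample format (timestamp in seconds, bytes used), and the choice of a least-squares fit are assumptions made for this example. In an actual Prometheus rule, the built-in `predict_linear()` query function performs a comparable linear extrapolation.

```python
from typing import Optional, Sequence, Tuple

# Forecast horizon from the issue description: 6 weeks, in seconds.
SIX_WEEKS_SECONDS = 6 * 7 * 24 * 3600


def seconds_until_full(
    samples: Sequence[Tuple[float, float]], capacity: float
) -> Optional[float]:
    """Estimate seconds from the last sample until usage reaches capacity.

    samples: (timestamp_seconds, bytes_used) pairs, oldest first (hypothetical format).
    Returns None when no growth trend can be derived.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    # Least-squares slope: growth in bytes per second.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # usage is flat or shrinking, so it never fills up
    intercept = mean_u - slope * mean_t
    last_t = samples[-1][0]
    fitted_now = slope * last_t + intercept
    # Clamp to zero if the fitted line is already at or above capacity.
    return max((capacity - fitted_now) / slope, 0.0)


def should_warn(
    samples: Sequence[Tuple[float, float]],
    capacity: float,
    horizon: float = SIX_WEEKS_SECONDS,
) -> bool:
    """True if the linear trend exhausts capacity within the horizon."""
    eta = seconds_until_full(samples, capacity)
    return eta is not None and eta <= horizon
```

For example, with usage growing by one unit per day from 50 to 52, a capacity of 80 is reached in about 28 days and would trigger the warning, while a capacity of 100 is about 48 days away and would not.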


Related issues 2 (0 open, 2 closed)

Related to Dashboard - Feature #36241: mgr/dashboard: Add support for managing Prometheus alerts (Resolved, Stephan Müller)

Copied to mgr - Backport #39540: nautilus: Provide a base set of Prometheus alert manager rules that notify the user about common Ceph error conditions (Resolved, Nathan Cutler)