Support #58737
Spurious alerts from Prometheus rules after daemon restart (status: open)
Description
This comes from the context of Rook, which mirrors https://github.com/ceph/ceph/blob/main/monitoring/ceph-mixin/prometheus_alerts.yml#L507
Alerting rule question: CephNodeDiskspaceWarning uses the expression `predict_linear(node_filesystem_free_bytes{device=~"/.*"}[2d], 3600 * 24 * 5) * on(instance) group_left(nodename) node_uname_info < 0`
We find that when maintenance causes pods (Prometheus? node_exporter?) to restart, we get a lot of spurious alerts, because the extrapolation doesn't have 2 days of baseline data behind it, so `predict_linear` is effectively projecting from a very short window of samples. Is this a known phenomenon? Any suggestions to mitigate the false alarms without missing real problems during those first 2 days? We're wondering whether gating on Prometheus's process start time (roughly, `$now` minus the process start time must exceed 2 days) would work, combined with adding a CephNodeDiskspaceCritical alert on an instant metric at, say, 90% usage, so we don't miss problems in the first two days.
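One possible shape for the mitigation described above is to AND the prediction with a sample-density guard, so the alert only fires once the 2-day range actually contains enough data, and to pair it with an instant-usage fallback. This is a hedged sketch, not the ceph-mixin's actual rules: the `count_over_time` threshold assumes a 15s scrape interval, and the `CephNodeDiskspaceCritical` rule name and 90%-used (10%-free) threshold are taken from the suggestion in this ticket, not from upstream.

```yaml
groups:
  - name: ceph-node-disk  # hypothetical group name
    rules:
      # Only evaluate the linear extrapolation once the 2d window holds
      # at least ~90% of the samples a 15s scrape interval would produce,
      # which suppresses predictions made from a near-empty window after
      # a restart. Both sides of `and` carry identical labels, so the
      # default vector matching applies.
      - alert: CephNodeDiskspaceWarning
        expr: |
          (
            predict_linear(node_filesystem_free_bytes{device=~"/.*"}[2d], 3600 * 24 * 5) < 0
            and
            count_over_time(node_filesystem_free_bytes{device=~"/.*"}[2d])
              > (0.9 * 2 * 24 * 3600 / 15)
          )
          * on(instance) group_left(nodename) node_uname_info
        for: 1h
        labels:
          severity: warning
      # Instant-usage fallback so genuinely full disks still alert while
      # the guard above is suppressing the prediction-based rule.
      - alert: CephNodeDiskspaceCritical  # name proposed in this ticket
        expr: |
          (node_filesystem_avail_bytes{device=~"/.*"}
            / node_filesystem_size_bytes{device=~"/.*"}) < 0.10
        for: 15m
        labels:
          severity: critical
```

A guard on sample count is more direct than comparing against the Prometheus process start time, since a node_exporter pod restart gaps the series even when Prometheus itself kept running; the trade-off is that the scrape-interval constant must match your deployment.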