Support #58737


Spurious alerts from Prometheus rules after daemon restart

Added by Anthony D'Atri about 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Monitoring/Alerting
Target version:
-
% Done:
0%

Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

This comes from the context of Rook, which mirrors https://github.com/ceph/ceph/blob/main/monitoring/ceph-mixin/prometheus_alerts.yml#L507

Alerting rule question: CephNodeDiskspaceWarning uses the expression `predict_linear(node_filesystem_free_bytes{device=~"/.*"}[2d], 3600 * 24 * 5) * on(instance) group_left(nodename) node_uname_info < 0`
We find that when maintenance causes pods (Prometheus? node_exporter?) to restart, we get a lot of spurious alerts, presumably because the extrapolation doesn't have 2 days of baseline data, so we're effectively dividing by zero or by a very small number. Is this a known phenomenon? Any suggestions for mitigating the false alarms without missing real problems during those first 2 days? Wondering if gating on the Prometheus process start time (only alert when $now - start time > 2 days) would work, combined with adding a CephNodeDiskspaceCritical alert on an instant metric at, say, 90% usage so we don't miss problems in the first two days.
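A rough sketch of that two-part idea as Prometheus rules. The alert name CephNodeDiskspaceCritical, the 90% threshold, and the `job` label values are illustrative assumptions, not part of the ceph-mixin; the gate relies on the standard `process_start_time_seconds` metric exposed by Prometheus when it scrapes itself, and so only covers Prometheus restarts, not node_exporter restarts on individual hosts.

```yaml
groups:
  - name: node-disk-space
    rules:
      # Instant-threshold critical alert: evaluates current usage only,
      # so it still fires in the first two days after a restart when
      # predict_linear has insufficient history. Name/threshold assumed.
      - alert: CephNodeDiskspaceCritical
        expr: |
          (1 - node_filesystem_avail_bytes{device=~"/.*"}
                 / node_filesystem_size_bytes{device=~"/.*"}) > 0.90
        for: 5m
        labels:
          severity: critical
      # Predictive warning, gated so it only fires once the local
      # Prometheus has been running longer than the 2d lookback window.
      # Assumes a self-scrape with job="prometheus".
      - alert: CephNodeDiskspaceWarning
        expr: |
          (predict_linear(node_filesystem_free_bytes{device=~"/.*"}[2d], 3600 * 24 * 5)
             * on(instance) group_left(nodename) node_uname_info < 0)
          and on() (time() - process_start_time_seconds{job="prometheus"} > 2 * 86400)
        for: 5m
        labels:
          severity: warning
```

The `and on()` set operation drops every result of the predictive expression while the gate on the right-hand side is empty or false, which is one way to suppress the rule during the warm-up window without editing the core expression.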

