Support #58737
Spurious alerts from Prometheus rules after daemon restart (status: open)
Description
This comes from the context of Rook, which mirrors https://github.com/ceph/ceph/blob/main/monitoring/ceph-mixin/prometheus_alerts.yml#L507
Alerting rule question: CephNodeDiskspaceWarning uses the expression `predict_linear(node_filesystem_free_bytes{device=~"/.*"}[2d], 3600 * 24 * 5) * on(instance) group_left(nodename) node_uname_info < 0`
We find that when maintenance causes pods (Prometheus? node_exporter?) to restart, we get a lot of spurious alerts, because the extrapolation doesn't have 2 days of baseline data behind it, so `predict_linear` is effectively projecting from a very short window of samples. Is this a known phenomenon? Any suggestions to mitigate the false alarms without missing real problems during those first 2 days? We're wondering whether gating on Prometheus's process start time (roughly, `$now` minus the process start time must exceed 2 days) would work, combined with adding a CephNodeDiskspaceCritical alert on an instant metric at, say, 90% usage, so we don't miss problems in the first two days.
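One possible shape for the mitigation described above is to AND the prediction with a sample-density guard, so the alert only fires once the 2-day range actually contains enough data, and to pair it with an instant-usage fallback. This is a hedged sketch, not the ceph-mixin's actual rules: the `count_over_time` threshold assumes a 15s scrape interval, and the `CephNodeDiskspaceCritical` rule name and 90%-used (10%-free) threshold are taken from the suggestion in this ticket, not from upstream.

```yaml
groups:
  - name: ceph-node-disk  # hypothetical group name
    rules:
      # Only evaluate the linear extrapolation once the 2d window holds
      # at least ~90% of the samples a 15s scrape interval would produce,
      # which suppresses predictions made from a near-empty window after
      # a restart. Both sides of `and` carry identical labels, so the
      # default vector matching applies.
      - alert: CephNodeDiskspaceWarning
        expr: |
          (
            predict_linear(node_filesystem_free_bytes{device=~"/.*"}[2d], 3600 * 24 * 5) < 0
            and
            count_over_time(node_filesystem_free_bytes{device=~"/.*"}[2d])
              > (0.9 * 2 * 24 * 3600 / 15)
          )
          * on(instance) group_left(nodename) node_uname_info
        for: 1h
        labels:
          severity: warning
      # Instant-usage fallback so genuinely full disks still alert while
      # the guard above is suppressing the prediction-based rule.
      - alert: CephNodeDiskspaceCritical  # name proposed in this ticket
        expr: |
          (node_filesystem_avail_bytes{device=~"/.*"}
            / node_filesystem_size_bytes{device=~"/.*"}) < 0.10
        for: 15m
        labels:
          severity: critical
```

A guard on sample count is more direct than comparing against the Prometheus process start time, since a node_exporter pod restart gaps the series even when Prometheus itself kept running; the trade-off is that the scrape-interval constant must match your deployment.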