Bug #42982
Monitoring: alert for "pool full" wrong (closed)
Description
Hi all,
I am using the provided Prometheus monitoring on my nautilus cluster with the alerts from:
https://github.com/ceph/ceph/pull/27596/files and https://tracker.ceph.com/issues/24977
There is an alert rule to fire once a ceph pool gets dangerously full:
- alert: pool full
  expr: ceph_pool_stored / ceph_pool_max_avail * on(pool_id) group_right ceph_pool_metadata > 0.9
I am not 100% sure of the actual meaning of ceph_pool_max_avail, but as far as I can infer, it means "if you only put new data into this ceph pool, this is the amount that you can at most add to the cluster before it is full". Or more precisely: "new additional data before it is full". This metric should be the same value as the MAX AVAIL column of ceph df.
This means the alert expression seems wrong: it will fire when ceph_pool_max_avail gets close to ceph_pool_stored, so basically it fires at 50%-ish of your actual MAX AVAIL.
IMO the correct alert should be:
- alert: pool full
  expr: ceph_pool_stored / (ceph_pool_stored + ceph_pool_max_avail) * on(pool_id) group_right ceph_pool_metadata > 0.9
This expression also matches the implementation in the ceph-ansible rules, but there the metric ceph_pool_bytes_used is used:
https://github.com/ceph/ceph-ansible/blob/33bfb10af993faf97a976972f47344ab7ba51edf/roles/ceph-prometheus/files/ceph_dashboard.yml#L76-L77
If anyone can confirm this as a bug, I'm happy to send a PR to fix it.
These (potentially faulty) alert rules are present in git-master and the nautilus branch.
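To make the difference between the two expressions concrete, here is a minimal Python sketch that evaluates both ratios for one pool. It assumes, as inferred in the description, that ceph_pool_max_avail is the amount of additional data the pool can take before it is full. The function names and the sample numbers are hypothetical and only for illustration; they are not part of Ceph or Prometheus.

```python
# Hypothetical helpers that mirror the two alert expressions.
# Assumption: max_avail is the additional data that still fits into the pool.

THRESHOLD = 0.9  # same threshold as in the alert rules


def original_expr(stored: float, max_avail: float) -> float:
    """Value of ceph_pool_stored / ceph_pool_max_avail."""
    return stored / max_avail


def proposed_expr(stored: float, max_avail: float) -> float:
    """Value of ceph_pool_stored / (ceph_pool_stored + ceph_pool_max_avail)."""
    return stored / (stored + max_avail)


# Illustrative pool: 48 units stored, 52 units still available (48% full).
stored, max_avail = 48.0, 52.0

print("original fires:", original_expr(stored, max_avail) > THRESHOLD)
print("proposed fires:", proposed_expr(stored, max_avail) > THRESHOLD)

# Fill fraction at which the original expression crosses the threshold:
# stored / max_avail = t  is equivalent to  stored / (stored + max_avail) = t / (1 + t)
print("original fires at fill fraction:", THRESHOLD / (1 + THRESHOLD))
```

With these sample numbers the original expression is already above 0.9 while the proposed one is not, which is consistent with the "50%-ish" observation above.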
Updated by Greg Farnum over 4 years ago
- Project changed from Ceph to mgr
- Category set to prometheus module
Updated by Thomas Kriechbaumer over 4 years ago
One of our pools has now reached a value of 1.04 based on the original expression.
I would say it is obvious that the alert is wrong in this case, since our cluster is still healthy.
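As a rough check of the reported value, the following sketch converts a result of the original expression into the fill fraction that the proposed expression would report. It relies on the same assumption about ceph_pool_max_avail as the description (additional data before the pool is full); the helper name is hypothetical.

```python
def fill_fraction_from_original(ratio: float) -> float:
    """Convert r = stored / max_avail into stored / (stored + max_avail).

    From stored = r * max_avail it follows that
    stored / (stored + max_avail) = r / (1 + r).
    """
    return ratio / (1.0 + ratio)


# Value reported for the pool under the original expression.
reported = 1.04
print(fill_fraction_from_original(reported))
```

Under that assumption, a value of 1.04 corresponds to a pool that is only about half full, well below the 0.9 threshold of the proposed expression.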
Updated by Jan Fajerski over 4 years ago
- Status changed from New to Pending Backport
- Pull request ID set to 32325
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #43732: nautilus: Monitoring: alert for "pool full" wrong added
Updated by Stephan Müller almost 4 years ago
- Related to Bug #41829: ceph df reports incorrect pool usage added
Updated by Stephan Müller almost 4 years ago
- Related to Bug #40203: ceph df shows incorrect usage added
Updated by Stephan Müller almost 4 years ago
- Related to deleted (Bug #41829: ceph df reports incorrect pool usage)
Updated by Stephan Müller almost 4 years ago
- Related to deleted (Bug #40203: ceph df shows incorrect usage)
Updated by Konstantin Shalygin over 2 years ago
- Status changed from Pending Backport to Resolved