Bug #42982

closed

Monitoring: alert for "pool full" wrong

Added by Thomas Kriechbaumer over 4 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
prometheus module
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
nautilus
Regression:
No
Severity:
4 - irritation
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi all,

I am using the provided Prometheus monitoring on my nautilus cluster with the alerts from:
https://github.com/ceph/ceph/pull/27596/files and https://tracker.ceph.com/issues/24977

There is an alert rule to fire once a ceph pool gets dangerously full:

- alert: pool full
  expr: ceph_pool_stored / ceph_pool_max_avail * on(pool_id) group_right ceph_pool_metadata > 0.9

I am not 100% sure about the exact meaning of ceph_pool_max_avail, but as far as I can infer, it means "if you only put new data into this ceph pool, this is the most you can add to the cluster before it is full". Or more precisely: the amount of new, additional data that fits before the pool is full. This metric should have the same value as the MAX AVAIL column of ceph df.

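This means the alert expression seems wrong: it fires once ceph_pool_stored exceeds 90% of ceph_pool_max_avail, i.e. when the two values get close to each other. That is at roughly 47% (0.9 / 1.9) of the pool's actual capacity, ceph_pool_stored + ceph_pool_max_avail, so basically it fires at 50%-ish instead of 90%.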

IMO the correct alert should be:

- alert: pool full
  expr: ceph_pool_stored / (ceph_pool_stored + ceph_pool_max_avail) * on(pool_id) group_right ceph_pool_metadata > 0.9

This expression also matches the implementation in the ceph-ansible rules, although there the metric ceph_pool_bytes_used is used:
https://github.com/ceph/ceph-ansible/blob/33bfb10af993faf97a976972f47344ab7ba51edf/roles/ceph-prometheus/files/ceph_dashboard.yml#L76-L77
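
To make the difference between the two expressions concrete, here is a small arithmetic sketch in Python. It assumes ceph_pool_max_avail really means "additional data that still fits", as inferred above; the pool sizes and the helper names old_ratio and new_ratio are made up for illustration and are not part of Ceph or Prometheus.

```python
THRESHOLD = 0.9  # same threshold as in both alert expressions


def old_ratio(stored: float, max_avail: float) -> float:
    """Ratio used by the current rule: stored / max_avail."""
    return stored / max_avail


def new_ratio(stored: float, max_avail: float) -> float:
    """Ratio used by the proposed rule: stored / (stored + max_avail)."""
    return stored / (stored + max_avail)


# Hypothetical pool A: 48 TiB stored, 52 TiB still available (48% full).
# Hypothetical pool B: 91 TiB stored, 9 TiB still available (91% full).
for name, stored, max_avail in [("A", 48.0, 52.0), ("B", 91.0, 9.0)]:
    print(
        f"pool {name}: "
        f"current rule fires={old_ratio(stored, max_avail) > THRESHOLD}, "
        f"proposed rule fires={new_ratio(stored, max_avail) > THRESHOLD}"
    )
```

Under these assumptions the current rule already fires for pool A, which is less than half full, while the proposed rule fires only for pool B.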

If anyone can confirm this as a bug, I'm happy to send a PR to fix it.
These (potentially faulty) alert rules are present in git-master and the nautilus branch.


Related issues 1 (0 open, 1 closed)

Copied to mgr - Backport #43732: nautilus: Monitoring: alert for "pool full" wrong (Rejected, Alfonso Martínez)