Bug #42982: Monitoring: alert for "pool full" wrong - mgr - Ceph

Bug #42982

closed

Monitoring: alert for "pool full" wrong

Added by Thomas Kriechbaumer over 4 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee: Jan Fajerski
Category: prometheus module
Backport: nautilus
Severity: 4 - irritation
Pull request ID: 32325

Description

Hi all,

I am using the provided Prometheus monitoring on my nautilus cluster with the alerts from:
https://github.com/ceph/ceph/pull/27596/files and https://tracker.ceph.com/issues/24977

There is an alert rule to fire once a ceph pool gets dangerously full:

- alert: pool full
  expr: ceph_pool_stored / ceph_pool_max_avail * on(pool_id) group_right ceph_pool_metadata > 0.9

I am not 100% sure about the actual meaning of ceph_pool_max_avail, but as far as I can infer, it means "if you only put new data into this ceph pool, this is the most you can add to the cluster before it is full". In other words, it is the amount of new, additional data that fits before the pool is full. This metric should have the same value as the MAX AVAIL column of ceph df.

This means the alert expression seems wrong: it fires once ceph_pool_stored reaches 90% of ceph_pool_max_avail, which corresponds to the pool being only about 47% full (0.9 / 1.9), not 90% full.

IMO the correct alert should be:

- alert: pool full
  expr: ceph_pool_stored / (ceph_pool_stored + ceph_pool_max_avail) * on(pool_id) group_right ceph_pool_metadata > 0.9
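To illustrate the difference between the two expressions, here is a minimal Python sketch. It assumes ceph_pool_max_avail is the remaining writable capacity, as inferred above. The function names and the byte values are hypothetical and only there for illustration.

```python
def original_ratio(stored: float, max_avail: float) -> float:
    """Ratio used by the current rule: stored / max_avail."""
    return stored / max_avail


def proposed_ratio(stored: float, max_avail: float) -> float:
    """Ratio used by the proposed rule: stored / (stored + max_avail)."""
    return stored / (stored + max_avail)


THRESHOLD = 0.9

# Hypothetical pool: 48 TiB stored, 52 TiB still available (48% full).
stored, max_avail = 48.0, 52.0

# The current rule already fires although the pool is less than half full.
print(original_ratio(stored, max_avail) > THRESHOLD)
# The proposed rule stays quiet until the pool is actually 90% full.
print(proposed_ratio(stored, max_avail) > THRESHOLD)
```

With these made-up numbers the current rule fires and the proposed rule does not.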

This expression also matches the implementation in the ceph-ansible rules, although those use the metric ceph_pool_bytes_used instead:
https://github.com/ceph/ceph-ansible/blob/33bfb10af993faf97a976972f47344ab7ba51edf/roles/ceph-prometheus/files/ceph_dashboard.yml#L76-L77

If anyone can confirm this as a bug, I'm happy to send a PR to fix it.
These (potentially faulty) alert rules are present in git-master and the nautilus branch.

Related issues 1 (0 open — 1 closed)

Updated by Greg Farnum over 4 years ago

Project changed from Ceph to mgr
Category set to prometheus module

Updated by Jan Fajerski over 4 years ago

Assignee set to Jan Fajerski

Updated by Thomas Kriechbaumer over 4 years ago

One of our pools has now reached a value of 1.04 based on the original expression.
I would say it is obvious that the alert is wrong in this case, since our cluster is still healthy.
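As a rough sanity check, a value r of the original expression (stored / max_avail) can be converted into an actual fill fraction as r / (1 + r), again assuming max_avail is the remaining capacity. The helper name below is hypothetical.

```python
def fill_fraction_from_ratio(ratio: float) -> float:
    """Convert stored / max_avail into stored / (stored + max_avail)."""
    return ratio / (1.0 + ratio)


# The reported value of 1.04 corresponds to a pool that is about half full.
print(round(fill_fraction_from_ratio(1.04), 3))  # 0.51
# The 0.9 threshold of the original rule corresponds to about 47% full.
print(round(fill_fraction_from_ratio(0.9), 3))   # 0.474
```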
