Bug #42982

closed

Monitoring: alert for "pool full" wrong

Added by Thomas Kriechbaumer over 4 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: prometheus module
Target version: -
% Done: 0%
Source:
Tags:
Backport: nautilus
Regression: No
Severity: 4 - irritation
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi all,

I am using the provided Prometheus monitoring on my nautilus cluster with the alerts from:
https://github.com/ceph/ceph/pull/27596/files and https://tracker.ceph.com/issues/24977

There is an alert rule to fire once a ceph pool gets dangerously full:

- alert: pool full
  expr: ceph_pool_stored / ceph_pool_max_avail * on(pool_id) group_right ceph_pool_metadata > 0.9

I am not 100% sure about the actual meaning of ceph_pool_max_avail, but as far as I can infer, it means "if you only put new data into this ceph pool, this is the most you can add to the cluster before it is full". Or more precisely: "new additional data before it is full". This metric should have the same value as the MAX AVAIL column of ceph df.

This means the alert expression seems wrong: it will fire once ceph_pool_stored exceeds 0.9 × ceph_pool_max_avail, i.e. when the stored data and the remaining space are roughly equal, so basically it fires when the pool is only 50%-ish full (0.9 / 1.9 ≈ 47%).

IMO the correct alert should be:

- alert: pool full
  expr: ceph_pool_stored / (ceph_pool_stored + ceph_pool_max_avail) * on(pool_id) group_right ceph_pool_metadata > 0.9
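To illustrate the difference between the two expressions, here is a minimal Python sketch. It assumes ceph_pool_max_avail really is "additional data the pool can take before it is full" (my interpretation above, not confirmed). The function names and the 100 TiB pool are made up for illustration, and treating max_avail as capacity minus stored is a simplification, since pools normally share the underlying capacity.

```python
def original_ratio(stored: float, max_avail: float) -> float:
    """Ratio used by the current rule: stored / max_avail."""
    return stored / max_avail


def proposed_ratio(stored: float, max_avail: float) -> float:
    """Ratio used by the proposed rule: stored / (stored + max_avail)."""
    return stored / (stored + max_avail)


THRESHOLD = 0.9  # same threshold as in the alert rule

# Hypothetical single pool with 100 TiB usable capacity (values in TiB).
# Assumption: max_avail = capacity - stored, which ignores other pools.
capacity = 100.0
for stored in (10.0, 40.0, 48.0, 60.0, 91.0):
    max_avail = capacity - stored
    old = original_ratio(stored, max_avail)
    new = proposed_ratio(stored, max_avail)
    print(
        f"stored={stored:5.1f} "
        f"old={old:6.2f} fires={old > THRESHOLD} "
        f"new={new:4.2f} fires={new > THRESHOLD}"
    )
```

With these made-up numbers the current rule already fires at 48 TiB stored (48 / 52 is above 0.9), while the proposed rule only fires once more than 90 TiB are stored.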

This expression also matches the implementation in the ceph-ansible rules, but there the metric ceph_pool_bytes_used is used:
https://github.com/ceph/ceph-ansible/blob/33bfb10af993faf97a976972f47344ab7ba51edf/roles/ceph-prometheus/files/ceph_dashboard.yml#L76-L77

If anyone can confirm this as a bug, I'm happy to send a PR to fix it.
These (potentially faulty) alert rules are present in git-master and the nautilus branch.


Related issues 1 (0 open, 1 closed)

Copied to mgr - Backport #43732: nautilus: Monitoring: alert for "pool full" wrong | Rejected | Alfonso Martínez
#1

Updated by Greg Farnum over 4 years ago

  • Project changed from Ceph to mgr
  • Category set to prometheus module
#2

Updated by Jan Fajerski over 4 years ago

  • Assignee set to Jan Fajerski
#3

Updated by Thomas Kriechbaumer over 4 years ago

One of our pools has now reached a value of 1.04 based on the original expression.
I would say it is obvious that the alert is wrong in this case, since our cluster is still healthy.
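For reference, a value r of the original expression can be converted into the fill level the proposed expression would report: since r = stored / max_avail, it follows that stored / (stored + max_avail) = r / (1 + r). A quick sketch, where the helper name is hypothetical and the same interpretation of ceph_pool_max_avail as in the description is assumed:

```python
def fill_fraction_from_original_ratio(r: float) -> float:
    """Convert r = stored / max_avail into stored / (stored + max_avail)."""
    return r / (1.0 + r)


# Value observed on our pool with the original expression.
print(round(fill_fraction_from_original_ratio(1.04), 3))  # prints 0.51
```

So under that assumption a reading of 1.04 corresponds to a pool that is only about half full.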

#4

Updated by Alfonso Martínez over 4 years ago

  • Backport set to nautilus
#5

Updated by Jan Fajerski over 4 years ago

  • Status changed from New to Pending Backport
  • Pull request ID set to 32325
#6

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #43732: nautilus: Monitoring: alert for "pool full" wrong added
#7

Updated by Stephan Müller almost 4 years ago

  • Related to Bug #41829: ceph df reports incorrect pool usage added
#8

Updated by Stephan Müller almost 4 years ago

  • Related to Bug #40203: ceph df shows incorrect usage added
#9

Updated by Stephan Müller almost 4 years ago

  • Related to deleted (Bug #41829: ceph df reports incorrect pool usage)
#10

Updated by Stephan Müller almost 4 years ago

  • Related to deleted (Bug #40203: ceph df shows incorrect usage)
#11

Updated by Konstantin Shalygin over 2 years ago

  • Status changed from Pending Backport to Resolved