Bug #64321

open

mgr/dashboard: dashboards and alerts from ceph-mixins not fully compatible with showMultiCluster=true (multiple Ceph clusters in the same Prometheus instance)

Added by Christian Rohmann 3 months ago. Updated 3 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description of problem

The ceph-mixins allow for dashboards and alerts to be made compatible with metrics of multiple Ceph clusters being stored in the same Prometheus instance. This can be achieved via the settings

    clusterLabel: 'cluster',
    showMultiCluster: true,

inside of https://github.com/ceph/ceph/blob/main/monitoring/ceph-mixin/config.libsonnet and then recompiling the dashboards and alerts.

Environment

  • ceph version string: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
  • Platform (OS/distro/release): Ubuntu 20.04 (Jammy)

How reproducible

Steps:

  1. Set showMultiCluster to true in config.libsonnet.
  2. Run make generate.
  3. Inspect dashboards_out and prometheus_alerts.yml and check whether the cluster label is used consistently, so that individual clusters can be targeted and metrics of multiple clusters stored in the same Prometheus instance are tolerated.
  4. Run make test. Some tests also seem to fail when the showMultiCluster option is enabled. Maybe testing of this option is not properly implemented at all?
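
The check in step 3 can be partly automated. The following is a rough Python sketch, not part of ceph-mixin: selectors_missing_label is a hypothetical helper that uses a regular-expression heuristic (not a real PromQL parser) to list ceph_* metric selectors in an expression that carry no matcher on the cluster label. It assumes the label is named cluster, as set via clusterLabel, and that the expressions have already been extracted from the generated files. The sample expressions are made up for illustration.

```python
import re

# Heuristic tokenizer: skip grouping/matching clauses such as "on (...)",
# capture ceph_* metrics with their optional {...} selector, and skip the
# selectors of any other metric so label names inside them are ignored.
_TOKEN = re.compile(
    r"\b(?:on|by|without|ignoring|group_left|group_right)\s*\([^)]*\)"
    r"|\b(ceph_\w+)\s*(\{[^}]*\})?"
    r"|\{[^}]*\}"
)


def selectors_missing_label(expr, label="cluster"):
    """Return the ceph_* metrics in expr whose selector has no matcher on label."""
    label_matcher = re.compile(r"\b" + re.escape(label) + r"\s*(=~|!~|!=|=)")
    missing = []
    for match in _TOKEN.finditer(expr):
        metric, selector = match.group(1), match.group(2)
        if metric is None:
            continue  # a grouping clause or a non-ceph selector
        if selector is None or not label_matcher.search(selector):
            missing.append(metric)
    return missing


# Illustrative expressions (assumptions, not copied from the generated output).
filtered = 'sum by (ceph_daemon) (ceph_osd_op_r{cluster=~"$cluster"})'
partly_filtered = (
    'ceph_osd_up{cluster=~"$cluster"} * on (ceph_daemon, cluster) ceph_osd_metadata'
)

print(selectors_missing_label(filtered))         # []
print(selectors_missing_label(partly_filtered))  # ['ceph_osd_metadata']
```

Because this is only a heuristic, anything it reports still needs a manual look at the query in question.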

Actual results

Some queries don't filter on the cluster label, so metrics of multiple clusters are returned. This results in dashboards showing metrics of multiple clusters in the same graphs or, in the case of joins, in label collisions, because the same label and value, e.g. ceph_daemon="osd.0", is present multiple times (once per cluster). For alerts that use joins, these collisions cause the alerts not to be evaluated. In addition, the cluster name is not mentioned consistently in the alert description or summary.
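
To illustrate the collision, here is a small Python sketch that imitates, in a very simplified way, how a one-to-one join on a set of labels behaves. It is not Prometheus code: one_to_one_join is a hypothetical helper, and the series and the hostname values are made-up sample data. Joining only on ceph_daemon fails as soon as two clusters both have an osd.0, while adding the cluster label to the join keeps the series apart.

```python
from collections import defaultdict


def one_to_one_join(left, right, on):
    """Multiply left and right samples whose `on` labels match one-to-one.

    Each side is a list of (labels, value) pairs. A ValueError is raised if
    more than one series on either side falls into the same match group.
    """
    def key(labels):
        return tuple(labels.get(name, "") for name in on)

    right_by_key = defaultdict(list)
    for labels, value in right:
        right_by_key[key(labels)].append(value)

    seen_left = set()
    result = {}
    for labels, value in left:
        k = key(labels)
        if k in seen_left or len(right_by_key[k]) > 1:
            raise ValueError(f"duplicate series for match group {dict(zip(on, k))}")
        seen_left.add(k)
        if right_by_key[k]:
            result[k] = value * right_by_key[k][0]
    return result


# Two clusters stored in the same Prometheus instance, each with an osd.0.
osd_up = [
    ({"cluster": "a", "ceph_daemon": "osd.0"}, 1.0),
    ({"cluster": "b", "ceph_daemon": "osd.0"}, 0.0),
]
osd_metadata = [
    ({"cluster": "a", "ceph_daemon": "osd.0", "hostname": "host-a"}, 1.0),
    ({"cluster": "b", "ceph_daemon": "osd.0", "hostname": "host-b"}, 1.0),
]

try:
    one_to_one_join(osd_up, osd_metadata, on=("ceph_daemon",))
except ValueError as err:
    print("join on ceph_daemon only:", err)

# Including the cluster label makes every match group unique again.
print(one_to_one_join(osd_up, osd_metadata, on=("ceph_daemon", "cluster")))
```

The same idea applies to the generated queries: every join needs the cluster label among its matching labels, and every selector needs a matcher on it.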

Expected results

After selecting a cluster via the template variable in Grafana, only metrics for that Ceph cluster are shown.
For alerts, I expect them to work with a single Prometheus instance hosting the metrics of multiple Ceph clusters.

Additional info

There also seem to be some inconsistencies in the "style" of dealing with the instance label (vs. hostname).
I raised a separate, more general bug about that: https://tracker.ceph.com/issues/64288
