Bug #64321: mgr/dashboard: dashboards and alerts from ceph-mixins not fully compatible with showMultiCluster=true (multiple Ceph clusters in the same Prometheus instance) - Dashboard - Ceph

Bug #64321

open

mgr/dashboard: dashboards and alerts from ceph-mixins not fully compatible with showMultiCluster=true (multiple Ceph clusters in the same Prometheus instance)

Added by Christian Rohmann 3 months ago. Updated 2 days ago.

Status: Pending Backport
Priority: Normal
Assignee:
Category:
Target version:
% Done:
Source:
Tags:
Backport: squid,reef
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID: 55495
Crash signature (v1):
Crash signature (v2):

Description

Description of problem

The ceph-mixins allow dashboards and alerts to be made compatible with metrics from multiple Ceph clusters stored in the same Prometheus instance. This can be achieved via the settings

    clusterLabel: 'cluster',
    showMultiCluster: true,

in https://github.com/ceph/ceph/blob/main/monitoring/ceph-mixin/config.libsonnet, followed by recompiling the dashboards and alerts.

Environment

ceph version string: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
Platform (OS/distro/release): Ubuntu 20.04 (Jammy)

How reproducible

Steps:

1. Set showMultiCluster to true.
2. Run make generate.
3. Inspect dashboards_out and prometheus_alerts.yml and check whether the cluster label is used consistently, so that individual clusters can be targeted and metrics from multiple clusters can coexist in the same Prometheus instance.

Some tests (make test) also seem to fail when the showMultiCluster option is enabled. Maybe testing with this option is not properly implemented at all?
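The manual inspection of the generated dashboards could be partly automated. The following is only a rough sketch, assuming that the generated Grafana JSON stores each query under an "expr" key and that the cluster label is literally named cluster (the clusterLabel setting). The helper names are made up for illustration and are not part of ceph-mixin. The check is a heuristic: a query that filters on cluster in one selector but not in another is not flagged.

```python
import json
import re
from pathlib import Path

# Assumption: the cluster label is named "cluster" (the clusterLabel setting).
# Matches label matchers such as cluster="x" or cluster=~"$cluster".
CLUSTER_MATCHER = re.compile(r'\bcluster\s*(=~|!~|!=|=)')


def iter_exprs(node):
    """Yield every string stored under an "expr" key, at any nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "expr" and isinstance(value, str):
                yield value
            else:
                yield from iter_exprs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_exprs(item)


def exprs_without_cluster(dashboard):
    """Return the queries of one dashboard that contain no cluster matcher."""
    return [e for e in iter_exprs(dashboard) if not CLUSTER_MATCHER.search(e)]


def scan_directory(path):
    """Map each dashboard file name to its suspicious queries (hypothetical helper)."""
    report = {}
    for file in sorted(Path(path).glob("*.json")):
        missing = exprs_without_cluster(json.loads(file.read_text()))
        if missing:
            report[file.name] = missing
    return report
```

Calling scan_directory("dashboards_out") from monitoring/ceph-mixin after make generate would then list, per dashboard file, the queries that deserve a closer look.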

Actual results

Some queries don't filter on the cluster label, so metrics of multiple clusters are returned. This results in dashboards showing metrics of multiple clusters in the same graphs or, in the case of joins, in label collisions, because the same label and value, e.g. ceph_daemon="osd.0", is present multiple times (once per cluster). For alerts that use joins, these collisions prevent the alerts from being evaluated. The cluster name is also not mentioned consistently in the alert description or summary.
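To illustrate the join collision, here is a small, simplified model of one-to-one vector matching in plain Python. It is not Prometheus code; the function name and the sample label values (clusters "a" and "b", the hostnames) are invented for this sketch. It shows that joining only on ceph_daemon fails once two clusters both have an osd.0, while adding cluster to the matching labels resolves it.

```python
def join_one_to_one(left, right, on):
    """Pair series whose `on` labels are equal, roughly like `left * on(...) right`.

    Raises ValueError when several series on one side share the same match key,
    which mimics the duplicate-series failure of one-to-one matching.
    """
    def index(series):
        table = {}
        for labels in series:
            key = tuple(labels.get(name, "") for name in on)
            if key in table:
                raise ValueError(
                    f"duplicate series for match group {dict(zip(on, key))}"
                )
            table[key] = labels
        return table

    left_index, right_index = index(left), index(right)
    return [(left_index[k], right_index[k]) for k in left_index if k in right_index]


# Hypothetical series: two clusters that both have an osd.0.
osd_up = [
    {"cluster": "a", "ceph_daemon": "osd.0"},
    {"cluster": "b", "ceph_daemon": "osd.0"},
]
osd_metadata = [
    {"cluster": "a", "ceph_daemon": "osd.0", "hostname": "host-a1"},
    {"cluster": "b", "ceph_daemon": "osd.0", "hostname": "host-b1"},
]

# Joining on ceph_daemon alone collides across clusters.
try:
    join_one_to_one(osd_up, osd_metadata, on=["ceph_daemon"])
    collided = False
except ValueError:
    collided = True

# Including the cluster label in the match keeps the series apart.
pairs = join_one_to_one(osd_up, osd_metadata, on=["ceph_daemon", "cluster"])
```

In this model the first join raises, whereas the second returns one pair per cluster. Filtering both sides of a query on a single cluster would avoid the collision as well.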

Expected results

After selecting a cluster in the Grafana template, only metrics for that Ceph cluster are shown.
For alerts, I expect them to work with a single Prometheus instance hosting the metrics for multiple Ceph clusters.

Additional info

There also seem to be some inconsistencies in the "style" of dealing with the instance label (vs. hostname).
I raised a separate bug about that in general: https://tracker.ceph.com/issues/64288