Bug #64598

open

Radosgw Instance ID Mismatch between metadata counters and RGW exporter metrics

Added by Ali Maredia 2 months ago. Updated 8 days ago.

Status: Triaged
Priority: High
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Currently, the RGW metrics consumed by Prometheus are generated by combining two parts:
1. The RGW perf counters: the ceph-exporter generates these by parsing the output of the RGW command `ceph counter dump`.
2. The RGW metadata (daemon, ceph-version, hostname, etc.): the prometheus mgr module generates this information.

To combine the two parts, ceph-exporter uses a key field called instance_id, which is generated as follows:
1. On the ceph-exporter side, the admin socket (asok) filename is parsed to extract the daemon_id, which is then used to derive the instance_id.
2. On the prometheus mgr module side, the orchestrator (cephadm or Rook) is called to get the daemon_id, and the instance_id is then derived from that daemon_id.
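
The two derivation paths above can be sketched as follows. This is only an illustration, not the actual ceph-exporter or mgr code: the socket filename layout, the example daemon_id strings, and the rule "take the last dot-separated component" are all assumptions made for the example, as are the helper names `daemon_id_from_asok` and `instance_id_from_daemon_id`.

```python
from pathlib import PurePosixPath

# Hypothetical filename layout, assumed for this sketch only:
#   ceph-client.rgw.<daemon_id>.asok
ASOK_PREFIX = "ceph-client.rgw."
ASOK_SUFFIX = ".asok"


def daemon_id_from_asok(path: str) -> str:
    """Exporter-side path: recover the daemon_id from a socket filename."""
    name = PurePosixPath(path).name
    if not (name.startswith(ASOK_PREFIX) and name.endswith(ASOK_SUFFIX)):
        raise ValueError(f"unexpected admin socket filename: {name}")
    return name[len(ASOK_PREFIX):-len(ASOK_SUFFIX)]


def instance_id_from_daemon_id(daemon_id: str) -> str:
    """Hypothetical rule: use the last dot-separated component."""
    return daemon_id.rsplit(".", 1)[-1]


# Exporter side: the daemon_id comes from the socket filename.
exporter_daemon_id = daemon_id_from_asok(
    "/var/run/ceph/ceph-client.rgw.myzone.host1.abcdef.asok"
)
exporter_instance_id = instance_id_from_daemon_id(exporter_daemon_id)

# Mgr side: the daemon_id comes from the orchestrator. If the orchestrator
# names daemons differently (made-up value here), the derived id differs.
mgr_daemon_id = "my.store.a"
mgr_instance_id = instance_id_from_daemon_id(mgr_daemon_id)

print(exporter_instance_id, mgr_instance_id)
```

Because each side starts from a different source for the daemon_id, the same derivation rule only yields the same instance_id when both sources happen to agree on the daemon naming.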

This design suffers from the following issues:
1. It creates a strong dependency between the prometheus mgr module and the orchestrator module (this has already caused issues for Rook environments: metrics are completely broken in ceph v18.2.1 because of this).
2. instance_id handling on the ceph-exporter side is fragile, as it relies on parsing the socket filename.
3. instance_id generation is error-prone, as it relies on how daemon_ids are handled by the orchestrator module (which differs between Rook and cephadm).

The issue for RGW is that with certain orchestrators, for example Rook, the instance IDs of the metrics emitted by the exporter do not match those of the metrics from the prometheus mgr module.
This breaks Prometheus queries that use the instance ID as the join key between the two sets of metrics: series whose IDs do not match are not joined.
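
The effect on queries can be sketched as a plain inner join on instance_id. This is a simplified model, not Prometheus itself; the sample label sets, the `put_ops` and `hostname` fields, and the helper name `join_on_instance_id` are hypothetical values chosen for the example.

```python
def join_on_instance_id(perf_samples, metadata):
    """Inner join of two lists of label dicts on the 'instance_id' key."""
    meta_by_id = {m["instance_id"]: m for m in metadata}
    joined = []
    for sample in perf_samples:
        meta = meta_by_id.get(sample["instance_id"])
        if meta is not None:
            # Merge metadata labels into the perf counter sample.
            joined.append({**meta, **sample})
    return joined


# Hypothetical perf counter sample, as the exporter might label it.
exporter_samples = [{"instance_id": "a", "put_ops": 42}]

# Hypothetical metadata where both sides agree on the id.
matching_metadata = [{"instance_id": "a", "hostname": "node1"}]

# Hypothetical metadata where the mgr side derived a different id.
mismatched_metadata = [{"instance_id": "my-store-a", "hostname": "node1"}]

print(join_on_instance_id(exporter_samples, matching_metadata))
print(join_on_instance_id(exporter_samples, mismatched_metadata))
```

With matching IDs the sample picks up its metadata; with mismatched IDs the join is empty, which corresponds to queries silently returning no data rather than raising an error.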

An email discussion about this is in progress. This tracker issue will be updated once consensus is reached on a solution.

Actions #1

Updated by Ali Maredia 2 months ago

  • Status changed from New to Triaged
Actions #2

Updated by Casey Bodley 8 days ago

  • Assignee deleted (Ali Maredia)