Bug #42506

Prometheus module response times are consistently slow

Added by Janek Bevendorff over 4 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

We have a cluster with 5 MONs/MGRs and 1248 OSDs.

Our Prometheus nodes poll the MGRs for metrics every few seconds from three different hosts. The average response time of the /metrics endpoint is about 3 seconds, which is slow for a scrape target and suggests that a large amount of work is done behind the scenes. I wonder whether this can be optimised. After all, ceph status is much faster.
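For anyone who wants to reproduce the measurement, here is a minimal sketch of how the response time could be sampled. It is not the method used for the figure above. The host name `mgr-host` is a placeholder, port 9283 is assumed to be the module's default listening port, and the helper names `time_scrapes`, `fetch_metrics` and `summarize` are made up for illustration.

```python
import statistics
import time
import urllib.request


def fetch_metrics(url="http://mgr-host:9283/metrics", timeout=30):
    """Fetch one scrape of the metrics endpoint and return the raw body.

    The URL is an assumption: replace mgr-host with the active MGR.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def time_scrapes(fetch, samples=5, pause=0.0):
    """Call fetch() `samples` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(samples):
        start = time.perf_counter()
        fetch()
        durations.append(time.perf_counter() - start)
        if pause:
            time.sleep(pause)
    return durations


def summarize(durations):
    """Reduce a list of durations to min, mean and max."""
    return {
        "min": min(durations),
        "mean": statistics.mean(durations),
        "max": max(durations),
    }
```

Running `summarize(time_scrapes(fetch_metrics, samples=10, pause=1.0))` against the active MGR gives a rough picture of the latency. Passing the fetch function as an argument also makes it easy to time something else for comparison.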

Perhaps it would also be a good idea to remove the MGR bottleneck altogether: each host could report its own OSDs, and the MGR hosts would report only general cluster statistics. That would also make debugging easier, because Prometheus would no longer report the MGR host as the instance for every alert.
