Bug #63927

Ceph dashboard: Cluster Utilization metrics do not work

Added by Morteza Bashsiz 4 months ago. Updated 3 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have upgraded ceph from 18.2.0 to 18.2.1

The Cluster Utilization metrics are not working: all values show N/A, and our monitoring is affected as well.

Actions #1

Updated by Nizamudeen A 4 months ago

The Cluster Utilization panel gets its metrics from Prometheus, so you'll need to have Prometheus configured in order to view them. If you have Prometheus running, can you check whether PROMETHEUS_API_HOST is set to the URL of Prometheus?

ceph dashboard get-prometheus-api-host
ceph dashboard set-prometheus-api-host 'http://localhost:9090'
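
If the dashboard still shows N/A after setting the host, one quick sanity check is to query the Prometheus HTTP API directly. A minimal sketch, assuming the standard Prometheus instant-query endpoint and the `localhost:9090` example host from the commands above (the helper names here are illustrative, not part of Ceph):

```python
import json
from urllib.parse import urlencode

def build_query_url(api_host: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for the host configured via
    `ceph dashboard set-prometheus-api-host`."""
    return f"{api_host.rstrip('/')}/api/v1/query?" + urlencode({"query": promql})

def has_results(body: str) -> bool:
    """True if a Prometheus API response reports success and returns samples."""
    reply = json.loads(body)
    return reply.get("status") == "success" and bool(reply.get("data", {}).get("result"))

# On a live cluster you would fetch the URL, e.g. with urllib.request.urlopen().
# Abridged response shape for illustration:
sample = '{"status": "success", "data": {"result": [{"metric": {"job": "ceph"}, "value": [0, "1"]}]}}'
print(build_query_url("http://localhost:9090", "up"))
print(has_results(sample))
```

If `has_results` is False for a basic query like `up`, the dashboard has nothing to render regardless of its configuration.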

Actions #2

Updated by Morteza Bashsiz 4 months ago

Nizamudeen A wrote:

The Cluster Utilization panel gets its metrics from Prometheus, so you'll need to have Prometheus configured in order to view them. If you have Prometheus running, can you check whether PROMETHEUS_API_HOST is set to the URL of Prometheus?

ceph dashboard get-prometheus-api-host
ceph dashboard set-prometheus-api-host 'http://localhost:9090'

Thanks for your answer.
The mgr/prometheus module is enabled, the same as before (we use the Prometheus endpoint inside the manager). We have never deployed the node-exporter or Prometheus services.
Here are the services running in both clusters (18.2.0 vs. 18.2.1); osd is excluded for brevity:

root@cluster1:/# ceph -v 
ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
root@cluster1:/# ceph orch ls | grep -v osd 
NAME                             PORTS   RUNNING  REFRESHED  AGE  PLACEMENT          
mgr                                          2/2  9m ago     6d   count:2;label:mon  
mon                                          3/3  9m ago     6d   label:mon          
root@cluster1:/#
root@cluster2:/# ceph -v 
ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
root@cluster2:/# ceph orch ls | grep -v osd
NAME                             PORTS   RUNNING  REFRESHED  AGE  PLACEMENT          
mgr                                          2/2  10m ago    11w  count:2;label:mon  
mon                                          5/5  10m ago    11w  label:mon          
root@cluster2:/#

The problem is that the mgr/prometheus metrics differ in 18.2.1: some metrics that exist in 18.2.0 do not exist in 18.2.1.
For example, note the following metric:

root@cluster1:/# ceph -v 
ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
root@cluster1:/# curl -s -XGET http://127.0.0.1:9283/metrics | grep ceph_osd_numpg | wc -l
30
root@cluster1:/#
root@cluster2:/# ceph -v 
ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
root@cluster2:/# curl -s -XGET http://127.0.0.1:9283/metrics | grep ceph_osd_numpg | wc -l
0
root@cluster2:/#

As you can see, some metrics present in 18.2.0 do not exist in 18.2.1, and I believe this is why mgr/dashboard cannot find them.
When I compare the metrics from Prometheus between the two versions, I see that many metrics are missing.
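
One way to see exactly which families disappeared is to diff the metric names from the two `/metrics` dumps. A rough sketch (the parsing below is a simplification of the Prometheus exposition format, not a full parser, and the sample dumps are abridged stand-ins for the curl outputs above):

```python
def metric_names(exposition_text: str) -> set[str]:
    """Collect metric family names from Prometheus exposition-format text,
    skipping comment lines (# HELP / # TYPE) and blank lines."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The name ends at the first '{' (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names

# Abridged sample dumps standing in for the two curl outputs:
v18_2_0 = 'ceph_osd_numpg{ceph_daemon="osd.0"} 78.0\nceph_health_status 0.0\n'
v18_2_1 = "ceph_health_status 0.0\n"
print(sorted(metric_names(v18_2_0) - metric_names(v18_2_1)))  # ['ceph_osd_numpg']
```

Running this over the full dumps from both clusters would list every missing family, not just `ceph_osd_numpg`.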

Actions #3

Updated by Avan Thakkar 4 months ago

Morteza Bashsiz wrote:

Nizamudeen A wrote:

The Cluster Utilization panel gets its metrics from Prometheus, so you'll need to have Prometheus configured in order to view them. If you have Prometheus running, can you check whether PROMETHEUS_API_HOST is set to the URL of Prometheus?

ceph dashboard get-prometheus-api-host
ceph dashboard set-prometheus-api-host 'http://localhost:9090'

Thanks for your answer.
The mgr/prometheus module is enabled, the same as before (we use the Prometheus endpoint inside the manager). We have never deployed the node-exporter or Prometheus services.
Here are the services running in both clusters (18.2.0 vs. 18.2.1); osd is excluded for brevity:

[...]

[...]

The problem is that the mgr/prometheus metrics differ in 18.2.1: some metrics that exist in 18.2.0 do not exist in 18.2.1.
For example, note the following metric:

[...]

[...]

As you can see, some metrics present in 18.2.0 do not exist in 18.2.1, and I believe this is why mgr/dashboard cannot find them.
When I compare the metrics from Prometheus between the two versions, I see that many metrics are missing.

It seems you're missing the perf-counter metrics that were previously exposed by mgr/prometheus by default. Can you check whether you have ceph-exporter daemons running (one per node, default port 9926)? Those are now responsible for exposing the perf-counter metrics, so this option is disabled for mgr/prometheus by default. If you wish to enable it again, you can do so as described in the docs: https://docs.ceph.com/en/latest/mgr/prometheus/#ceph-daemon-performance-counters-metrics
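
For reference, the re-enable path from the linked documentation amounts to flipping the mgr/prometheus option (the same option that appears later in this thread; check the current value before changing it):

```shell
# Check whether perf counters are currently excluded from mgr/prometheus
ceph config get mgr mgr/prometheus/exclude_perf_counters
# Re-enable them the old way, per the linked docs
ceph config set mgr mgr/prometheus/exclude_perf_counters false
```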

Actions #4

Updated by Morteza Bashsiz 4 months ago

Avan Thakkar wrote:

Morteza Bashsiz wrote:

Nizamudeen A wrote:

The Cluster Utilization panel gets its metrics from Prometheus, so you'll need to have Prometheus configured in order to view them. If you have Prometheus running, can you check whether PROMETHEUS_API_HOST is set to the URL of Prometheus?

ceph dashboard get-prometheus-api-host
ceph dashboard set-prometheus-api-host 'http://localhost:9090'

Thanks for your answer.
The mgr/prometheus module is enabled, the same as before (we use the Prometheus endpoint inside the manager). We have never deployed the node-exporter or Prometheus services.
Here are the services running in both clusters (18.2.0 vs. 18.2.1); osd is excluded for brevity:

[...]

[...]

The problem is that the mgr/prometheus metrics differ in 18.2.1: some metrics that exist in 18.2.0 do not exist in 18.2.1.
For example, note the following metric:

[...]

[...]

As you can see, some metrics present in 18.2.0 do not exist in 18.2.1, and I believe this is why mgr/dashboard cannot find them.
When I compare the metrics from Prometheus between the two versions, I see that many metrics are missing.

It seems you're missing the perf-counter metrics that were previously exposed by mgr/prometheus by default. Can you check whether you have ceph-exporter daemons running (one per node, default port 9926)? Those are now responsible for exposing the perf-counter metrics, so this option is disabled for mgr/prometheus by default. If you wish to enable it again, you can do so as described in the docs: https://docs.ceph.com/en/latest/mgr/prometheus/#ceph-daemon-performance-counters-metrics

Thanks a ton for the information.
I have checked both ways.
First I tried the exclude_perf_counters option, checking whether the metrics appear; unfortunately, I did not see the metrics in mgr/prometheus even after restarting the mgr service:

root@cluster:/# ceph -v 
ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
root@cluster:/# ceph config get mgr mgr/prometheus/exclude_perf_counters
true
root@cluster:/# curl -s -XGET http://127.0.0.1:9283/metrics | grep ceph_osd_numpg
root@cluster:/#

Then I applied ceph-exporter on all nodes. As you can see below, the metrics appeared (each node exposes its own):

root@ybk140901:/# ceph orch apply ceph-exporter '*'
Scheduled ceph-exporter update...
root@ybk140901:/# ceph orch ps | grep -i exporter
ceph-exporter.node1   node1                    running (11s)     2s ago  11s    7818k        -  18.2.1   d2cdd87030d1  50eb2ab7c3b2  
ceph-exporter.node2   node2                    running (7s)      2s ago   7s    7680k        -  18.2.1   d2cdd87030d1  5f4df829fab5  
ceph-exporter.node3   node3                    running (5s)      1s ago   5s    7810k        -  18.2.1   d2cdd87030d1  552a52fd6fd2  
ceph-exporter.node4   node4                    running (9s)      2s ago   9s    7578k        -  18.2.1   d2cdd87030d1  af7e3e0a27f9  
root@ybk140901:/# curl -s -XGET http://127.0.0.1:9926/metrics | grep ceph_osd_numpg
# HELP ceph_osd_numpg Placement groups
# TYPE ceph_osd_numpg gauge
ceph_osd_numpg{ceph_daemon="osd.0"} 78
ceph_osd_numpg{ceph_daemon="osd.1"} 80
ceph_osd_numpg{ceph_daemon="osd.2"} 145
root@cluster:/# ceph -s | grep pgs
    pools:   12 pools, 401 pgs
    pgs:     401 active+clean
root@cluster:/#
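
As a rough cross-check, each OSD counts every PG it hosts a replica of, so summing `ceph_osd_numpg` over all OSDs should come to roughly the `pgs` total times the replication factor. A small sketch that sums the gauge samples from a `/metrics` dump (simplified exposition-format parsing; the sample values are the three OSDs shown above):

```python
def sum_gauge(exposition_text: str, metric: str) -> float:
    """Sum all samples of one gauge metric from Prometheus exposition text."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The family name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name == metric:
            total += float(line.rsplit(" ", 1)[1])
    return total

dump = (
    'ceph_osd_numpg{ceph_daemon="osd.0"} 78\n'
    'ceph_osd_numpg{ceph_daemon="osd.1"} 80\n'
    'ceph_osd_numpg{ceph_daemon="osd.2"} 145\n'
)
print(sum_gauge(dump, "ceph_osd_numpg"))  # 303.0
```

Run against the full per-node exporter dumps, the grand total should land near 401 PGs multiplied by the pool replication size.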

On the other hand, all of them are visible in mgr/prometheus

root@cluster:/# curl -s -XGET http://127.0.0.1:9283/metrics | grep ceph_osd_numpg
# HELP ceph_osd_numpg Placement groups
# TYPE ceph_osd_numpg gauge
ceph_osd_numpg{ceph_daemon="osd.0"} 78.0
ceph_osd_numpg{ceph_daemon="osd.1"} 80.0
ceph_osd_numpg{ceph_daemon="osd.2"} 145.0
ceph_osd_numpg{ceph_daemon="osd.3"} 82.0
ceph_osd_numpg{ceph_daemon="osd.4"} 88.0
ceph_osd_numpg{ceph_daemon="osd.5"} 150.0
ceph_osd_numpg{ceph_daemon="osd.6"} 81.0
ceph_osd_numpg{ceph_daemon="osd.7"} 84.0
ceph_osd_numpg{ceph_daemon="osd.8"} 106.0
ceph_osd_numpg{ceph_daemon="osd.9"} 111.0
ceph_osd_numpg{ceph_daemon="osd.10"} 82.0
ceph_osd_numpg{ceph_daemon="osd.11"} 83.0
ceph_osd_numpg{ceph_daemon="osd.12"} 161.0

Thanks a lot for your help.
I think we need to enable the ceph-exporter service in our configuration.
One more question, if you could help me find the answer:
What was the reason the metrics were excluded by default in mgr/prometheus?

Actions #5

Updated by Morteza Bashsiz 4 months ago

Update:
Sorry for the misunderstanding.
With exclude_perf_counters: false and no ceph-exporter, the metrics are fine.

Actions #6

Updated by Jan Horacek 3 months ago

Hit by this too. Thank you for both of the mentioned ways to fix this.

I think this should be in the release notes. Not just a note on ceph-exporter: since the automatic upgrade process does not switch this automagically, the release notes could mention post-upgrade tasks like this (option 1: deploy ceph-exporter on each node; option 2: re-enable the perf metrics the old way by setting the boolean mentioned above).

A post-upgrade section is already present in the release notes, so it is just a matter of mentioning this there.
