Bug #63927
Ceph dashboard: Cluster Utilization metrics do not work (status: Open)
Description
I have upgraded ceph from 18.2.0 to 18.2.1
The Cluster Utilization metrics are not working: all values show N/A, and our monitoring is affected as well.
Updated by Nizamudeen A 4 months ago
The Cluster Utilization panel gets its metrics from Prometheus, so you'll need to have Prometheus configured in order to view them. If you have Prometheus running, can you check whether PROMETHEUS_API_HOST is set to the URL of your Prometheus instance?
ceph dashboard get-prometheus-api-host
ceph dashboard set-prometheus-api-host 'http://localhost:9090'
Updated by Morteza Bashsiz 4 months ago
Nizamudeen A wrote:
The Cluster Utilization panel gets its metrics from Prometheus, so you'll need to have Prometheus configured in order to view them. If you have Prometheus running, can you check whether PROMETHEUS_API_HOST is set to the URL of your Prometheus instance?
ceph dashboard get-prometheus-api-host
ceph dashboard set-prometheus-api-host 'http://localhost:9090'
Thanks for your answer,
The mgr/prometheus module is enabled, same as before (we use the Prometheus exporter built into the manager). We have never configured the node-exporter or Prometheus services.
Here are the services running in both clusters (18.2.0 vs 18.2.1); osd is excluded to keep it short.
root@cluster1:/# ceph -v
ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
root@cluster1:/# ceph orch ls | grep -v osd
NAME PORTS RUNNING REFRESHED AGE PLACEMENT
mgr 2/2 9m ago 6d count:2;label:mon
mon 3/3 9m ago 6d label:mon
root@cluster1:/#
root@cluster2:/# ceph -v
ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
root@cluster2:/# ceph orch ls | grep -v osd
NAME PORTS RUNNING REFRESHED AGE PLACEMENT
mgr 2/2 10m ago 11w count:2;label:mon
mon 5/5 10m ago 11w label:mon
root@cluster2:/#
The problem is that the mgr/prometheus metrics differ in 18.2.1: some metrics that exist in 18.2.0 do not exist in 18.2.1.
For example, note the following metrics:
root@cluster1:/# ceph -v
ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
root@cluster1:/# curl -s -XGET http://127.0.0.1:9283/metrics | grep ceph_osd_numpg | wc -l
30
root@cluster1:/#
root@cluster2:/# ceph -v
ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
root@cluster2:/# curl -s -XGET http://127.0.0.1:9283/metrics | grep ceph_osd_numpg | wc -l
0
root@cluster2:/#
As you can see, some metrics present in 18.2.0 do not exist in 18.2.1, and I believe this is why mgr/dashboard is not able to detect them.
When I compare the Prometheus metrics between the two versions, I see that many metrics are missing.
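A quick way to enumerate exactly which metric names disappeared between the two releases (a sketch, not from the thread; the filenames and the tiny sample lines are hypothetical stand-ins for real `/metrics` dumps saved from each cluster):

```shell
# Save the exposition output from each cluster first, e.g.:
#   curl -s http://127.0.0.1:9283/metrics > metrics-18.2.0.txt   # on cluster1
#   curl -s http://127.0.0.1:9283/metrics > metrics-18.2.1.txt   # on cluster2
# Sample data stands in for the real dumps here:
printf 'ceph_osd_numpg{ceph_daemon="osd.0"} 78.0\nceph_health_status 0.0\n' > metrics-18.2.0.txt
printf 'ceph_health_status 0.0\n' > metrics-18.2.1.txt

# Metric names present in 18.2.0 but missing from 18.2.1:
# drop comment lines, strip labels/values, then compare the sorted name sets
comm -23 \
  <(grep -v '^#' metrics-18.2.0.txt | sed 's/[{ ].*//' | sort -u) \
  <(grep -v '^#' metrics-18.2.1.txt | sed 's/[{ ].*//' | sort -u)
# prints: ceph_osd_numpg
```

With the real dumps in place of the sample `printf` lines, this lists every metric family that went away in the upgrade.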
Updated by Avan Thakkar 4 months ago
Morteza Bashsiz wrote:
Nizamudeen A wrote:
The Cluster Utilization panel gets its metrics from Prometheus, so you'll need to have Prometheus configured in order to view them. If you have Prometheus running, can you check whether PROMETHEUS_API_HOST is set to the URL of your Prometheus instance?
ceph dashboard get-prometheus-api-host
ceph dashboard set-prometheus-api-host 'http://localhost:9090'
Thanks for your answer,
The mgr/prometheus module is enabled same as past (we use the Prometheus inside manager). We never configured node-exporter and Prometheus services.
Here you see the services which are used in both clusters (18.2.0 vs 18.2.1) (osd is excluded for making it shorter)[...]
[...]
The problem is that the mgr/prometheus metrics differ in 18.2.1: some metrics that exist in 18.2.0 do not exist in 18.2.1.
For example, note the following metrics: [...]
[...]
As you can see, some metrics present in 18.2.0 do not exist in 18.2.1, and I believe this is why mgr/dashboard is not able to detect them.
When I compare the Prometheus metrics between the two versions, I see that many metrics are missing.
It seems you're missing the perf-counter metrics that mgr/prometheus used to expose by default. Can you check whether you have ceph-exporter daemons running (one per node, default port 9926)? Those are now responsible for exposing the perf-counter metrics, so this option has been disabled in mgr/prometheus by default. If you wish to enable it again, you can do so as described in the docs: https://docs.ceph.com/en/latest/mgr/prometheus/#ceph-daemon-performance-counters-metrics
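For anyone else hitting this, the two remediation paths boil down to the commands below (a sketch; option 1 assumes a cephadm-managed cluster, and the config name matches the one queried later in this thread):

```shell
# Option 1: deploy ceph-exporter on every host so perf counters are
# exposed per node on port 9926 (cephadm-managed clusters)
ceph orch apply ceph-exporter '*'

# Option 2: restore the pre-18.2.1 behaviour and let mgr/prometheus
# export the per-daemon perf counters itself on port 9283
ceph config set mgr mgr/prometheus/exclude_perf_counters false
```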
Updated by Morteza Bashsiz 4 months ago
Avan Thakkar wrote:
Morteza Bashsiz wrote:
Nizamudeen A wrote:
The Cluster Utilization panel gets its metrics from Prometheus, so you'll need to have Prometheus configured in order to view them. If you have Prometheus running, can you check whether PROMETHEUS_API_HOST is set to the URL of your Prometheus instance?
ceph dashboard get-prometheus-api-host
ceph dashboard set-prometheus-api-host 'http://localhost:9090'
Thanks for your answer,
The mgr/prometheus module is enabled same as past (we use the Prometheus inside manager). We never configured node-exporter and Prometheus services.
Here you see the services which are used in both clusters (18.2.0 vs 18.2.1) (osd is excluded for making it shorter)[...]
[...]
The problem is that the mgr/prometheus metrics differ in 18.2.1: some metrics that exist in 18.2.0 do not exist in 18.2.1.
For example, note the following metrics: [...]
[...]
As you can see, some metrics present in 18.2.0 do not exist in 18.2.1, and I believe this is why mgr/dashboard is not able to detect them.
When I compare the Prometheus metrics between the two versions, I see that many metrics are missing.
It seems you're missing the perf-counter metrics that mgr/prometheus used to expose by default. Can you check whether you have ceph-exporter daemons running (one per node, default port 9926)? Those are now responsible for exposing the perf-counter metrics, so this option has been disabled in mgr/prometheus by default. If you wish to enable it again, you can do so as described in the docs: https://docs.ceph.com/en/latest/mgr/prometheus/#ceph-daemon-performance-counters-metrics
Thanks a ton for the information.
I have checked both ways.
First I tried the exclude_perf_counters route and checked whether the metrics appear. Unfortunately, I didn't see the metrics in mgr/prometheus even after restarting the mgr service:
root@cluster:/# ceph -v
ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
root@cluster:/# ceph config get mgr mgr/prometheus/exclude_perf_counters
true
root@cluster:/# curl -s -XGET http://127.0.0.1:9283/metrics | grep ceph_osd_numpg
root@cluster:/#
Then I applied ceph-exporter to all nodes. As you can see below, the metrics appeared (each node exposes only its own daemons):
root@ybk140901:/# ceph orch apply ceph-exporter '*'
Scheduled ceph-exporter update...
root@ybk140901:/# ceph orch ps | grep -i exporter
ceph-exporter.node1 node1 running (11s) 2s ago 11s 7818k - 18.2.1 d2cdd87030d1 50eb2ab7c3b2
ceph-exporter.node2 node2 running (7s) 2s ago 7s 7680k - 18.2.1 d2cdd87030d1 5f4df829fab5
ceph-exporter.node3 node3 running (5s) 1s ago 5s 7810k - 18.2.1 d2cdd87030d1 552a52fd6fd2
ceph-exporter.node4 node4 running (9s) 2s ago 9s 7578k - 18.2.1 d2cdd87030d1 af7e3e0a27f9
root@ybk140901:/# curl -s -XGET http://127.0.0.1:9926/metrics | grep ceph_osd_numpg
# HELP ceph_osd_numpg Placement groups
# TYPE ceph_osd_numpg gauge
ceph_osd_numpg{ceph_daemon="osd.0"} 78
ceph_osd_numpg{ceph_daemon="osd.1"} 80
ceph_osd_numpg{ceph_daemon="osd.2"} 145
root@cluster:/# ceph -s | grep pgs
pools: 12 pools, 401 pgs
pgs: 401 active+clean
root@cluster:/#
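As a sanity check (not from the thread): summing the per-daemon values straight out of the exposition format shows why the per-node numbers cannot be compared directly with the cluster-wide PG count; each ceph-exporter only sees its local OSDs, and each PG is counted once per OSD that holds a replica of it. The sample lines stand in for the real curl output from node1 above:

```shell
# Sum ceph_osd_numpg across the daemons in a /metrics dump.
# Sample lines stand in for:
#   curl -s http://127.0.0.1:9926/metrics | grep '^ceph_osd_numpg'
printf '%s\n' \
  'ceph_osd_numpg{ceph_daemon="osd.0"} 78' \
  'ceph_osd_numpg{ceph_daemon="osd.1"} 80' \
  'ceph_osd_numpg{ceph_daemon="osd.2"} 145' \
  | awk '/^ceph_osd_numpg/ {s += $2} END {print s}'
# prints: 303
```

The 303 PG instances on node1's three OSDs are expectedly unrelated to the 401 PGs reported by `ceph -s`, which counts each PG once regardless of replication.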
On the other hand, all of them are visible in mgr/prometheus:
root@cluster:/# curl -s -XGET http://127.0.0.1:9283/metrics | grep ceph_osd_numpg
# HELP ceph_osd_numpg Placement groups
# TYPE ceph_osd_numpg gauge
ceph_osd_numpg{ceph_daemon="osd.0"} 78.0
ceph_osd_numpg{ceph_daemon="osd.1"} 80.0
ceph_osd_numpg{ceph_daemon="osd.2"} 145.0
ceph_osd_numpg{ceph_daemon="osd.3"} 82.0
ceph_osd_numpg{ceph_daemon="osd.4"} 88.0
ceph_osd_numpg{ceph_daemon="osd.5"} 150.0
ceph_osd_numpg{ceph_daemon="osd.6"} 81.0
ceph_osd_numpg{ceph_daemon="osd.7"} 84.0
ceph_osd_numpg{ceph_daemon="osd.8"} 106.0
ceph_osd_numpg{ceph_daemon="osd.9"} 111.0
ceph_osd_numpg{ceph_daemon="osd.10"} 82.0
ceph_osd_numpg{ceph_daemon="osd.11"} 83.0
ceph_osd_numpg{ceph_daemon="osd.12"} 161.0
Thanks a lot for your help.
I think we need to enable the ceph-exporter service in our configuration.
One more question that it would be great if you could help me answer:
Why were these metrics excluded from mgr/prometheus by default?
Updated by Morteza Bashsiz 4 months ago
Update:
Sorry for the misunderstanding.
The metrics are fine with exclude_perf_counters: false and no ceph-exporter.
Updated by Jan Horacek 3 months ago
I was hit by this too. Thank you for both of the fixes mentioned.
I think this should be in the release notes. Not just a note on ceph-exporter: since the automatic upgrade process does not switch this over automagically, the release notes could mention post-upgrade tasks like this (option 1: deploy ceph-exporter on each node; option 2: re-enable the perf metrics the old way by setting the boolean mentioned above).
A post-upgrade section is already present in the release notes, so it is just a matter of mentioning it there.