Bug #24379
MGR not reporting metrics when OSDs are going down
Status: Closed
% Done: 0%
Source: Community (user)
Regression: No
Severity: 3 - minor
Description
The original messages can be found here: https://rook-io.slack.com/archives/C46Q5UC05/p1527829950000056
christian.huening [11 hours ago]
We migrated our Rook Ceph cluster over to a new network, so we had to take nodes out of the cluster and back in again one by one. While doing so the cluster would go into `HEALTH_WARN` mode as expected. However, during those phases the `/metrics` endpoint of the `ceph-mgr` stopped working and we didn't get any metrics out of it. Did anyone see the same behavior? Is that a known issue?

Alexander Trost [9 hours ago]
I don't think that is known yet. It could be that the MGR times out (probably while getting OSD metrics), depending on your scrape timeout. The prometheus-operator manifest uses a 5s interval and timeout by "default", so that could be the culprit. [...]

christian.huening [7 hours ago]
I tried to hit the mgr with just a curl request and a much longer timeout. Nothing happened.

Alexander Trost [7 hours ago]
Hmm, then this really seems like a bug.

christian.huening [7 hours ago]
Especially since monitoring came back to life as soon as the cluster became healthy again.

Alexander Trost [7 hours ago]
Was there anything in the MGR logs that shows, say, "getting metrics from OSDs" and then failing/timing out?

christian.huening [6 hours ago]
Most of it looks like this:
```
2018-05-31 22:15:26.117945 I | ceph-mgr: 2018-05-31 22:15:26.117798 7f0bc1410700 1 mgr send_beacon active
2018-05-31 22:15:28.131018 I | ceph-mgr: 2018-05-31 22:15:28.130879 7f0bc1410700 1 mgr send_beacon active
2018-05-31 22:15:30.132571 I | ceph-mgr: ::ffff:10.200.33.67 - - [31/May/2018:22:15:30] "GET /metrics HTTP/1.1" 200 123082 "" "Prometheus/2.0.0"
2018-05-31 22:15:30.142292 I | ceph-mgr: ::ffff:10.200.10.89 - - [31/May/2018:22:15:30] "GET /metrics HTTP/1.1" 200 123082 "" "Prometheus/2.0.0"
```
which actually looks like it's working.

christian.huening [6 hours ago]
Before that, on the same day, I had some of those:
```
2018-05-31 12:35:24.957177 I | ceph-mgr: 2018-05-31 12:35:24.957054 7f0bb892f700 1 mgr[prometheus] skipping pg in unknown state backfill_wait
2018-05-31 12:35:24.957279 I | ceph-mgr: 2018-05-31 12:35:24.957258 7f0bb892f700 1 mgr[prometheus] skipping pg in unknown state backfilling
2018-05-31 12:35:24.961047 I | ceph-mgr: 2018-05-31 12:35:24.960949 7f0bb9130700 1 mgr[prometheus] skipping pg in unknown state backfill_wait
2018-05-31 12:35:24.961059 I | ceph-mgr: 2018-05-31 12:35:24.960998 7f0bb9130700 1 mgr[prometheus] skipping pg in unknown state backfilling
2018-05-31 12:35:25.147606 I | ceph-mgr: ::ffff:10.200.10.89 - - [31/May/2018:12:35:25] "GET /metrics HTTP/1.1" 200 122793 "" "Prometheus/2.0.0"
2018-05-31 12:35:25.156733 I | ceph-mgr: ::ffff:10.200.33.67 - - [31/May/2018:12:35:25] "GET /metrics HTTP/1.1" 200 122793 "" "Prometheus/2.0.0"
2018-05-31 12:35:25.688217 I | ceph-mgr: 2018-05-31 12:35:25.688013 7f0bc1410700 1 mgr send_beacon active
2018-05-31 12:35:27.705530 I | ceph-mgr: 2018-05-31 12:35:27.705362 7f0bc1410700 1 mgr send_beacon active
2018-05-31 12:35:29.707236 I | ceph-mgr: 2018-05-31 12:35:29.707085 7f0bc1410700 1 mgr send_beacon active
2018-05-31 12:35:29.956050 I | ceph-mgr: 2018-05-31 12:35:29.955939 7f0bb892f700 1 mgr[prometheus] skipping pg in unknown state backfill_wait
2018-05-31 12:35:29.956083 I | ceph-mgr: 2018-05-31 12:35:29.956022 7f0bb892f700 1 mgr[prometheus] skipping pg in unknown state backfilling
2018-05-31 12:35:29.959043 I | ceph-mgr: 2018-05-31 12:35:29.958979 7f0bb9130700 1 mgr[prometheus] skipping pg in unknown state backfill_wait
2018-05-31 12:35:29.959087 I | ceph-mgr: 2018-05-31 12:35:29.959025 7f0bb9130700 1 mgr[prometheus] skipping pg in unknown state backfilling
2018-05-31 12:35:30.124190 I | ceph-mgr: ::ffff:10.200.33.67 - - [31/May/2018:12:35:30] "GET /metrics HTTP/1.1" 200 122793 "" "Prometheus/2.0.0"
2018-05-31 12:35:30.136347 I | ceph-mgr: ::ffff:10.200.10.89 - - [31/May/2018:12:35:30] "GET /metrics HTTP/1.1" 200 122793 "" "Prometheus/2.0.0"
```
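For reference, the manual check Christian describes can be reproduced against the mgr's prometheus endpoint directly. A minimal sketch, assuming the module's default listen port 9283; the hostname is a placeholder for your environment:

```
# Scrape the mgr's prometheus endpoint with a timeout well above
# prometheus-operator's 5s default, to rule out scrape-timeout issues.
# <mgr-host> is a placeholder; 9283 is the prometheus module's default port.
curl --max-time 60 -sv http://<mgr-host>:9283/metrics | head
```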
The MGR is running inside a Kubernetes pod via Rook.io (v0.7.1), with the following Ceph version:
```
# ceph -v
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
```
Please let Christian and me know if you need more information!
Updated by Jan Fajerski over 5 years ago
Sorry for not seeing this.
Did this cluster have multiple mgr daemons? Did this maybe cause an active mgr failover? The prometheus module on a standby mgr simply answers with an empty response; the assumption is that Prometheus scrapes all mgrs at the same time.
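One way to test that hypothesis is to compare what the active and standby daemons return. A sketch, again assuming the prometheus module's default port 9283 and placeholder hostnames:

```
# Show which mgr daemon is active and which are standbys.
ceph mgr dump | grep -E '"active_name"|"standbys"'

# A standby's prometheus module answers /metrics with an empty body,
# so the response sizes reveal which daemon a scrape actually reached.
curl -s http://<active-mgr>:9283/metrics | wc -c    # expect a non-zero byte count
curl -s http://<standby-mgr>:9283/metrics | wc -c   # expect ~0 bytes
```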
Updated by Jan Fajerski over 5 years ago
- Status changed from New to Closed
Closing due to age. Feel free to re-open if necessary.