Bug #24379 (Closed)

MGR not reporting metrics when OSDs are going down

Added by Alexander Trost almost 6 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: prometheus module
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

The original messages can be found here: https://rook-io.slack.com/archives/C46Q5UC05/p1527829950000056

christian.huening [11 hours ago]
We migrated our Rook Ceph cluster over to a new network, so we had to take nodes out of the cluster and back in again one by one. While doing so, the cluster would go into `HEALTH_WARN` as expected. However, during those phases the `/metrics` endpoint of the `ceph-mgr` stopped working and we didn't get any metrics out of it. Has anyone seen the same behavior? Is this a known issue?

Alexander Trost [9 hours ago]
I don't think that is a known issue yet.
It could be that the MGR times out (probably while gathering OSD metrics), depending on your scrape timeout.
The prometheus-operator manifest uses a 5s scrape interval and timeout by "default", so that could be the culprit.
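
If the default 5s scrape settings are indeed the culprit, they could be raised on the prometheus-operator ServiceMonitor that scrapes the mgr. A minimal sketch, assuming a `monitoring` namespace and a `rook-ceph-mgr` ServiceMonitor name (neither taken from this cluster):
```
# Sketch: raise the scrape interval/timeout on the ServiceMonitor that targets
# the ceph-mgr. The namespace and object name below are assumptions; adjust
# them to match your deployment.
kubectl -n monitoring patch servicemonitor rook-ceph-mgr --type=json -p='[
  {"op": "replace", "path": "/spec/endpoints/0/interval",      "value": "30s"},
  {"op": "replace", "path": "/spec/endpoints/0/scrapeTimeout", "value": "25s"}
]'
```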

[...]

christian.huening [7 hours ago]
I tried to hit the mgr directly with a curl request and a much longer timeout; nothing happened.
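
A direct check like the one described could look roughly as follows; 9283 is the prometheus module's default port, and the pod IP is a placeholder:
```
# Sketch: query the mgr prometheus endpoint directly with a generous timeout.
# Replace <mgr-pod-ip> with the actual address of the active mgr pod.
curl --max-time 60 http://<mgr-pod-ip>:9283/metrics | head -n 20
```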

Alexander Trost [7 hours ago]
Hmm, then this really does seem like a bug.

christian.huening [7 hours ago]
Especially since monitoring came back to life as soon as the cluster became healthy again.

Alexander Trost [7 hours ago]
Was there anything in the MGR logs that shows, let's say, "getting metrics from OSDs" and then failing/timing out?

christian.huening [6 hours ago]
most of it looks like this:
```
2018-05-31 22:15:26.117945 I | ceph-mgr: 2018-05-31 22:15:26.117798 7f0bc1410700  1 mgr send_beacon active
2018-05-31 22:15:28.131018 I | ceph-mgr: 2018-05-31 22:15:28.130879 7f0bc1410700  1 mgr send_beacon active
2018-05-31 22:15:30.132571 I | ceph-mgr: ::ffff:10.200.33.67 - - [31/May/2018:22:15:30] "GET /metrics HTTP/1.1" 200 123082 "" "Prometheus/2.0.0" 
2018-05-31 22:15:30.142292 I | ceph-mgr: ::ffff:10.200.10.89 - - [31/May/2018:22:15:30] "GET /metrics HTTP/1.1" 200 123082 "" "Prometheus/2.0.0"
```

which actually looks like it’s working

christian.huening [6 hours ago]
Before that, on the same day, I had some of these:
```
2018-05-31 12:35:24.957177 I | ceph-mgr: 2018-05-31 12:35:24.957054 7f0bb892f700  1 mgr[prometheus] skipping pg in unknown state backfill_wait
2018-05-31 12:35:24.957279 I | ceph-mgr: 2018-05-31 12:35:24.957258 7f0bb892f700  1 mgr[prometheus] skipping pg in unknown state backfilling
2018-05-31 12:35:24.961047 I | ceph-mgr: 2018-05-31 12:35:24.960949 7f0bb9130700  1 mgr[prometheus] skipping pg in unknown state backfill_wait
2018-05-31 12:35:24.961059 I | ceph-mgr: 2018-05-31 12:35:24.960998 7f0bb9130700  1 mgr[prometheus] skipping pg in unknown state backfilling
2018-05-31 12:35:25.147606 I | ceph-mgr: ::ffff:10.200.10.89 - - [31/May/2018:12:35:25] "GET /metrics HTTP/1.1" 200 122793 "" "Prometheus/2.0.0" 
2018-05-31 12:35:25.156733 I | ceph-mgr: ::ffff:10.200.33.67 - - [31/May/2018:12:35:25] "GET /metrics HTTP/1.1" 200 122793 "" "Prometheus/2.0.0" 
2018-05-31 12:35:25.688217 I | ceph-mgr: 2018-05-31 12:35:25.688013 7f0bc1410700  1 mgr send_beacon active
2018-05-31 12:35:27.705530 I | ceph-mgr: 2018-05-31 12:35:27.705362 7f0bc1410700  1 mgr send_beacon active
2018-05-31 12:35:29.707236 I | ceph-mgr: 2018-05-31 12:35:29.707085 7f0bc1410700  1 mgr send_beacon active
2018-05-31 12:35:29.956050 I | ceph-mgr: 2018-05-31 12:35:29.955939 7f0bb892f700  1 mgr[prometheus] skipping pg in unknown state backfill_wait
2018-05-31 12:35:29.956083 I | ceph-mgr: 2018-05-31 12:35:29.956022 7f0bb892f700  1 mgr[prometheus] skipping pg in unknown state backfilling
2018-05-31 12:35:29.959043 I | ceph-mgr: 2018-05-31 12:35:29.958979 7f0bb9130700  1 mgr[prometheus] skipping pg in unknown state backfill_wait
2018-05-31 12:35:29.959087 I | ceph-mgr: 2018-05-31 12:35:29.959025 7f0bb9130700  1 mgr[prometheus] skipping pg in unknown state backfilling
2018-05-31 12:35:30.124190 I | ceph-mgr: ::ffff:10.200.33.67 - - [31/May/2018:12:35:30] "GET /metrics HTTP/1.1" 200 122793 "" "Prometheus/2.0.0" 
2018-05-31 12:35:30.136347 I | ceph-mgr: ::ffff:10.200.10.89 - - [31/May/2018:12:35:30] "GET /metrics HTTP/1.1" 200 122793 "" "Prometheus/2.0.0"
```

The MGR is running inside a Kubernetes pod using Rook.io (v0.7.1), which runs the following Ceph version:

```
# ceph -v
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
```

Please let Christian and me know if you need more information!

Actions #1

Updated by Jan Fajerski over 5 years ago

Sorry for not seeing this.

Did this cluster have multiple mgr daemons? Did this maybe cause an active mgr failover? The prometheus module on a standby mgr simply answers with an empty response; the assumption is that Prometheus scrapes all mgrs at the same time.
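
If the cluster did run several mgr daemons, a quick way to check whether a failover lined up with the metric gaps is to look at which daemon is active, for example:
```
# Sketch: check which mgr is currently active and which are standbys.
ceph mgr stat            # shows the currently active mgr and the map epoch
ceph status | grep mgr   # the services section lists the active mgr and standbys
```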

Actions #2

Updated by Jan Fajerski over 5 years ago

  • Assignee set to Jan Fajerski

Actions #3

Updated by Jan Fajerski over 5 years ago

  • Status changed from New to Closed

Closing due to age. Feel free to re-open if necessary.
