Bug #24379 (Closed)

MGR not reporting metrics when OSDs are going down

Added by Alexander Trost almost 6 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: prometheus module
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

The original messages can be found here: https://rook-io.slack.com/archives/C46Q5UC05/p1527829950000056

christian.huening [11 hours ago]
We migrated our Rook Ceph cluster over to a new network, so we had to take nodes out of the cluster and back in again one by one. While doing so, the cluster would go into `HEALTH_WARN` as expected. However, during those phases the `/metrics` endpoint of the `ceph-mgr` stopped working and we didn't get any metrics out of it. Has anyone seen the same behavior? Is this a known issue?

Alexander Trost [9 hours ago]
I don't think that is a known issue yet.
It could be that the MGR times out (probably while gathering OSD metrics), depending on your scrape timeout.
The prometheus-operator manifest uses a 5s scrape interval and timeout by "default", so that could be the culprit.
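
If the default 5s scrape settings are indeed the culprit, they could be raised on the prometheus-operator ServiceMonitor that scrapes the mgr. A minimal sketch, assuming a `monitoring` namespace and a `rook-ceph-mgr` ServiceMonitor name (neither taken from this cluster):
```
# Sketch: raise the scrape interval/timeout on the ServiceMonitor that targets
# the ceph-mgr. The namespace and object name below are assumptions; adjust
# them to match your deployment.
kubectl -n monitoring patch servicemonitor rook-ceph-mgr --type=json -p='[
  {"op": "replace", "path": "/spec/endpoints/0/interval",      "value": "30s"},
  {"op": "replace", "path": "/spec/endpoints/0/scrapeTimeout", "value": "25s"}
]'
```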

[...]

christian.huening [7 hours ago]
I tried to hit the mgr directly with a curl request and a much longer timeout; nothing happened.
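
A direct check like the one described could look roughly as follows; 9283 is the prometheus module's default port, and the pod IP is a placeholder:
```
# Sketch: query the mgr prometheus endpoint directly with a generous timeout.
# Replace <mgr-pod-ip> with the actual address of the active mgr pod.
curl --max-time 60 http://<mgr-pod-ip>:9283/metrics | head -n 20
```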

Alexander Trost [7 hours ago]
Hmm, then this really does seem like a bug.

christian.huening [7 hours ago]
Especially since monitoring came back to life as soon as the cluster became healthy again.

Alexander Trost [7 hours ago]
Was there anything in the MGR logs that shows, let's say, "getting metrics from OSDs" and then failing/timing out?

christian.huening [6 hours ago]
most of it looks like this:
```
2018-05-31 22:15:26.117945 I | ceph-mgr: 2018-05-31 22:15:26.117798 7f0bc1410700  1 mgr send_beacon active
2018-05-31 22:15:28.131018 I | ceph-mgr: 2018-05-31 22:15:28.130879 7f0bc1410700  1 mgr send_beacon active
2018-05-31 22:15:30.132571 I | ceph-mgr: ::ffff:10.200.33.67 - - [31/May/2018:22:15:30] "GET /metrics HTTP/1.1" 200 123082 "" "Prometheus/2.0.0" 
2018-05-31 22:15:30.142292 I | ceph-mgr: ::ffff:10.200.10.89 - - [31/May/2018:22:15:30] "GET /metrics HTTP/1.1" 200 123082 "" "Prometheus/2.0.0"
```

which actually looks like it’s working

christian.huening [6 hours ago]
Before that, on the same day, I had some of these:
```
2018-05-31 12:35:24.957177 I | ceph-mgr: 2018-05-31 12:35:24.957054 7f0bb892f700  1 mgr[prometheus] skipping pg in unknown state backfill_wait
2018-05-31 12:35:24.957279 I | ceph-mgr: 2018-05-31 12:35:24.957258 7f0bb892f700  1 mgr[prometheus] skipping pg in unknown state backfilling
2018-05-31 12:35:24.961047 I | ceph-mgr: 2018-05-31 12:35:24.960949 7f0bb9130700  1 mgr[prometheus] skipping pg in unknown state backfill_wait
2018-05-31 12:35:24.961059 I | ceph-mgr: 2018-05-31 12:35:24.960998 7f0bb9130700  1 mgr[prometheus] skipping pg in unknown state backfilling
2018-05-31 12:35:25.147606 I | ceph-mgr: ::ffff:10.200.10.89 - - [31/May/2018:12:35:25] "GET /metrics HTTP/1.1" 200 122793 "" "Prometheus/2.0.0" 
2018-05-31 12:35:25.156733 I | ceph-mgr: ::ffff:10.200.33.67 - - [31/May/2018:12:35:25] "GET /metrics HTTP/1.1" 200 122793 "" "Prometheus/2.0.0" 
2018-05-31 12:35:25.688217 I | ceph-mgr: 2018-05-31 12:35:25.688013 7f0bc1410700  1 mgr send_beacon active
2018-05-31 12:35:27.705530 I | ceph-mgr: 2018-05-31 12:35:27.705362 7f0bc1410700  1 mgr send_beacon active
2018-05-31 12:35:29.707236 I | ceph-mgr: 2018-05-31 12:35:29.707085 7f0bc1410700  1 mgr send_beacon active
2018-05-31 12:35:29.956050 I | ceph-mgr: 2018-05-31 12:35:29.955939 7f0bb892f700  1 mgr[prometheus] skipping pg in unknown state backfill_wait
2018-05-31 12:35:29.956083 I | ceph-mgr: 2018-05-31 12:35:29.956022 7f0bb892f700  1 mgr[prometheus] skipping pg in unknown state backfilling
2018-05-31 12:35:29.959043 I | ceph-mgr: 2018-05-31 12:35:29.958979 7f0bb9130700  1 mgr[prometheus] skipping pg in unknown state backfill_wait
2018-05-31 12:35:29.959087 I | ceph-mgr: 2018-05-31 12:35:29.959025 7f0bb9130700  1 mgr[prometheus] skipping pg in unknown state backfilling
2018-05-31 12:35:30.124190 I | ceph-mgr: ::ffff:10.200.33.67 - - [31/May/2018:12:35:30] "GET /metrics HTTP/1.1" 200 122793 "" "Prometheus/2.0.0" 
2018-05-31 12:35:30.136347 I | ceph-mgr: ::ffff:10.200.10.89 - - [31/May/2018:12:35:30] "GET /metrics HTTP/1.1" 200 122793 "" "Prometheus/2.0.0"
```

The MGR is running inside a Kubernetes pod using Rook.io (v0.7.1), which runs the following Ceph version:

```
# ceph -v
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
```

Please let Christian and me know if you need more information!

Actions #1

Updated by Jan Fajerski over 5 years ago

Sorry for not seeing this.

Did this cluster have multiple mgr daemons? Did this maybe cause an active mgr failover? The prometheus module on a standby mgr simply answers with an empty response; the assumption is that Prometheus scrapes all mgrs at the same time.
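
If the cluster did run several mgr daemons, a quick way to check whether a failover lined up with the metric gaps is to look at which daemon is active, for example:
```
# Sketch: check which mgr is currently active and which are standbys.
ceph mgr stat            # shows the currently active mgr and the map epoch
ceph status | grep mgr   # the services section lists the active mgr and standbys
```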

Actions #2

Updated by Jan Fajerski over 5 years ago

  • Assignee set to Jan Fajerski

Actions #3

Updated by Jan Fajerski over 5 years ago

  • Status changed from New to Closed

Closing due to age. Feel free to re-open if necessary.
