Bug #51131
prometheus stats missing since upgrade to octopus 15.2.13
Description
I recently upgraded one of my clusters from nautilus 14.2.21 on Ubuntu to octopus 15.2.13. Since then
I no longer get Prometheus metrics for some of the ceph_pg_* counters. A curl http://mgr:9283/metrics shows this for the missing data (some lines above and below kept for context):
ceph_pg_clean{pool_id="9"} 64.0
# HELP ceph_pg_down PG down per pool
# TYPE ceph_pg_down gauge
# HELP ceph_pg_recovery_unfound PG recovery_unfound per pool
# TYPE ceph_pg_recovery_unfound gauge
# HELP ceph_pg_backfill_unfound PG backfill_unfound per pool
# TYPE ceph_pg_backfill_unfound gauge
# HELP ceph_pg_scrubbing PG scrubbing per pool
# TYPE ceph_pg_scrubbing gauge
# HELP ceph_pg_degraded PG degraded per pool
# TYPE ceph_pg_degraded gauge
# HELP ceph_pg_inconsistent PG inconsistent per pool
# TYPE ceph_pg_inconsistent gauge
# HELP ceph_pg_peering PG peering per pool
# TYPE ceph_pg_peering gauge
# HELP ceph_pg_repair PG repair per pool
# TYPE ceph_pg_repair gauge
# HELP ceph_pg_recovering PG recovering per pool
# TYPE ceph_pg_recovering gauge
# HELP ceph_pg_forced_recovery PG forced_recovery per pool
# TYPE ceph_pg_forced_recovery gauge
# HELP ceph_pg_backfill_wait PG backfill_wait per pool
# TYPE ceph_pg_backfill_wait gauge
# HELP ceph_pg_incomplete PG incomplete per pool
# TYPE ceph_pg_incomplete gauge
# HELP ceph_pg_stale PG stale per pool
# TYPE ceph_pg_stale gauge
# HELP ceph_pg_remapped PG remapped per pool
# TYPE ceph_pg_remapped gauge
# HELP ceph_pg_deep PG deep per pool
# TYPE ceph_pg_deep gauge
# HELP ceph_pg_backfilling PG backfilling per pool
# TYPE ceph_pg_backfilling gauge
# HELP ceph_pg_forced_backfill PG forced_backfill per pool
# TYPE ceph_pg_forced_backfill gauge
# HELP ceph_pg_backfill_toofull PG backfill_toofull per pool
# TYPE ceph_pg_backfill_toofull gauge
# HELP ceph_pg_recovery_wait PG recovery_wait per pool
# TYPE ceph_pg_recovery_wait gauge
# HELP ceph_pg_recovery_toofull PG recovery_toofull per pool
# TYPE ceph_pg_recovery_toofull gauge
# HELP ceph_pg_undersized PG undersized per pool
# TYPE ceph_pg_undersized gauge
# HELP ceph_pg_activating PG activating per pool
# TYPE ceph_pg_activating gauge
# HELP ceph_pg_peered PG peered per pool
# TYPE ceph_pg_peered gauge
# HELP ceph_pg_snaptrim PG snaptrim per pool
# TYPE ceph_pg_snaptrim gauge
# HELP ceph_pg_snaptrim_wait PG snaptrim_wait per pool
# TYPE ceph_pg_snaptrim_wait gauge
# HELP ceph_pg_snaptrim_error PG snaptrim_error per pool
# TYPE ceph_pg_snaptrim_error gauge
# HELP ceph_pg_creating PG creating per pool
# TYPE ceph_pg_creating gauge
# HELP ceph_pg_unknown PG unknown per pool
# TYPE ceph_pg_unknown gauge
# HELP ceph_pg_premerge PG premerge per pool
# TYPE ceph_pg_premerge gauge
# HELP ceph_pg_failed_repair PG failed_repair per pool
# TYPE ceph_pg_failed_repair gauge
# HELP ceph_pg_laggy PG laggy per pool
# TYPE ceph_pg_laggy gauge
# HELP ceph_pg_wait PG wait per pool
# TYPE ceph_pg_wait gauge
# HELP ceph_cluster_total_bytes DF total_bytes
# TYPE ceph_cluster_total_bytes gauge
ceph_cluster_total_bytes 232986825752576.0
# HELP ceph_cluster_total_used_bytes DF total_used_bytes
# TYPE ceph_cluster_total_used_bytes gauge
ceph_cluster_total_used_bytes 9457124081664.0
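The symptom in the scrape above is that the ceph_pg_* families are still declared via HELP/TYPE lines but carry no sample lines at all. As a minimal sketch (not part of Ceph; the helper name is made up), the missing families can be listed mechanically from a scrape:

```python
# Hypothetical helper: scan a Prometheus scrape and report metric families
# that are declared via "# HELP"/"# TYPE" but have no sample lines -- the
# symptom described in this report for the ceph_pg_* gauges.
import re

def families_without_samples(scrape_text):
    declared, sampled = set(), set()
    for line in scrape_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# HELP") or line.startswith("# TYPE"):
            declared.add(line.split()[2])  # third token is the family name
        elif not line.startswith("#"):
            # Sample line: name{labels} value -> cut at '{' or whitespace.
            sampled.add(re.split(r"[{\s]", line, maxsplit=1)[0])
    return sorted(declared - sampled)

if __name__ == "__main__":
    # In practice the text would come from urllib.request.urlopen("http://mgr:9283/metrics").
    scrape = """\
# HELP ceph_pg_clean PG clean per pool
# TYPE ceph_pg_clean gauge
ceph_pg_clean{pool_id="9"} 64.0
# HELP ceph_pg_down PG down per pool
# TYPE ceph_pg_down gauge
"""
    print(families_without_samples(scrape))  # -> ['ceph_pg_down']
```

Run against the full scrape, this prints every ceph_pg_* family that the exporter declares but never populates.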
Updated by Loïc Dachary almost 3 years ago
- Target version deleted (v15.2.13)
- Affected Versions v15.2.13 added
Updated by Neha Ojha almost 3 years ago
- Project changed from Ceph to mgr
- Category set to prometheus module
Updated by Peter Razumovsky over 2 years ago
Any progress? Same for me on ceph v15.2.13.
Updated by Neha Ojha over 2 years ago
- Assignee set to Paul Cuzner
Hey Paul, I am assigning this to you, in case you have any ideas on what's going on here.
Updated by Paul Cuzner about 2 years ago
I saw this in pacific too. I think zero values are no longer emitted, e.g. if there isn't any peering going on, pg_peering will not be seen, but as soon as there is, it's present. It doesn't present a problem for alerts AFAIK.
Updated by Peter Razumovsky about 2 years ago
We are still facing this issue on v15.2.13 with our pre-defined alerts:
----------------------------- Captured stdout call -----------------------------
[INFO]: Checking metric/expression "sum by(rook_cluster, name) (ceph_pg_inconsistent * on(pool_id) group_right() ceph_pool_metadata) <= 0"
[WARNING]: Metric/expression "sum by(rook_cluster, name) (ceph_pg_inconsistent * on(pool_id) group_right() ceph_pool_metadata) <= 0" not found
[INFO]: Checking that CephPGInconsistent alert is firing
(the same output is repeated for two more test calls)
Updated by Peter Razumovsky almost 2 years ago
We are still facing this issue. Any updates?