Bug #53190 (open): counter num_read_kb is going down

Added by Patrick Seidensal over 2 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: -
Tags: monitoring
Backport: octopus, pacific
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(RADOS): -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

Description of problem

Monitoring reported an unreasonably high read metric value (28.76 TB/s).

This is due to Ceph reporting a `num_read_kb` value that decreased between scrapes. Prometheus treats a decrease as a counter reset, i.e. it assumes the counter restarted from zero, so the entire new value is counted as an increase over a single scrape interval, which produces the reported unreasonably high rate (a sketch of this arithmetic follows the pg dump excerpt below).

We've been able to verify that this is not an issue in the mgr/prometheus module but a value that comes from Ceph itself; however, we do not know how to reproduce it.

pg-dump.2021-09-14T18:26:50+01:00 716138503663,
pg-dump.2021-09-14T18:27:03+01:00 716138539210,
pg-dump.2021-09-14T18:27:16+01:00 716138564623,
pg-dump.2021-09-14T18:27:28+01:00 716137750423, <- 1631640448 (epoch)
pg-dump.2021-09-14T18:27:41+01:00 716137808867,
pg-dump.2021-09-14T18:27:53+01:00 716137862127,
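For illustration, here is a minimal Python sketch of how a Prometheus-style rate calculation turns the decrease above into the reported value. The numbers are the `num_read_kb` samples from the pg dumps; the ~25 s window is an assumed scrape interval, not a measured one. When the counter drops, the full new value (~716 TB) is counted as fresh increase over that window.

# Prometheus-style increase()/rate() over the two samples where the counter drops.
# Values are the num_read_kb readings from the pg dumps above; the 25 s window
# is an assumed scrape interval, not a measured one.

samples = [
    (0.0,  716138564623),   # previous scrape (KB)
    (25.0, 716137750423),   # next scrape: the value went down
]

def increase(samples):
    # Sum of positive deltas; a drop is treated as a counter reset, so the
    # entire new value is counted as new increase since the "reset".
    total = 0
    prev = samples[0][1]
    for _, value in samples[1:]:
        total += value if value < prev else value - prev
        prev = value
    return total

window = samples[-1][0] - samples[0][0]
kb_per_second = increase(samples) / window
print(f"apparent read rate: {kb_per_second / 1e9:.2f} TB/s")   # ~28.6 TB/s

The exact figure depends on the real scrape interval, but the order of magnitude matches the 28.76 TB/s seen in monitoring.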

Environment

  • ceph version string: Octopus

How reproducible

No reproducer available at this point.

Actual results

The counter decreased.

Expected results

The counter only ever increases.

Additional info

We have been able to see this issue repeatedly. However, we unfortunately do not know how to reproduce it, and we currently do not have access to the cluster that has been producing these values.
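Since no reproducer is available, one way to catch the event on an affected cluster is to poll the cluster-wide counter and log any decrease. The sketch below is hypothetical: it assumes `ceph pg dump --format json` is available on the host and that the aggregated counter is reachable under pg_map -> pg_stats_sum -> stat_sum -> num_read_kb (the exact JSON layout may differ between releases).

import json, subprocess, time

def read_kb_total():
    # Assumed JSON path; adjust for the release in use.
    out = subprocess.check_output(["ceph", "pg", "dump", "--format", "json"])
    dump = json.loads(out)
    pg_map = dump.get("pg_map", dump)   # some releases nest the stats under pg_map
    return pg_map["pg_stats_sum"]["stat_sum"]["num_read_kb"]

prev = read_kb_total()
while True:
    time.sleep(10)
    cur = read_kb_total()
    if cur < prev:
        print(f"num_read_kb went down: {prev} -> {cur} (delta {cur - prev})")
    prev = cur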

#1 - Updated by Josh Durgin over 2 years ago

It seems possible for this to occur for many such counters in a distributed system like Ceph, where these values are not treated as monotonic. Is there a way to report these to Prometheus that accepts decreasing values?
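One option the question points at is exposing such values as a Gauge rather than a Counter, since a Gauge accepts decreasing values (at the cost of losing rate()/increase() semantics). The following is a hypothetical sketch using the prometheus_client Python library, not the actual mgr/prometheus code, and with illustrative metric names:

from prometheus_client import Counter, Gauge, start_http_server

# A Counter must be monotonic; PromQL's rate()/increase() treat any decrease
# as a reset, which is exactly what produced the 28.76 TB/s spike.
reads_counter = Counter("ceph_pg_num_read_kb_total", "Cumulative KB read (illustrative)")

# A Gauge accepts arbitrary ups and downs, but dashboards can no longer use
# rate()/increase() on it and would graph the raw value (or use delta()).
reads_gauge = Gauge("ceph_pg_num_read_kb", "Reported KB read (illustrative)")

if __name__ == "__main__":
    start_http_server(8000)          # expose /metrics on port 8000
    reads_gauge.set(716138564623)
    reads_gauge.set(716137750423)    # a decrease is fine on a Gauge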
