Bug #53190
counter num_read_kb is going down
Description
Description of problem
Monitoring reported an unreasonably high read-rate value (28.76 TB/s).
This happens because Ceph reported a value for `num_read_kb` that had decreased. Prometheus interprets a decrease in a counter as a counter reset, i.e. it assumes the counter restarted from zero, so the entire new value is counted as an increase on top of the previously collected total, producing the unreasonably high rate reported above.
We've been able to verify that this is not an issue in the mgr/prometheus module but a value that comes from Ceph itself; however, we do not know how to reproduce it.
pg-dump.2021-09-14T18:26:50+01:00 716138503663
pg-dump.2021-09-14T18:27:03+01:00 716138539210
pg-dump.2021-09-14T18:27:16+01:00 716138564623
pg-dump.2021-09-14T18:27:28+01:00 716137750423  <- 1631640448 (epoch)
pg-dump.2021-09-14T18:27:41+01:00 716137808867
pg-dump.2021-09-14T18:27:53+01:00 716137862127
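For illustration, the following minimal Python sketch (an assumed model of Prometheus's counter-reset handling, not its actual source) applies that logic to the first four samples above. The dip at 18:27:28 is counted as if the counter had restarted from zero, yielding a rate in the tens of TiB/s:

```python
# Assumed model of Prometheus counter-reset handling, applied to the
# pg-dump samples above. num_read_kb is taken to be in KiB.
samples = [
    (0.0,  716138503663),  # 18:26:50
    (13.0, 716138539210),  # 18:27:03
    (26.0, 716138564623),  # 18:27:16
    (38.0, 716137750423),  # 18:27:28 <- value decreased
]

def increase(series):
    """Sum of deltas, treating any decrease as a counter reset: the new
    value is then counted as growth from zero."""
    total = 0
    prev = series[0][1]
    for _, value in series[1:]:
        total += value - prev if value >= prev else value
        prev = value
    return total

window_s = samples[-1][0] - samples[0][0]
rate = increase(samples) / window_s    # KiB/s
print(f"~{rate / 1024**3:.1f} TiB/s")  # ~17.6 TiB/s
```

The exact figure depends on Prometheus's query window and extrapolation, but any reset-style handling that counts the full ~700 TiB counter value as fresh growth over a few tens of seconds lands in the same order of magnitude as the reported 28.76 TB/s.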
Environment
ceph version string: Octopus
How reproducible
No reproducer available at this point.
Actual results
The num_read_kb counter decreased between successive pg dumps.
Expected results
The counter only ever increases (monotonically non-decreasing).
Additional info
This is an issue we have observed repeatedly. Unfortunately, we do not know how to reproduce it and currently do not have access to the cluster that produced these values.
Updated by Josh Durgin over 2 years ago
This seems possible for many such counters in a distributed system like Ceph, where these values are not tracked monotonically. Is there a way to report them to Prometheus that accepts decreasing values?
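For context on that question: in Prometheus's data model, the metric type that accepts decreasing values is a gauge; only counters carry the monotonicity assumption that rate() and increase() rely on. The sketch below is hypothetical, using the prometheus_client library rather than the actual mgr/prometheus code, with an illustrative metric name and port:

```python
# Hypothetical sketch using the prometheus_client library (not the actual
# mgr/prometheus module). A Gauge may legally move up or down.
import time
from prometheus_client import Gauge, start_http_server

# Illustrative metric name; exported as a gauge so decreases are legal
# instead of being read as counter resets.
num_read_kb = Gauge('ceph_pg_num_read_kb',
                    'KiB read by PGs, exported as a gauge')

def export(values, interval_s=13):
    """Publish whatever Ceph reports, even when a value goes down."""
    for v in values:
        num_read_kb.set(v)   # set() accepts any value, up or down
        time.sleep(interval_s)

if __name__ == '__main__':
    start_http_server(8000)  # port chosen arbitrarily for this sketch
    export([716138503663, 716138539210, 716138564623, 716137750423])
```

The trade-off is that PromQL's rate() and increase() are documented for counters only; on a gauge one would use deriv() or delta(), which behave differently, so per-second throughput dashboards would need to change. The alternative is to keep the counter monotonic at the source.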