ceph-mgr has lost prio=0 perf counters? get_counter seems to ignore them
This was observed on the prometheus module, but the problem seems to be a general mgr one.
I lost a lot of perf counters, I believe in the upgrade to 12.2.2+ (I can't pin it down since it's been a while). One of them is 'osd.n.bluestore.bluestore_compressed', which I had been monitoring continuously. It's still visible in both 'perf dump' and 'perf schema', and I see its priority is 0 (debug).
However, I cannot pull it up no matter what I try:
c = self.get_counter(service['type'], service['id'], "bluestore.bluestore_compressed")
returns an empty result, while, for example,
c = self.get_counter(service['type'], service['id'], "bluestore.submit_lat")
gives a result.
I wasn't able to figure out whether priority gets filtered inside get_counter somehow and, if it does, how to lift the filter. This breaks quite a lot of graphs (prometheus and others).
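To narrow down which counters have gone missing, a helper along the following lines could be dropped into a module. This is only a sketch: it assumes get_counter returns a dict keyed by the counter path whose value is a list of (timestamp, value) samples, and it treats None, an empty dict, or an empty list as "no data". The names find_hidden_counters, is_empty_counter and fake_get_counter are hypothetical; the fake stands in for self.get_counter so the example runs outside ceph-mgr.

```python
def is_empty_counter(result, path):
    """Return True when a get_counter-style result holds no samples for path.

    Assumption: the result is a dict like {path: [(timestamp, value), ...]}.
    None, an empty dict, and an empty sample list all count as empty.
    """
    if not result:
        return True
    return not result.get(path)


def find_hidden_counters(get_counter, service, paths):
    """Return the subset of paths for which get_counter yields no data."""
    hidden = []
    for path in paths:
        result = get_counter(service["type"], service["id"], path)
        if is_empty_counter(result, path):
            hidden.append(path)
    return hidden


def fake_get_counter(svc_type, svc_id, path):
    """Hypothetical stand-in for self.get_counter with made-up sample data."""
    data = {"bluestore.submit_lat": [(1500000000, 42)]}
    return {path: data.get(path, [])}


service = {"type": "osd", "id": "0"}
hidden = find_hidden_counters(
    fake_get_counter,
    service,
    ["bluestore.bluestore_compressed", "bluestore.submit_lat"],
)
print(hidden)
```

Inside a real module, self.get_counter would be passed in place of fake_get_counter, and the resulting list compared against what 'perf schema' reports for the same daemon.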
#1 Updated by John Spray about 2 years ago
Performance counters are indeed filtered by priority. This is controlled by a ceph-mgr setting called mgr_stats_threshold.
If you set it to zero then you'll get everything -- a pretty huge number of counters, but on a smaller cluster that won't hurt too badly.
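As a rough illustration of the effect of the threshold, here is a toy model in Python. It is not Ceph's implementation: the function visible_counters is hypothetical, the priority given to submit_lat is made up, and the "priority >= threshold" comparison is an assumption inferred from the statement that a threshold of zero returns everything, including priority-0 counters.

```python
def visible_counters(schema, threshold):
    """Return counter paths whose priority is at or above the threshold.

    Toy model only: schema maps a counter path to its priority, and the
    comparison (>=) is an assumption, not taken from the Ceph source.
    """
    return sorted(path for path, prio in schema.items() if prio >= threshold)


schema = {
    "bluestore.bluestore_compressed": 0,  # priority 0 (debug), as in the report
    "bluestore.submit_lat": 5,            # hypothetical priority for illustration
}

# With a threshold above zero, the debug-priority counter disappears.
filtered = visible_counters(schema, 5)

# With the threshold at zero, every counter is visible again.
everything = visible_counters(schema, 0)

print(filtered)
print(everything)
```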
#3 Updated by Peter Gervai about 2 years ago
This is dangerously underdocumented, to the point that I don't even have an immediate idea of how to set it (apart from guessing the GLOBAL section of ceph.conf). I usually prefer issues to be converted to documentation problems when there is a good, working, but completely hidden answer. (Try to google "mgr_stats_threshold" or "ceph-mgr setting" and you'll probably see what I mean: nothing at all.) And this behaviour has changed between updates (and consequently stomped on lots of graphs whose data used to be collected but isn't anymore).
I am not sure whether it can be set in a mgr module, whether it is a global-only flag, or something else. So I would prefer that a few words about this enter the docs before this issue is closed into oblivion. (Until then I'll try to guess how it ought to work.)
#4 Updated by John Spray about 2 years ago
It might be a bit of an overstatement to call this dangerous -- data loss is dangerous, a hidden perf counter is annoying :-)
The reason you're not seeing the setting's documentation online is that it has a documentation string in the code, but unfortunately the work to generate the web docs from that metadata hasn't happened yet.
If you can work out a good place to add some words about this to the documentation then PRs are always welcome.