Bug #24151

ceph-mgr has lost prio=0 perf counters? get_counter seems to ignore them

Added by Peter Gervai about 2 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
ceph-mgr
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
4 - irritation
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

This was observed in the prometheus module, but the problem seems to be a general mgr one.

I lost a lot of perf counters in the upgrade to 12.2.2+, I believe (I can't pin it down since it's been a while). One of them was 'osd.n.bluestore.bluestore_compressed', which I had been monitoring continuously. It is still visible in both 'perf dump' and 'perf schema', and I see its priority is 0 (debug).

However, I cannot pull these counters up no matter what I try:

    c = self.get_counter(service['type'], service['id'], "bluestore.bluestore_compressed")

returns an empty result, while, for example,

    c = self.get_counter(service['type'], service['id'], "bluestore.submit_lat")

returns data.

I wasn't able to figure out whether priority gets filtered inside get_counter somehow and, if it does, how to lift the filter. This breaks quite a lot of graphs (prometheus and others).

History

#1 Updated by John Spray about 2 years ago

Performance counters are indeed filtered by priority. This is controlled by a ceph-mgr setting called mgr_stats_threshold.

If you set it to zero then you'll get everything -- a pretty huge number of counters, but on a smaller cluster that won't hurt too badly.
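
To make the effect of the threshold concrete, here is a minimal sketch of priority-based filtering. It is a toy model, not ceph-mgr code: the function name `visible_counters`, the sample schema, and the priority of 5 given to `submit_lat` are all assumptions made for illustration. Only the priority of 0 for `bluestore_compressed` comes from the report above. The comparison rule (keep a counter when its priority is at or above the threshold) is likewise an assumption, chosen so that a threshold of zero returns everything.

```python
# Illustrative only: a toy model of priority-based filtering, not ceph-mgr code.
from typing import Dict

# Hypothetical schema, loosely shaped like `perf schema` output:
# {subsystem: {counter: {"priority": int}}}.
# The priority of submit_lat is made up; bluestore_compressed is 0 per the report.
SAMPLE_SCHEMA: Dict[str, Dict[str, Dict[str, int]]] = {
    "bluestore": {
        "bluestore_compressed": {"priority": 0},
        "submit_lat": {"priority": 5},
    },
}


def visible_counters(schema: Dict[str, Dict[str, Dict[str, int]]],
                     threshold: int) -> Dict[str, int]:
    """Return {"subsystem.counter": priority} for counters at or above threshold."""
    visible: Dict[str, int] = {}
    for subsystem, counters in schema.items():
        for name, meta in counters.items():
            priority = meta.get("priority", 0)
            # Assumed rule: counters below the threshold are dropped.
            if priority >= threshold:
                visible[f"{subsystem}.{name}"] = priority
    return visible


if __name__ == "__main__":
    # With a higher threshold the debug-priority counter disappears.
    print(sorted(visible_counters(SAMPLE_SCHEMA, threshold=5)))
    # With a threshold of zero every counter is kept.
    print(sorted(visible_counters(SAMPLE_SCHEMA, threshold=0)))
```

In this model, the empty result for "bluestore.bluestore_compressed" corresponds to the first call and the fix to the second. As a ceph-mgr option, mgr_stats_threshold would presumably be set in the [mgr] section of ceph.conf; that placement is an assumption not confirmed in this thread, so check it against your release.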

#2 Updated by John Spray about 2 years ago

  • Status changed from New to Closed

#3 Updated by Peter Gervai about 2 years ago

Thanks!

This is dangerously underdocumented, to the point that I don't even have an immediate idea of how to set it (apart from guessing the global section of ceph.conf). I usually prefer issues to be converted into documentation problems when a good, working, but completely hidden answer exists. (Try googling "mgr_stats_threshold" or "ceph-mgr setting" and you will probably see what I mean: nothing at all.) This also changed behaviour between updates, and consequently broke lots of graphs whose data used to be collected but no longer is.

I am not sure whether it can be set in a mgr module, whether it is a global-only flag, or something else. So I would prefer that a few words about this enter the docs before this issue is closed into oblivion. (Until then I'll try to guess how it ought to work.)

#4 Updated by John Spray about 2 years ago

It might be a bit of an overstatement to call this dangerous -- data loss is dangerous, a hidden perf counter is annoying :-)

The reason you're not seeing the setting's documentation online is that it has a documentation string in the code, but unfortunately the work to generate the web docs from that metadata hasn't happened yet.

If you can work out a good place to add some words about this to the documentation then PRs are always welcome.
