CDM 02-FEB-2022

  • How should we collect metrics from key daemons?
    Currently all daemons report performance and state to the mgr, and the mgr/prometheus module exposes this data to monitoring and alerting daemons. However, as our need for operational data grows, this approach places further load on the mgr. To address this, two strategies have been proposed.
    1. Place an exporter daemon on every node: This daemon becomes the contact point for monitoring stacks, and would be responsible for gathering state/performance/capacity info from each daemon on that node. This approach isolates the 3rd-party prometheus integration library from core ceph code, eliminating the potential for this layer to impact the stability or security of the main ceph daemons. The main downside is that the daemon would need to actively discover and collect data from all other daemons on the host (which in a containerised context could prove challenging/problematic, especially in a kubernetes environment...hostpath?). However, having a single collector on each node keeps the number of endpoints to scrape equal to the number of nodes in the ceph cluster, and adds only a single new TCP port requirement per node.
    2. Embed the exporter http(s) endpoint into the relevant ceph daemons: This approach would extend the current rbd-mirror, cephfs-mirror and radosgw daemons to include an http endpoint based on beast and the prometheus-cpp library. Since this approach embeds the endpoint in the main daemon, extracting and exposing the data is straightforward. The downside is that the introduced 3rd-party code could affect the stability or security of the ceph daemon, and if the strategy encompasses OSDs, prometheus could face thousands of additional endpoints, complicating the prometheus config and the service monitor definitions needed to point at all the individual daemons.
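The per-node collector in option 1 could be sketched roughly as below. This is a minimal illustration only: the daemon names, metric names and values are made up, and a real collector would discover the daemons on the host and query each one (e.g. via its admin socket) rather than being handed a dict.

```python
def render_prometheus(metrics_by_daemon):
    """Merge per-daemon metric dicts into one Prometheus exposition payload.

    metrics_by_daemon: {"osd.0": {"osd_op_w": 123}, ...}
    Each sample is labelled with the daemon it came from, so a single
    per-node endpoint can serve every daemon running on the host.
    """
    lines = []
    for daemon, metrics in sorted(metrics_by_daemon.items()):
        for name, value in sorted(metrics.items()):
            lines.append('ceph_%s{ceph_daemon="%s"} %s' % (name, daemon, value))
    return "\n".join(lines) + "\n"

# Illustrative data only -- a real collector would gather this from the
# local daemons (admin socket, etc.) on each scrape.
payload = render_prometheus({
    "osd.0": {"osd_op_w": 123, "osd_op_r": 456},
    "mon.a": {"mon_num_sessions": 3},
})
print(payload)
```

Serving `payload` over a single http(s) port per node is what keeps the Prometheus scrape config flat regardless of how many daemons each host runs.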

Aside from the implementation differences, the other factor to consider is the sample size returned to the Prometheus server. For example, mgr/prometheus currently returns perf counters for the whole cluster. This is problematic: on a large cluster of 3,776 OSDs, the mgr/prometheus module attempts to return 850,000 samples (50+MB) to prometheus every 15s! This results in scrape failures and stale data, which impacts monitoring and alerting. Whichever option is chosen, the endgame should be to limit the metrics we expose by default to those we will consume in monitoring, alerting and dashboards.
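The scale of the problem is easy to quantify from the numbers above. As a back-of-envelope check (treating every sample as OSD-derived, which is a simplification since mons, mgrs etc. also contribute):

```python
# Figures from the 3,776-OSD cluster described above.
osds = 3776
samples = 850_000
payload_bytes = 50 * 1024 * 1024   # the "50+MB" per-scrape payload
scrape_interval_s = 15

samples_per_osd = samples / osds            # ~225 samples per OSD
bytes_per_sample = payload_bytes / samples  # ~62 bytes per sample
throughput_mib_s = payload_bytes / scrape_interval_s / 1024 / 1024

print(round(samples_per_osd), round(bytes_per_sample), round(throughput_mib_s, 1))
```

So the mgr is sustaining over 3 MiB/s of serialization and network output just for metrics, and both figures grow linearly with OSD count, which is why trimming the default metric set matters under either option.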

[Ernesto]: What about considering a 3rd approach? Stick to the current architecture (Ceph-mgr as a highly-available single source of truth and caching layer for Ceph management & monitoring) and tackle the bottlenecks we already know about: the C++ to Python serialization of large data chunks (plus the Python Global Interpreter Lock, plus our own locks) results in thread contention under heavy load (clusters with a high number of OSDs, PGs, etc.).

So, why not:
  • Providing fine-grained ceph-mgr API calls.
    • E.g.: serializing only perf-counters above a given prio-limit (there are just 10 perf-counters with PRIO_CRITICAL vs. 68 with >= PRIO_INTERESTING vs. 169 with >PRIO_USEFUL). We could raise all the perfcounters that we need for Grafana/Prometheus to PRIO_CRITICAL or PRIO_INTERESTING (or add a new tag-like field to perfcounters).
  • Exploring other (perhaps more efficient) serialization formats (JSON, MsgPack, Google's FlatBuffers). The existing Formatter class makes it (almost) straightforward to support new data formats.
  • Bringing true active-active (load-balancing) scalability to ceph-mgr.
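The prio-limit idea in the first bullet could look roughly like this on the Python side. The priority values, counter names and schema below are illustrative only (the real priority constants live on the C++ side, and the mgr's Python interface exposes a similar `prio_limit` style filter):

```python
# Hypothetical priority levels mirroring Ceph's perf-counter priorities;
# the actual values/constants are defined in the Ceph source, not here.
PRIO_CRITICAL = 10
PRIO_INTERESTING = 8
PRIO_USEFUL = 5
PRIO_DEBUGONLY = 0

def filter_by_prio(counters, prio_limit):
    """Return only counters at or above prio_limit, so the mgr serializes
    a small curated set instead of every perf counter in the cluster."""
    return {name: c for name, c in counters.items()
            if c["priority"] >= prio_limit}

# Illustrative counters -- names, priorities and values are made up.
counters = {
    "osd.op_latency":         {"priority": PRIO_CRITICAL,    "value": 0.004},
    "osd.op_w":               {"priority": PRIO_INTERESTING, "value": 123},
    "osd.recovery_bytes":     {"priority": PRIO_USEFUL,      "value": 0},
    "osd.some_debug_counter": {"priority": PRIO_DEBUGONLY,   "value": 7},
}
critical_only = filter_by_prio(counters, PRIO_CRITICAL)
print(sorted(critical_only))
```

With only ~10 critical counters versus hundreds overall, filtering at serialization time shrinks the C++ to Python transfer (and the scrape payload) by an order of magnitude without changing the architecture.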

My point here is that by moving metric collection outside the Ceph-mgr we are not solving the root cause (which will still impact other modules: dashboard, progress, etc.), and we're diluting (undermining) the purpose of the Ceph-mgr.