Feature #48388

Tasks #36451: mgr/dashboard: Scalability testing

mgr,mgr/dashboard: implement multi-layered caching

Added by Ernesto Puerta over 3 years ago. Updated about 2 years ago.

Status: In Progress
Priority: Normal
Assignee: -
Category: General
Target version: -
% Done: 89%
Source: Development
Tags: performance, scalability
Backport: -
Reviewed: -
Affected Versions: -
Pull request ID: -

Description

Summary

In order to reduce the chances of mgr modules, and specifically the dashboard, compromising Ceph cluster performance through an increased frequency of API calls (unlike other mgr modules, dashboard load is predominantly user-driven), a multi-layered caching approach should be put in place.

Current status

Existing caching approaches and issues:
  • Ceph-mgr itself caches many API calls (get_module_option, get, get_server, get_metadata, ...), so not every request to the ceph-mgr API hits the Ceph cluster. However, send_command() is not cached and might have a performance impact.
  • Additionally, one bottleneck in ceph-mgr is the PyFormatter, the class responsible for deserializing C++ binary structs into Python objects. For big objects (osd_map) this deserialization is not negligible, so it might be worthwhile to cache the resulting deserialized Python object (see the sketch after this list) or to explore an incremental approach that avoids processing the same data over and over.
  • Dashboard back-end: ViewCache decouples REST controller requests from ceph-mgr API calls and allows asynchronous fetching of data.
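A minimal sketch of the deserialized-object idea, assuming cachetools and a hypothetical mgr handle (cached_get, the cache size and the TTL are illustrative; the real MgrModule interface may differ):

    # Sketch: memoize the Python object produced by mgr.get() so the
    # PyFormatter deserialization cost is paid once per TTL window.
    import cachetools

    _get_cache = cachetools.TTLCache(maxsize=32, ttl=10)  # per-module cache

    def cached_get(mgr, what):
        try:
            return _get_cache[what]      # warm hit: no deserialization
        except KeyError:
            value = mgr.get(what)        # cold miss: full PyFormatter pass
            _get_cache[what] = value
            return value

A shared variant would hoist _get_cache out of the module, at the risk (noted in the Proposal below) of callers mutating the objects returned by the cache.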

The following picture shows the existing approaches (ViewCache) and the ones to explore, as well as other potential points (PyFormatter and the non-cached send_command):

Proposal

Layers:
  • Ceph-mgr API:
    • C++: this is optimal, as the cached data is shared across modules. However, it is less trivial to implement.
    • Python: cachetools. This could be introduced at the per-module level (every module interacts with its own version of the cached ceph-mgr API methods) or shared (all modules consume the cached versions of the ceph-mgr API methods, although this could bring issues with modules modifying the objects returned by the cached methods).
  • Dashboard back-end:
    • Python: cachetools
    • CherryPy Cache (it also takes care of HTTP caching; see the sketch after this list)
  • Dashboard front-end:
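For the CherryPy layer mentioned above, a minimal standalone sketch using CherryPy's built-in tools.caching and tools.expires (the controller, payload, and timings are placeholders; the real dashboard wires its controllers and authentication differently):

    # Sketch: server-side response caching plus HTTP expiry headers in CherryPy.
    import cherrypy

    class OsdController:
        @cherrypy.expose
        @cherrypy.tools.json_out()
        def index(self):
            return {'osds': []}  # placeholder payload

    config = {'/': {
        'tools.caching.on': True,    # cache generated responses in memory
        'tools.caching.delay': 5,    # seconds before a cached response expires
        'tools.expires.on': True,    # also emit client-side caching headers
        'tools.expires.secs': 5,
    }}

    if __name__ == '__main__':
        cherrypy.quickstart(OsdController(), '/', config)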
Pros:
  • Reduced load in ceph-mgr
  • Shorter response times
Cons:
  • Increased memory usage
  • Stale data (though TTL caches can mitigate this)
  • Data serialization issues
  • Leaks/ref counting issues

Implementation details

As caching is an optimization strategy, and any optimization needs to be benchmarked, the first step would be to implement a way to measure the effectiveness of caching:
  • From the backend/inner side of the system, that could be the number of calls that actually hit the ceph-mgr API (log message, new mgr CLI command, ...)
  • From the user-facing side of the system, that could be the latency of the call (cold, when the cache is not populated, vs. warm, when the cache is populated and hit by the request). There are 2 possible user-facing points here: the direct RESTful interface (via curl) and the WebUI (via Angular/JS). The second one is more realistic but less fine-grained, so the first one would be preferred for quick benchmarking (see the sketch below).
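For instance, a quick cold-vs-warm probe could look like the following sketch (the URL/port are placeholders, and a real dashboard requires authentication, which is omitted here):

    # Sketch: measure cold vs. warm latency of a dashboard REST endpoint.
    import time
    import requests

    URL = 'https://localhost:11000/api/osd'  # placeholder endpoint

    def timed_get(url):
        t0 = time.perf_counter()
        requests.get(url, verify=False)      # test clusters use self-signed certs
        return time.perf_counter() - t0

    cold = timed_get(URL)                        # cache not populated yet
    warm = [timed_get(URL) for _ in range(20)]   # these should hit the cache
    print('cold: %.3fs, warm avg: %.3fs' % (cold, sum(warm) / len(warm)))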

Additionally, it would be interesting to shortlist the quick wins: those ceph-mgr API calls and dashboard RESTful API endpoints that would benefit most from this caching scheme, i.e. the ones performing the most calls to the ceph-mgr API. Keep in mind that the number of calls might depend on the scale of the cluster (e.g. GET /rbd might require one to many calls per RBD image created, and some users have thousands of RBD images). A counting wrapper, as sketched below, is one way to produce that shortlist.
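A minimal sketch, assuming a hypothetical mgr handle (the wrapper and the key format are illustrative):

    # Sketch: count which ceph-mgr API calls each dashboard request triggers,
    # to rank endpoints by potential caching benefit.
    import functools
    from collections import Counter

    call_counts = Counter()

    def counted(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # key by method name plus first argument, e.g. "get:osd_map"
            call_counts['%s:%s' % (fn.__name__, args[0] if args else '')] += 1
            return fn(*args, **kwargs)
        return wrapper

    # e.g. mgr.get = counted(mgr.get); serve GET /rbd once, then inspect
    # call_counts.most_common(10) to see the hottest calls.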


Files

Ceph Dashboard Caching.png (281 KB), Ernesto Puerta, 03/23/2021 11:05 AM

Subtasks 10 (1 open, 9 closed)

  • Subtask #50310: CA: Add support to inject data in the dashboard backend side - Closed (Pere Díaz Bou)
  • Subtask #50311: CA: Lightweight cli module to expose get - Resolved (Waad Alkhoury)
  • Subtask #51121: mgr/dashboard: final demo - Closed (Avan Thakkar)
  • Subtask #51122: mgr/dashboard: new Grafana dashboard for Caching demo - Rejected (Avan Thakkar)
  • Feature #51123: mgr/dashboard: add new caching metrics to prometheus-exporter - New (Avan Thakkar)
  • Documentation #52119: mgr/dashboard: document caching - Resolved (Pere Díaz Bou)
  • Documentation #52120: mgr/dashboard: document CLI module/benchmarking - Resolved (Waad Alkhoury)
  • Subtask #52299: CA: Module exposing ceph-mgr python API via CLI - Resolved (Waad Alkhoury)
  • Subtask #53561: mgr: TTL cache implementation - Resolved (Pere Díaz Bou)
  • Subtask #52834: mgr/dashboard: add TTL caching to ceph-mgr get method - Resolved (Pere Díaz Bou)

Related issues 2 (2 open, 0 closed)

  • Related to Dashboard - Feature #25166: mgr/dashboard: Add cache pool support - New
  • Related to Dashboard - Feature #40912: mgr/dashboard: REST API: review caching - New
#1

Updated by Ernesto Puerta about 3 years ago

  • Priority changed from Normal to High
#2

Updated by Ernesto Puerta about 3 years ago

#3

Updated by Ernesto Puerta about 3 years ago

  • Description updated (diff)
#4

Updated by Alfonso Martínez about 3 years ago

  • Related to Feature #25166: mgr/dashboard: Add cache pool support added
#5

Updated by Alfonso Martínez about 3 years ago

  • Related to Feature #40912: mgr/dashboard: REST API: review caching added
#6

Updated by Ernesto Puerta about 3 years ago

#7

Updated by Ernesto Puerta about 3 years ago

  • Status changed from New to In Progress
#8

Updated by Ernesto Puerta about 3 years ago

  • Assignee set to Waad Alkhoury
#9

Updated by Ernesto Puerta about 3 years ago

Tasks identified

  1. Create injection hook for osd_map (C++, as Python-only might be misleading)
    1. Create mgr inject <map> <file> CLI command in C++ mgr
    2. Create fake "get(osd_map)" function: e.g. by reading a JSON file and passing it to PyFormatter
  2. Lightweight calls to get("osd_map") from CLI
    1. Create a new mgr module (e.g. API CLI https://github.com/ceph/ceph/pull/34840)
    2. Expose a direct mgr API call (get()): ceph mgr api get osd_map
    3. Expose a benchmark mode: ceph mgr api benchmark get osd_map <number_of_total_calls> <number_of_parallel_calls>
  3. Command to obtain total memory usage from ceph-mgr and modules
  4. Script for benchmarking (sketch; generate_fake_osd_map.sh, the ports and the proposed ceph commands are placeholders):
    # benchmark over a doubling sweep of fake OSD counts
    for num_osd in 1 2 4 8 16 32 64 128 256 512 1024; do
      generate_fake_osd_map.sh "$num_osd" > my_fake_osd_map.json
      ceph mgr inject osd_map my_fake_osd_map.json
      ab -n 1000 -c 100 https://localhost:9043/metrics &   # needed to test shared vs. per-module cache
      ab -n 1000 -c 100 https://localhost:11000/api/osd &
      ceph api_cli benchmark get osd_map 1000 100
      # expected output: latency best,avg,worst in seconds, e.g.:
      # 0.1,0.5,1
      wait   # let the background ab runs finish before the next iteration
    done
    

Expected Output:

  • TEST_BEFORE (num_osds):
    for OSD_num in {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024}:
      obtain latency <best/avg/worst>
      obtain memory (hot cache)
  • TEST_AFTER (num_osds, cache):
    for type_of_cache in {Py_shared_cache, Py_isolated_cache, C++_cache, ...}:
      for OSD_num in {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024}:
        obtain latency <best/avg/worst>
        obtain memory (hot cache)

Notes:

Usually the worst latency will be obtained with a cold cache, while the average and best cases are mostly influenced by warm and hot caches. The TTL selection will play a huge role in how much the worst-case/cold-cache latency influences the average (a small TTL will exhibit more cache misses than a longer one).
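As a back-of-the-envelope illustration of that trade-off (all numbers are made up, not measurements): with a steady request rate there is roughly one miss per TTL window, so

    # Sketch: how TTL drives the miss rate and thus the average latency.
    ttl = 10.0              # cache TTL (s)
    rate = 1.0              # requests per second
    cold, warm = 1.0, 0.05  # cold/warm latencies (s)

    miss_rate = 1.0 / (ttl * rate)                   # ~1 miss per TTL window
    avg = miss_rate * cold + (1 - miss_rate) * warm
    print('miss rate: %.0f%%, avg latency: %.3fs' % (miss_rate * 100, avg))
    # halving the TTL doubles the miss rate, pulling the average toward the cold case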

