Bug #36453

open

Tasks #36451: mgr/dashboard: Scalability testing

mgr/dashboard: Some REST endpoints grow linearly with OSD count

Added by Ernesto Puerta over 5 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: General
Target version:
% Done: 75%
Source:
Tags:
Backport: pacific quincy
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Endpoints that return OSD information produce payloads whose size grows linearly with the OSD count.

  • /health grows 1-2 kB/OSD. It embeds all relevant Ceph maps (mgr_map, fs_map, osd_map, mon_map). The Landing page is the main consumer of this endpoint, but it only needs osd_map to print "X total OSD (Y up, Z in)". Solution: compute all the information the Landing page needs in a backend controller (/summary?).
  • /osd grows 1-2 kB/OSD.

For a 1000-OSD deployment, that means payloads of around 1-2 MB every 5 seconds per dashboard instance. The actual size can vary considerably, because the payload is plain-text JSON with many variable-length strings and numbers.
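As a rough check of these figures, the arithmetic can be sketched in Python. The per-OSD sizes and the 5-second refresh interval come from the description above; the function name is illustrative only, 1 kB is taken as 1000 bytes, and compression and HTTP overhead are ignored.

```python
def estimate_polling_load(osd_count, bytes_per_osd, interval_s=5.0):
    """Return (payload_bytes, bytes_per_second) for one dashboard instance."""
    payload = osd_count * bytes_per_osd
    return payload, payload / interval_s


# Lower and upper bound from the description: 1-2 kB per OSD, 1000 OSDs.
for per_osd in (1_000, 2_000):
    payload, rate = estimate_polling_load(1000, per_osd)
    print(f"{per_osd} B/OSD -> {payload / 1e6:.1f} MB per poll, "
          f"{rate * 8 / 1e6:.1f} Mbit/s sustained")
# 1000 B/OSD -> 1.0 MB per poll, 1.6 Mbit/s sustained
# 2000 B/OSD -> 2.0 MB per poll, 3.2 Mbit/s sustained
```

Each additional open dashboard instance adds the same load again, since every instance polls independently.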

The above might affect the following:
  • Networking: especially wireless networks.
  • Server-side: caching either Ceph-mgr results or endpoint payloads could improve performance.
    • Solution: cache ceph-mgr responses.
    • Solution: using HTTP cache control (single-user multiple-requests).
    • Solution: cache REST payloads internally (multiple-user).
  • Client-side: user experience may be negatively affected by parsing and processing large chunks of JSON.
    • Solution: lightweight data exchange formats (BSON, MessagePack).
    • Solution: delta JSON (PATCH-like).
    • Solution: more specialized REST resources (instead of generic ones, like /health).
    • Solution: REST API pagination support.
    • Solution: REST API field selector/filtering support.
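To illustrate the "cache ceph-mgr responses" idea from the list above, here is a minimal time-to-live cache sketch in Python. It is not taken from the dashboard code: the class name `TTLCache`, the `fetch_osd_map` loader, and the 5-second TTL (chosen to match the refresh interval) are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Keep loaded values for ttl_s seconds, then reload them on demand."""

    def __init__(self, ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so that tests can control time
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]  # still fresh: skip the expensive call
        value = loader()
        self._entries[key] = (now, value)
        return value


calls = []


def fetch_osd_map():
    # Hypothetical stand-in for an expensive ceph-mgr query.
    calls.append(1)
    return {"osds": []}


cache = TTLCache(ttl_s=5.0)
cache.get("osd_map", fetch_osd_map)
cache.get("osd_map", fetch_osd_map)
print(len(calls))  # 1: the second request is served from the cache
```

With a cache like this, several users polling the same endpoint within one TTL window would trigger a single backend query. The trade-off is that clients may see data up to one TTL old.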

Subtasks 4 (1 open, 3 closed)

  • Bug #36674: mgr/dashboard: Enable compression for backend requests | Closed | Zack Cerza
  • Feature #36675: mgr/dashboard: Provide API endpoint providing minimal health data | Closed | Zack Cerza
  • Feature #37298: mgr/dashboard: Support a more compact data format (MessagePack, BSON) | Rejected | Zack Cerza
  • Bug #56511: mgr/dashboard: paginate OSDs | New | Aashish Sharma

Related issues 1 (1 open, 0 closed)

  • Related to Dashboard - Feature #40907: mgr/dashboard: REST API improvements | New | Ernesto Puerta

Actions #1

Updated by Ernesto Puerta over 5 years ago

  • Description updated (diff)
Actions #2

Updated by Lenz Grimmer over 5 years ago

  • Subject changed from Some REST endpoints grow linearly with OSD count to mgr/dashboard: Some REST endpoints grow linearly with OSD count
Actions #3

Updated by Ernesto Puerta over 5 years ago

  • Description updated (diff)
Actions #4

Updated by Ernesto Puerta over 5 years ago

  • % Done changed from 0 to 10
Actions #5

Updated by Ernesto Puerta over 5 years ago

  • Status changed from New to In Progress
Actions #6

Updated by Zack Cerza over 5 years ago

Now that we have some compression at least, I'm wondering if a good next step could be reducing the amount of data returned by /dashboard/health to only what HealthComponent requires. This could be done by modifying the /dashboard/health endpoint directly, or by creating another, similarly-named endpoint that provides the whittled-down data.

I've been experimenting with this on a vstart cluster, and without fully optimizing, my results are just under 40% of the size of the current response. This is of course with just 3 OSDs and one FS. The vast majority of what remains is audit_log and clog.

Actions #7

Updated by Alfonso Martínez over 5 years ago

Zack Cerza wrote:

Now that we have some compression at least, I'm wondering if a good next step could be reducing the amount of data returned by /dashboard/health to only what HealthComponent requires. This could be done by modifying the /dashboard/health endpoint directly, or by creating another, similarly-named endpoint that provides the whittled-down data.

I've been experimenting with this on a vstart cluster, and without fully optimizing, my results are just under 40% of the size of the current response. This is of course with just 3 OSDs and one FS. The vast majority of what remains is audit_log and clog.

I agree: e.g. the endpoint should return "OSD summary" data.
My suggestion is to create several subtasks for every optimization step (in fact, we should have created a compression subtask and marked it as resolved).
As soon as this issue https://tracker.ceph.com/issues/24571 is resolved, we can get rid of audit_log and clog.

Actions #8

Updated by Ernesto Puerta over 5 years ago

Zack Cerza wrote:

Now that we have some compression at least, I'm wondering if a good next step could be reducing the amount of data returned by /dashboard/health to only what HealthComponent requires. This could be done by modifying the /dashboard/health endpoint directly, or by creating another, similarly-named endpoint that provides the whittled-down data.

I've been experimenting with this on a vstart cluster, and without fully optimizing, my results are just under 40% of the size of the current response. This is of course with just 3 OSDs and one FS. The vast majority of what remains is audit_log and clog.

The /summary endpoint seems a good place for minimal, pre-processed data, while /health is just the opposite: a raw dump of all Ceph maps and metadata. So my vote is to move as much as possible to /summary, so that the Landing page no longer needs to poll /health.
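A minimal Python sketch of the kind of pre-processed data such an endpoint could return for the "X total OSD (Y up, Z in)" line. It assumes that osd_map carries an "osds" list whose entries have integer "up" and "in" flags. The function names are hypothetical and are not part of the dashboard code.

```python
from typing import Any, Dict


def osd_summary(osd_map: Dict[str, Any]) -> Dict[str, int]:
    """Reduce a full OSD map to the three counters the Landing page shows."""
    osds = osd_map.get("osds", [])
    return {
        "total": len(osds),
        "up": sum(1 for osd in osds if osd.get("up")),
        "in": sum(1 for osd in osds if osd.get("in")),
    }


def format_osd_summary(summary: Dict[str, int]) -> str:
    return (f"{summary['total']} total OSD "
            f"({summary['up']} up, {summary['in']} in)")


# Assumed map layout, reduced to the fields used here.
sample_osd_map = {
    "osds": [
        {"osd": 0, "up": 1, "in": 1},
        {"osd": 1, "up": 1, "in": 0},
        {"osd": 2, "up": 0, "in": 0},
    ]
}
print(format_osd_summary(osd_summary(sample_osd_map)))
# 3 total OSD (2 up, 1 in)
```

The response size is then constant, whatever the OSD count.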

Another way to reduce data size would be to add pagination to some (potentially) heavyweight resources (OSDs, RBD images/snapshots, config options, etc.). That would require more work on both the back end and the front end, but would be better aligned with offering complete RESTful API semantics (filtering, sorting, etc.).
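The pagination, sorting, and field-selection semantics mentioned here could look roughly like the following Python sketch. The parameter names (`offset`, `limit`, `sort_key`, `fields`) and the response envelope are assumptions for illustration, not an existing dashboard API.

```python
from typing import Any, Dict, List, Optional, Sequence


def paginate(items: Sequence[Dict[str, Any]], offset: int = 0,
             limit: int = 10, sort_key: Optional[str] = None,
             fields: Optional[List[str]] = None) -> Dict[str, Any]:
    """Return one page of items, optionally sorted and trimmed to `fields`."""
    if offset < 0 or limit < 1:
        raise ValueError("offset must be >= 0 and limit must be >= 1")
    rows = sorted(items, key=lambda r: r[sort_key]) if sort_key else list(items)
    page = rows[offset:offset + limit]
    if fields is not None:
        # Field selection: drop everything the client did not ask for.
        page = [{k: r[k] for k in fields if k in r} for r in page]
    return {"total": len(rows), "offset": offset, "limit": limit, "items": page}


# Synthetic data: 25 OSD-like records.
osds = [{"osd": i, "up": 1, "host": f"node{i % 3}"} for i in range(25)]
result = paginate(osds, offset=20, limit=10, sort_key="osd", fields=["osd"])
print(result["total"], result["items"])
# 25 [{'osd': 20}, {'osd': 21}, {'osd': 22}, {'osd': 23}, {'osd': 24}]
```

Returning `total` alongside the page lets the front end render page controls without fetching the full list.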

Actions #9

Updated by Zack Cerza over 5 years ago

I've got a WIP that can produce a minimal version of what /dashboard/health currently provides.

On my cluster, before the changes, we were transmitting 32 KB (6 KB compressed) every 5 s. With my WIP, we're transmitting 11 KB (1.4 KB compressed). Once we can strip out the log data, we'll be down to 1 KB (536 B compressed)!
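The thread does not say how these sizes were measured. One way to compare raw and compressed sizes of a JSON payload is sketched below in Python, assuming gzip compression. The sample payload is synthetic, not real /dashboard/health output, so its sizes will differ from the figures above.

```python
import gzip
import json
from typing import Any, Tuple


def payload_sizes(obj: Any) -> Tuple[int, int]:
    """Return (raw_bytes, gzip_bytes) for obj serialized as compact JSON."""
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Synthetic stand-in for a per-OSD payload; field names are illustrative.
sample = {
    "osds": [
        {"osd": i, "up": 1, "in": 1, "host": f"node{i % 10}"}
        for i in range(1000)
    ]
}
raw_size, gz_size = payload_sizes(sample)
print(f"raw: {raw_size} B, gzip: {gz_size} B")
```

Repetitive JSON keys tend to compress well, which is consistent with the large ratios reported in this comment.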

It's not ready for review yet, and I'm sure it will need updates to tests.

Also, I agree re: subtasks, but didn't have the time today to get to creating them.

Actions #10

Updated by Patrick Seidensal over 5 years ago

  • Description updated (diff)
Actions #11

Updated by Ernesto Puerta almost 5 years ago

  • Assignee set to Ernesto Puerta
Actions #12

Updated by Lenz Grimmer almost 5 years ago

  • Target version set to v15.0.0
Actions #13

Updated by Lenz Grimmer almost 5 years ago

  • Tags set to performance
  • Tags deleted (scalability)
Actions #14

Updated by Ernesto Puerta about 4 years ago

  • Related to Feature #40907: mgr/dashboard: REST API improvements added
Actions #15

Updated by Ernesto Puerta over 2 years ago

  • Pull request ID set to 43771
Actions #16

Updated by Ernesto Puerta over 2 years ago

  • Target version changed from v15.0.0 to v17.0.0
  • Backport set to pacific
Actions #17

Updated by Alfonso Martínez over 2 years ago

  • Pull request ID changed from 43771 to 44120
Actions #18

Updated by Ernesto Puerta over 1 year ago

  • Status changed from In Progress to New
  • Assignee deleted (Ernesto Puerta)
  • Parent task deleted (#36451)
  • Backport changed from pacific to pacific quincy
Actions #19

Updated by Ernesto Puerta over 1 year ago

  • Parent task set to #36451
  • Pull request ID deleted (44120)