Bug #36453

Tasks #36451: mgr/dashboard: Scalability testing

mgr/dashboard: Some REST endpoints grow linearly with OSD count

Added by Ernesto Puerta 10 months ago. Updated about 1 month ago.

Status: In Progress
Priority: Normal
Category: dashboard/general
Target version:
Start date: 11/01/2018
Due date:
% Done: 100%
Source:
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Endpoints that provide OSD information return payloads whose size grows linearly with the OSD count.

  • /health grows 1-2 kB/OSD. It embeds all relevant Ceph maps (mgr_map, fs_map, osd_map, mon_map). The Landing page is the main consumer of this endpoint, but it only needs osd_map to print "X total OSD (Y up, Z in)". Solution: compute all the info the Landing page needs in a backend controller (/summary?).
  • /osd grows 1-2 kB/OSD.
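
A minimal sketch of the kind of reduction a /summary-style controller could perform for the "X total OSD (Y up, Z in)" line. The function names are hypothetical, and the input shape (an "osds" list whose entries carry "up" and "in" flags) is an assumption modelled on the JSON OSD dump, not taken from the dashboard code:

```python
from typing import Any, Dict


def osd_summary(osd_map: Dict[str, Any]) -> Dict[str, int]:
    """Reduce an OSD map to the three counters the Landing page prints.

    Assumes osd_map["osds"] is a list of dicts with truthy/falsy
    "up" and "in" flags (hypothetical shape for illustration).
    """
    osds = osd_map.get("osds", [])
    return {
        "total": len(osds),
        "up": sum(1 for osd in osds if osd.get("up")),
        "in": sum(1 for osd in osds if osd.get("in")),
    }


def format_osd_summary(summary: Dict[str, int]) -> str:
    """Render the counters in the Landing page wording."""
    return "{} total OSD ({} up, {} in)".format(
        summary["total"], summary["up"], summary["in"]
    )
```

The result is three integers regardless of cluster size, so a response built this way would no longer grow with the OSD count.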

At 1-2 kB per OSD, a 1000-OSD deployment would mean payloads of around 1-2 MB every 5 seconds per dashboard instance. The actual size can vary considerably, since the payload is plain-text JSON with many variable-length strings and numbers.
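
The estimate can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the helper name is hypothetical, and it assumes 1 kB = 1000 bytes and a fixed 5-second refresh interval:

```python
def payload_estimate(osd_count, bytes_per_osd, refresh_seconds=5.0):
    """Return (payload size in bytes, sustained bytes per second)."""
    payload_bytes = osd_count * bytes_per_osd
    return payload_bytes, payload_bytes / refresh_seconds


# Lower and upper bound of the observed 1-2 kB/OSD growth.
for per_osd in (1000, 2000):
    size, rate = payload_estimate(1000, per_osd)
    print("{} B/OSD: {:.1f} MB per poll, {:.0f} kB/s sustained".format(
        per_osd, size / 1e6, rate / 1e3))
```

Under these assumptions, each dashboard instance polling one such endpoint would sustain roughly 200-400 kB/s of uncompressed traffic.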

The above might impact the following:
  • Networking: especially wireless networks.
  • Server-side: caching either Ceph-mgr results or endpoint payloads could improve performance.
    • Solution: cache ceph-mgr responses.
    • Solution: use HTTP cache control (single-user multiple-requests).
    • Solution: cache REST payloads internally (multiple-user).
  • Client-side: user experience may be negatively affected by parsing and processing large chunks of JSON.
    • Solution: lightweight data exchange formats (BSON, MessagePack).
    • Solution: delta JSON (PATCH-like).
    • Solution: more specialized REST resources (instead of general-purpose ones, like /health).
    • Solution: REST API pagination support.
    • Solution: REST API field selector/filtering support.
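
As an illustration of the last two items, pagination and field selection can be combined in one small helper. This is a sketch under assumptions: the function names, the offset/limit/fields parameters and the response envelope are hypothetical and do not describe an existing dashboard API:

```python
from typing import Any, Dict, List, Optional, Sequence


def select_fields(item: Dict[str, Any], fields: Sequence[str]) -> Dict[str, Any]:
    """Keep only the requested keys of one resource (field selector)."""
    return {key: item[key] for key in fields if key in item}


def paginate(items: List[Dict[str, Any]],
             offset: int = 0,
             limit: int = 50,
             fields: Optional[Sequence[str]] = None) -> Dict[str, Any]:
    """Return one page of resources, optionally trimmed to some fields."""
    page = items[offset:offset + limit]
    if fields is not None:
        page = [select_fields(item, fields) for item in page]
    # "total" lets the client render page controls without fetching everything.
    return {"total": len(items), "offset": offset, "limit": limit, "items": page}
```

With a fixed limit, the response size is bounded by the page size rather than by the number of OSDs, and the field selector trims each entry to what the client actually renders.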

Subtasks

Bug #36674: mgr/dashboard: Enable compression for backend requests (Closed, Zack Cerza)

Feature #36675: mgr/dashboard: Provide API endpoint providing minimal health data (Closed, Zack Cerza)

Feature #37298: mgr/dashboard: Support a more compact data format (MessagePack, BSON) (Rejected, Zack Cerza)

History

#1 Updated by Ernesto Puerta 10 months ago

  • Description updated

#2 Updated by Lenz Grimmer 10 months ago

  • Subject changed from Some REST endpoints grow linearly with OSD count to mgr/dashboard: Some REST endpoints grow linearly with OSD count

#3 Updated by Ernesto Puerta 10 months ago

  • Description updated

#4 Updated by Ernesto Puerta 10 months ago

  • % Done changed from 0 to 10

#5 Updated by Ernesto Puerta 10 months ago

  • Status changed from New to In Progress

#6 Updated by Zack Cerza 10 months ago

Now that we have some compression at least, I'm wondering if a good next step could be reducing the amount of data returned by /dashboard/health to only what HealthComponent requires. This could be done by modifying the /dashboard/health endpoint directly, or by creating another, similarly-named endpoint that provides the whittled-down data.

I've been experimenting with this on a vstart cluster, and without fully optimizing, my results are just under 40% of the size of the current response. This is of course with just 3 OSDs and one FS. The vast majority of what remains is audit_log and clog.
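
A comparison like this one can be measured with a short helper that reports the raw and gzip-compressed size of a JSON payload. This is only a sketch: the function name is hypothetical, and the compact serialization and default gzip settings are assumptions that may not match what the dashboard backend actually sends:

```python
import gzip
import json


def payload_sizes(payload):
    """Return (raw_bytes, gzip_bytes) for a JSON-serializable payload."""
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


def size_ratio(full, reduced):
    """Raw size of the reduced payload as a fraction of the full one."""
    return payload_sizes(reduced)[0] / payload_sizes(full)[0]
```

Calling size_ratio() on the current response and on a whittled-down one gives the fraction quoted above, and payload_sizes() shows how much of the remaining gap compression already covers.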

#7 Updated by Alfonso MH 10 months ago

Zack Cerza wrote:

Now that we have some compression at least, I'm wondering if a good next step could be reducing the amount of data returned by /dashboard/health to only what HealthComponent requires. This could be done by modifying the /dashboard/health endpoint directly, or by creating another, similarly-named endpoint that provides the whittled-down data.

I've been experimenting with this on a vstart cluster, and without fully optimizing, my results are just under 40% of the size of the current response. This is of course with just 3 OSDs and one FS. The vast majority of what remains is audit_log and clog.

I agree: e.g. the endpoint should return "OSD summary" data.
My suggestion is to create several subtasks for every optimization step (in fact, we should have created a compression subtask and marked it as resolved).
As soon as this issue https://tracker.ceph.com/issues/24571 is resolved, we can get rid of audit_log and clog.

#8 Updated by Ernesto Puerta 10 months ago

Zack Cerza wrote:

Now that we have some compression at least, I'm wondering if a good next step could be reducing the amount of data returned by /dashboard/health to only what HealthComponent requires. This could be done by modifying the /dashboard/health endpoint directly, or by creating another, similarly-named endpoint that provides the whittled-down data.

I've been experimenting with this on a vstart cluster, and without fully optimizing, my results are just under 40% of the size of the current response. This is of course with just 3 OSDs and one FS. The vast majority of what remains is audit_log and clog.

The /summary endpoint seems a good place for minimal, pre-processed data, while /health is just the opposite: a raw dump of all Ceph maps and metadata. So my vote goes in favor of moving as much as possible to /summary, so that the Landing page no longer needs to poll /health.

Another way to reduce data size would be to add pagination to some potentially heavyweight resources (OSDs, RBD images/snapshots, config options, etc.). That would require more work on both the back end and the front end, but would be better aligned with offering complete RESTful API semantics (filtering, sorting, etc.).

#9 Updated by Zack Cerza 10 months ago

I've got a WIP that can produce a minimal version of what /dashboard/health currently provides.

On my cluster, before the changes, we were transmitting 32KB (6KB compressed) every 5s. With my WIP, we're transmitting 11KB (1.4KB compressed). Once we can strip out the log data, we'll be down to 1KB (536B compressed)!

It's not ready for review yet, and I'm sure it will need updates to tests.

Also, I agree re: subtasks, but didn't have the time today to get to creating them.

#10 Updated by Patrick Seidensal 10 months ago

  • Description updated

#11 Updated by Ernesto Puerta 3 months ago

  • Assignee set to Ernesto Puerta

#12 Updated by Lenz Grimmer about 1 month ago

  • Target version set to v15.0.0

#13 Updated by Lenz Grimmer about 1 month ago

  • Tags set to performance
  • Tags deleted (scalability)
