Bug #36453
Parent task (open): #36451: mgr/dashboard: Scalability testing
mgr/dashboard: Some REST endpoints grow linearly with OSD count
% Done: 75%
Description
Endpoints providing information on OSDs show linear payload growth with OSD count.
- /health grows 1-2 kB/OSD. It embeds all relevant Ceph maps (mgr_map, fs_map, osd_map, mon_map). The Landing page is the main consumer of this endpoint, but it only needs the osd_map to print "X total OSDs (Y up, Z in)". Solution: calculate all the info the Landing page needs in a backend controller (/summary?).
- /osd grows 1-2 kB/OSD.
That would mean payloads of around 1-2 MB every 5 seconds per dashboard instance on a 1000-OSD deployment. The resulting size can vary widely, as the payload is plain-text JSON with many variable-length strings and numbers.
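To illustrate how little the Landing page actually needs, a minimal sketch of the proposed backend aggregation (the `osds`/`up`/`in` field names follow the usual `osd dump` JSON shape, but treat them as assumptions here):

```python
# Hypothetical sketch: the Landing page only needs an aggregate like
# "X total OSDs (Y up, Z in)", which a backend controller can compute
# from the osd_map instead of shipping the whole map to the browser.
def osd_summary(osd_map):
    """Reduce a full osd_map dict to the three counters the UI shows."""
    osds = osd_map.get('osds', [])
    return {
        'total': len(osds),
        'up': sum(1 for o in osds if o.get('up')),
        'in': sum(1 for o in osds if o.get('in')),
    }

# Example: 3 OSDs, one of them down and out.
example_map = {'osds': [{'up': 1, 'in': 1}, {'up': 1, 'in': 1}, {'up': 0, 'in': 0}]}
print(osd_summary(example_map))  # {'total': 3, 'up': 2, 'in': 2}
```

The payload stays a few dozen bytes regardless of cluster size, instead of growing 1-2 kB per OSD.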
The above might impact the following:
- Networking: especially over wireless links.
- Solution: enable compression on the wire. FIXED: https://github.com/ceph/ceph/pull/24727
- Solution: delta JSON (PATCH-like).
- Solution: more compact data exchange formats (BSON, MessagePack).
- Server-side: caching either Ceph-mgr results or endpoint payloads could improve performance.
- Solution: cache ceph-mgr responses.
- Solution: using HTTP cache control (single-user multiple-requests).
- Solution: cache REST payloads internally (multiple-user).
- Client-side: user experience may be negatively affected by parsing and processing large chunks of JSON.
- Solution: lightweight data exchange formats (BSON, MessagePack).
- Solution: delta JSON (PATCH-like).
- Solution: more specialized REST resources (instead of generic ones like /health).
- Solution: REST API pagination support.
- Solution: REST API field selector/filtering support.
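One of the ideas listed twice above, delta JSON (PATCH-like), could look like this minimal sketch. It handles top-level keys only and encodes a removed key as `None`, so it cannot represent a key whose real value is `None`; the function names are illustrative, not part of the dashboard API:

```python
# Hedged sketch of a "delta JSON" exchange: each poll, send only the
# top-level keys that changed since the last payload instead of the
# full payload.
_MISSING = object()

def json_delta(old, new):
    """Keys whose values changed in `new`; keys dropped from `old` map to None."""
    delta = {k: v for k, v in new.items() if old.get(k, _MISSING) != v}
    delta.update({k: None for k in old if k not in new})
    return delta

def apply_delta(old, delta):
    """Reconstruct the new payload from the previous one plus a delta."""
    merged = dict(old)
    for k, v in delta.items():
        if v is None:
            merged.pop(k, None)
        else:
            merged[k] = v
    return merged
```

With largely static payloads (most Ceph maps change rarely), the delta is usually a small fraction of the full document.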
Updated by Lenz Grimmer over 5 years ago
- Subject changed from Some REST endpoints grow linearly with OSD count to mgr/dashboard: Some REST endpoints grow linearly with OSD count
Updated by Ernesto Puerta over 5 years ago
- Status changed from New to In Progress
Updated by Zack Cerza over 5 years ago
Now that we have some compression at least, I'm wondering if a good next step could be reducing the amount of data returned by /dashboard/health to only what HealthComponent requires. This could be done by modifying the /dashboard/health endpoint directly, or by creating another, similarly-named endpoint that provides the whittled-down data.
I've been experimenting with this on a vstart cluster, and without fully optimizing, my results are just under 40% of the size of the current response. This is of course with just 3 OSDs and one FS. The vast majority of what remains is audit_log and clog.
Updated by Alfonso Martínez over 5 years ago
Zack Cerza wrote:
Now that we have some compression at least, I'm wondering if a good next step could be reducing the amount of data returned by /dashboard/health to only what HealthComponent requires. This could be done by modifying the /dashboard/health endpoint directly, or by creating another, similarly-named endpoint that provides the whittled-down data.
I've been experimenting with this on a vstart cluster, and without fully optimizing, my results are just under 40% of the size of the current response. This is of course with just 3 OSDs and one FS. The vast majority of what remains is audit_log and clog.
I agree: e.g. the endpoint should return "OSD summary" data.
My suggestion is to create several subtasks for every optimization step (in fact, we should have created a compression subtask and marked it as resolved).
As soon as this issue https://tracker.ceph.com/issues/24571 is resolved, we can get rid of audit_log and clog.
Updated by Ernesto Puerta over 5 years ago
Zack Cerza wrote:
Now that we have some compression at least, I'm wondering if a good next step could be reducing the amount of data returned by /dashboard/health to only what HealthComponent requires. This could be done by modifying the /dashboard/health endpoint directly, or by creating another, similarly-named endpoint that provides the whittled-down data.
I've been experimenting with this on a vstart cluster, and without fully optimizing, my results are just under 40% of the size of the current response. This is of course with just 3 OSDs and one FS. The vast majority of what remains is audit_log and clog.
The /summary endpoint seems a good place for minimalistic cooked data, while /health is just the opposite: a raw dump of all Ceph maps and metadata. So my vote goes in favor of moving as much as possible to /summary, so that the Landing page no longer needs to ping /health.
Another move toward reducing data size would be adding pagination to some (potentially) heavyweight resources (OSDs, RBD images/snapshots, config options, etc.). That would require more work on both the back end and the front end, but would be more aligned with complete RESTful API semantics (filtering, sorting, etc.).
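The pagination idea can be sketched as a simple offset/limit envelope. The shape below (total/offset/limit/data, query parameters implied) is an assumption for illustration, not the dashboard's actual API:

```python
# Hedged sketch of REST pagination for heavyweight list resources
# (OSDs, RBD images, config options, ...). The envelope shape is
# hypothetical.
def paginate(items, offset=0, limit=25):
    """Return one page of `items` plus enough metadata to request the rest."""
    return {
        'total': len(items),
        'offset': offset,
        'limit': limit,
        'data': items[offset:offset + limit],
    }
```

A 1000-OSD cluster would then transfer 25 entries per request instead of all 1000, and the client can still render "page X of Y" from `total`.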
Updated by Zack Cerza over 5 years ago
I've got a WIP that can produce a minimal version of what /dashboard/health currently provides.
On my cluster, before the changes, we were transmitting 32KB (6KB compressed) every 5s. With my WIP, we're transmitting 11KB (1.4KB compressed). Once we can strip out the log data, we'll be down to 1KB (536B compressed)!
It's not ready for review yet, and I'm sure it will need updates to tests.
Also, I agree re: subtasks, but didn't have the time today to get to creating them.
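Measurements like the ones above can be reproduced by serializing a payload and gzipping it. A quick sketch (numbers will differ per cluster; this only shows the measurement technique):

```python
import gzip
import json

def payload_sizes(obj):
    """Return (raw_bytes, gzipped_bytes) for a JSON-serializable object."""
    raw = json.dumps(obj).encode('utf-8')
    return len(raw), len(gzip.compress(raw))

# Repetitive JSON (like per-OSD entries) compresses very well, which is
# why on-the-wire compression helps so much here.
raw_bytes, gz_bytes = payload_sizes([{'up': 1, 'in': 1, 'state': 'active'}] * 500)
```

This also makes it easy to compare candidate optimizations (field whitelisting, log stripping) before touching the endpoint code.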
Updated by Lenz Grimmer almost 5 years ago
- Tags set to performance
- Tags deleted (scalability)
Updated by Ernesto Puerta about 4 years ago
- Related to Feature #40907: mgr/dashboard: REST API improvements added
Updated by Ernesto Puerta over 2 years ago
- Target version changed from v15.0.0 to v17.0.0
- Backport set to pacific
Updated by Alfonso Martínez over 2 years ago
- Pull request ID changed from 43771 to 44120
Updated by Ernesto Puerta over 1 year ago
- Status changed from In Progress to New
- Assignee deleted (Ernesto Puerta)
- Parent task deleted (#36451)
- Backport changed from pacific to pacific quincy
Updated by Ernesto Puerta over 1 year ago
- Parent task set to #36451
- Pull request ID deleted (44120)