Bug #36453
Parent task (open): #36451: mgr/dashboard: Scalability testing
mgr/dashboard: Some REST endpoints grow linearly with OSD count
% Done: 75%
Description
Endpoints providing information on OSDs show linear payload growth with OSD count.
- /health grows 1-2 kB/OSD. It embeds all relevant Ceph maps (mgr_map, fs_map, osd_map, mon_map). The Landing page is the main consumer of this endpoint, but it only needs the osd_map to print "X total OSDs (Y up, Z in)". Solution: calculate all the info the Landing page needs in a backend controller (/summary?).
- /osd grows 1-2 kB/OSD.
That would mean payloads of around 1-2 MB every 5 seconds per dashboard instance on a 1000-OSD deployment. The resulting size can vary widely, as the payload is plain-text JSON with many variable-length strings and numbers.
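To illustrate how little the Landing page actually needs, a minimal sketch of the proposed backend aggregation (the `osds`/`up`/`in` field names follow the usual `osd dump` JSON shape, but treat them as assumptions here):

```python
# Hypothetical sketch: the Landing page only needs an aggregate like
# "X total OSDs (Y up, Z in)", which a backend controller can compute
# from the osd_map instead of shipping the whole map to the browser.
def osd_summary(osd_map):
    """Reduce a full osd_map dict to the three counters the UI shows."""
    osds = osd_map.get('osds', [])
    return {
        'total': len(osds),
        'up': sum(1 for o in osds if o.get('up')),
        'in': sum(1 for o in osds if o.get('in')),
    }

# Example: 3 OSDs, one of them down and out.
example_map = {'osds': [{'up': 1, 'in': 1}, {'up': 1, 'in': 1}, {'up': 0, 'in': 0}]}
print(osd_summary(example_map))  # {'total': 3, 'up': 2, 'in': 2}
```

The payload stays a few dozen bytes regardless of cluster size, instead of growing 1-2 kB per OSD.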
The above might impact the following:
- Networking: especially over wireless links.
- Solution: enable compression on the wire. FIXED: https://github.com/ceph/ceph/pull/24727
- Solution: delta JSON (PATCH-like).
- Solution: more compact data exchange formats (BSON, MessagePack).
- Server-side: caching either Ceph-mgr results or endpoint payloads could improve performance.
- Solution: cache ceph-mgr responses.
- Solution: using HTTP cache control (single-user multiple-requests).
- Solution: cache REST payloads internally (multiple-user).
- Client-side: user experience may be negatively affected by parsing and processing large chunks of JSON.
- Solution: lightweight data exchange formats (BSON, MessagePack).
- Solution: delta JSON (PATCH-like).
- Solution: more specialized REST resources (instead of generic ones like /health).
- Solution: REST API pagination support.
- Solution: REST API field selector/filtering support.
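One of the ideas listed twice above, delta JSON (PATCH-like), could look like this minimal sketch. It handles top-level keys only and encodes a removed key as `None`, so it cannot represent a key whose real value is `None`; the function names are illustrative, not part of the dashboard API:

```python
# Hedged sketch of a "delta JSON" exchange: each poll, send only the
# top-level keys that changed since the last payload instead of the
# full payload.
_MISSING = object()

def json_delta(old, new):
    """Keys whose values changed in `new`; keys dropped from `old` map to None."""
    delta = {k: v for k, v in new.items() if old.get(k, _MISSING) != v}
    delta.update({k: None for k in old if k not in new})
    return delta

def apply_delta(old, delta):
    """Reconstruct the new payload from the previous one plus a delta."""
    merged = dict(old)
    for k, v in delta.items():
        if v is None:
            merged.pop(k, None)
        else:
            merged[k] = v
    return merged
```

With largely static payloads (most Ceph maps change rarely), the delta is usually a small fraction of the full document.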
Updated by Lenz Grimmer over 5 years ago
- Subject changed from Some REST endpoints grow linearly with OSD count to mgr/dashboard: Some REST endpoints grow linearly with OSD count
Updated by Ernesto Puerta over 5 years ago
- Status changed from New to In Progress
Updated by Zack Cerza over 5 years ago
Now that we have some compression at least, I'm wondering if a good next step could be reducing the amount of data returned by /dashboard/health to only what HealthComponent requires. This could be done by modifying the /dashboard/health endpoint directly, or by creating another, similarly-named endpoint that provides the whittled-down data.
I've been experimenting with this on a vstart cluster, and without fully optimizing, my results are just under 40% of the size of the current response. This is of course with just 3 OSDs and one FS. The vast majority of what remains is audit_log and clog.
Updated by Alfonso Martínez over 5 years ago
Zack Cerza wrote:
Now that we have some compression at least, I'm wondering if a good next step could be reducing the amount of data returned by /dashboard/health to only what HealthComponent requires. This could be done by modifying the /dashboard/health endpoint directly, or by creating another, similarly-named endpoint that provides the whittled-down data.
I've been experimenting with this on a vstart cluster, and without fully optimizing, my results are just under 40% of the size of the current response. This is of course with just 3 OSDs and one FS. The vast majority of what remains is audit_log and clog.
I agree: e.g. the endpoint should return "OSD summary" data.
My suggestion is to create several subtasks for every optimization step (in fact, we should have created a compression subtask and marked it as resolved).
As soon as this issue https://tracker.ceph.com/issues/24571 is resolved, we can get rid of audit_log and clog.
Updated by Ernesto Puerta over 5 years ago
Zack Cerza wrote:
Now that we have some compression at least, I'm wondering if a good next step could be reducing the amount of data returned by /dashboard/health to only what HealthComponent requires. This could be done by modifying the /dashboard/health endpoint directly, or by creating another, similarly-named endpoint that provides the whittled-down data.
I've been experimenting with this on a vstart cluster, and without fully optimizing, my results are just under 40% of the size of the current response. This is of course with just 3 OSDs and one FS. The vast majority of what remains is audit_log and clog.
The /summary endpoint seems a good place for minimalistic cooked data, while /health is just the opposite: a raw dump of all Ceph maps and metadata. So my vote goes in favor of moving as much as possible to /summary, so that the Landing page no longer needs to ping /health.
Another move toward reducing data size would be adding pagination to some (potentially) heavyweight resources (OSDs, RBD images/snapshots, config options, etc.). That would require more work on both the back end and the front end, but would be more aligned with complete RESTful API semantics (filtering, sorting, etc.).
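The pagination idea can be sketched as a simple offset/limit envelope. The shape below (total/offset/limit/data, query parameters implied) is an assumption for illustration, not the dashboard's actual API:

```python
# Hedged sketch of REST pagination for heavyweight list resources
# (OSDs, RBD images, config options, ...). The envelope shape is
# hypothetical.
def paginate(items, offset=0, limit=25):
    """Return one page of `items` plus enough metadata to request the rest."""
    return {
        'total': len(items),
        'offset': offset,
        'limit': limit,
        'data': items[offset:offset + limit],
    }
```

A 1000-OSD cluster would then transfer 25 entries per request instead of all 1000, and the client can still render "page X of Y" from `total`.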
Updated by Zack Cerza over 5 years ago
I've got a WIP that can produce a minimal version of what /dashboard/health currently provides.
On my cluster, before the changes, we were transmitting 32KB (6KB compressed) every 5s. With my WIP, we're transmitting 11KB (1.4KB compressed). Once we can strip out the log data, we'll be down to 1KB (536B compressed)!
It's not ready for review yet, and I'm sure it will need updates to tests.
Also, I agree re: subtasks, but didn't have the time today to get to creating them.
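Measurements like the ones above can be reproduced by serializing a payload and gzipping it. A quick sketch (numbers will differ per cluster; this only shows the measurement technique):

```python
import gzip
import json

def payload_sizes(obj):
    """Return (raw_bytes, gzipped_bytes) for a JSON-serializable object."""
    raw = json.dumps(obj).encode('utf-8')
    return len(raw), len(gzip.compress(raw))

# Repetitive JSON (like per-OSD entries) compresses very well, which is
# why on-the-wire compression helps so much here.
raw_bytes, gz_bytes = payload_sizes([{'up': 1, 'in': 1, 'state': 'active'}] * 500)
```

This also makes it easy to compare candidate optimizations (field whitelisting, log stripping) before touching the endpoint code.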
Updated by Lenz Grimmer almost 5 years ago
- Tags set to performance
- Tags deleted (scalability)
Updated by Ernesto Puerta about 4 years ago
- Related to Feature #40907: mgr/dashboard: REST API improvements added
Updated by Ernesto Puerta over 2 years ago
- Target version changed from v15.0.0 to v17.0.0
- Backport set to pacific
Updated by Alfonso Martínez over 2 years ago
- Pull request ID changed from 43771 to 44120
Updated by Ernesto Puerta over 1 year ago
- Status changed from In Progress to New
- Assignee deleted (Ernesto Puerta)
- Parent task deleted (#36451)
- Backport changed from pacific to pacific quincy
Updated by Ernesto Puerta over 1 year ago
- Parent task set to #36451
- Pull request ID deleted (44120)