Bug #36453
open
Tasks #36451: mgr/dashboard: Scalability testing
mgr/dashboard: Some REST endpoints grow linearly with OSD count
Added by Ernesto Puerta over 5 years ago. Updated over 1 year ago.
Description
Endpoints providing information on OSD show linear size growth with OSD count.
/health grows 1-2 kB/OSD. It embeds all the relevant Ceph maps (mgr_map, fs_map, osd_map, mon_map). The landing page is the main consumer of this endpoint, but it only needs osd_map to print "X total OSD (Y up, Z in)". Solution: calculate all the info the landing page needs in a backend controller (/summary?).
/osd grows 1-2 kB/OSD.
That would mean payloads of roughly 1-2 MB every 5 seconds per dashboard instance for a 1000-OSD deployment. The resulting size can vary widely, since the payload is plain-text JSON with many variable-length strings and numbers.
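For the upper bound, the arithmetic works out as follows (a quick sketch; the constants are the figures quoted above, not measured values):

```python
# Back-of-the-envelope check of the numbers above (assumed figures:
# 2 kB per OSD as the upper bound, 1000 OSDs, one poll every 5 s).
PER_OSD_KB = 2
OSD_COUNT = 1000
POLL_INTERVAL_S = 5

payload_kb = PER_OSD_KB * OSD_COUNT                       # ~2000 kB (~2 MB) per poll
bandwidth_mbit = payload_kb * 8 / 1000 / POLL_INTERVAL_S  # sustained rate in Mbit/s

print(f"payload per poll: {payload_kb} kB")
print(f"sustained rate per dashboard instance: {bandwidth_mbit:.1f} Mbit/s")
```

So a single open dashboard tab on a 1000-OSD cluster could consume a few Mbit/s just polling this endpoint.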
The above might impact the following:
- Networking: especially over wireless links.
- Server-side: caching either Ceph-mgr results or endpoint payloads could improve performance.
- Solution: cache ceph-mgr responses.
- Solution: use HTTP cache-control headers (single user, multiple requests).
- Solution: cache REST payloads internally (multiple users).
- Client-side: user experience may be negatively affected by parsing and processing large chunks of JSON.
- Solution: lightweight data exchange formats (BSON, MessagePack).
- Solution: delta JSON (PATCH-like).
- Solution: more specialized REST resources (instead of generalist ones, like /health).
- Solution: REST API pagination support.
- Solution: REST API field selector/filtering support.
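As an illustration of the last idea, a field selector can be a one-liner on the backend. This is a hypothetical helper, not the actual dashboard API; the `fields` query-parameter name and semantics are assumptions:

```python
def select_fields(obj, fields):
    """Keep only the requested top-level keys of a response dict.

    Hypothetical sketch of a ?fields=osd_map,mon_map style selector;
    the parameter name is an assumption, not the real dashboard API.
    """
    wanted = set(fields.split(','))
    return {k: v for k, v in obj.items() if k in wanted}

# A landing page that only needs the OSD map would request fields=osd_map:
full = {'osd_map': {'epoch': 42}, 'fs_map': {}, 'mon_map': {}, 'clog': []}
slim = select_fields(full, 'osd_map')
print(slim)  # only the osd_map key survives
```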
Related issues: 1 (1 open, 0 closed)
- Description updated (diff)
- Subject changed from Some REST endpoints grow linearly with OSD count to mgr/dashboard: Some REST endpoints grow linearly with OSD count
- Description updated (diff)
- % Done changed from 0 to 10
- Status changed from New to In Progress
Now that we have some compression at least, I'm wondering if a good next step could be reducing the amount of data returned by /dashboard/health to only what HealthComponent requires. This could be done by modifying the /dashboard/health endpoint directly, or by creating another, similarly-named endpoint that provides the whittled-down data.
I've been experimenting with this on a vstart cluster, and without fully optimizing, my results are just under 40% of the size of the current response. This is of course with just 3 OSDs and one FS. The vast majority of what remains is audit_log and clog.
Zack Cerza wrote:
Now that we have some compression at least, I'm wondering if a good next step could be reducing the amount of data returned by /dashboard/health to only what HealthComponent requires. [...]
I agree: the endpoint should return only summary data (e.g. an "OSD summary").
My suggestion is to create a subtask for each optimization step (in fact, we should have created a compression subtask and marked it as resolved).
As soon as issue https://tracker.ceph.com/issues/24571 is resolved, we can get rid of audit_log and clog.
Zack Cerza wrote:
Now that we have some compression at least, I'm wondering if a good next step could be reducing the amount of data returned by /dashboard/health to only what HealthComponent requires. [...]
The /summary endpoint seems a good place for minimal, cooked data, while /health is just the opposite: a raw dump of all Ceph maps and metadata. So my vote goes to moving as much as possible to /summary, so that the landing page no longer needs to ping /health.
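For the landing page's "X total OSD (Y up, Z in)" line, the cooked /summary data could be as small as three integers. A minimal sketch, assuming the osd_map carries per-OSD dicts with 0/1 'up'/'in' flags as in `ceph osd dump` (field names may differ in the real mgr module):

```python
def osd_summary(osd_map):
    """Reduce a full osd_map to the three counts the landing page shows.

    Sketch only: assumes osd_map['osds'] is a list of per-OSD dicts
    with integer 'up'/'in' flags; not the actual mgr module schema.
    """
    osds = osd_map.get('osds', [])
    return {
        'total': len(osds),
        'up': sum(1 for o in osds if o.get('up')),
        'in': sum(1 for o in osds if o.get('in')),
    }

# Three OSDs: two up, one down but still "in".
demo = {'osds': [{'up': 1, 'in': 1}, {'up': 1, 'in': 1}, {'up': 0, 'in': 1}]}
print(osd_summary(demo))  # {'total': 3, 'up': 2, 'in': 3}
```

The response size is then constant regardless of OSD count, instead of growing 1-2 kB per OSD.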
Another way to reduce data size would be adding pagination to some (potentially) heavyweight resources (OSDs, RBD images/snapshots, config options, etc.). That would require more work on both the back end and the front end, but would be better aligned with full RESTful API semantics (filtering, sorting, etc.).
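A classic offset/limit scheme would be the minimum viable version of that; the parameter names below are illustrative, not a proposal for the final API:

```python
def paginate(items, offset=0, limit=50):
    """Return one page of a listing plus the total count, so the
    front end can render paging controls without fetching everything.

    Hypothetical sketch of ?offset=N&limit=M style query parameters.
    """
    return {
        'total': len(items),
        'offset': offset,
        'items': items[offset:offset + limit],
    }

# e.g. the third page of a 1000-OSD listing, 50 per page:
page = paginate(list(range(1000)), offset=100, limit=50)
print(page['total'], page['items'][0], page['items'][-1])  # 1000 100 149
```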
I've got a WIP that can produce a minimal version of what /dashboard/health currently provides.
On my cluster, before the changes, we were transmitting 32 KB (6 KB compressed) every 5 s. With my WIP, we're transmitting 11 KB (1.4 KB compressed). Once we can strip out the log data, we'll be down to 1 KB (536 B compressed)!
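For anyone wanting to reproduce this kind of measurement, raw vs. gzip-compressed sizes of a JSON payload can be compared with the standard library alone (an illustrative helper, not the dashboard's actual compression path):

```python
import gzip
import json

def payload_sizes(payload):
    """Return (raw_bytes, gzipped_bytes) for a JSON-serializable payload."""
    raw = json.dumps(payload).encode('utf-8')
    return len(raw), len(gzip.compress(raw))

# Repetitive log-like data compresses very well, which is why the
# compressed figures above are so much smaller than the raw ones.
raw, packed = payload_sizes({'clog': ['cluster log line'] * 500})
print(raw, packed)
```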
It's not ready for review yet, and I'm sure it will need updates to tests.
Also, I agree re: subtasks, but didn't have the time today to get to creating them.
- Description updated (diff)
- Assignee set to Ernesto Puerta
- Target version set to v15.0.0
- Tags set to performance
- Tags deleted (scalability)
- Pull request ID set to 43771
- Target version changed from v15.0.0 to v17.0.0
- Backport set to pacific
- Pull request ID changed from 43771 to 44120
- Status changed from In Progress to New
- Assignee deleted (Ernesto Puerta)
- Parent task deleted (#36451)
- Backport changed from pacific to pacific quincy
- Parent task set to #36451
- Pull request ID deleted (44120)