Bug #66055
mgr/dashboard: rbd image listing scalability
Description
When accessing the `block/rbd` page and listing the images, if the RBD images are large (e.g. 25 TiB, 50 TiB, 100 TiB...), the load times to retrieve them will be significant. On a huge cluster with multiple such images, the listing might not complete at all.
Updated by Pedro González Gómez about 1 month ago
- Priority changed from Normal to High
Updated by Ernesto Puerta about 1 month ago
If I'm not wrong, this is almost impossible to fix from the Dashboard side. The info we provide for each individual RBD image is much, much more comprehensive than what `rbd info` or any other RBD CLI command does alone. Among the many things we try to provide is the real storage usage of each RBD image. To do that, we directly invoke a 'low-level' RBD call, `diff_iterate()` (in `services/rbd.py`, `_rbd_disk_usage`), which basically has to go through all the objects that make up an RBD image. On large images this means lots of objects to iterate through.
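The accounting that `_rbd_disk_usage` performs can be sketched as follows. This is an illustrative reconstruction, not the Dashboard's actual code: `image_disk_usage` is our name, and the only librbd API assumed is `rbd.Image.diff_iterate()` (plus `size()`), which takes a callback invoked once per extent:

```python
def image_disk_usage(image, whole_object=True):
    """Sum the allocated extents reported by librbd's diff_iterate.

    `image` is an opened rbd.Image (or any object with the same
    size()/diff_iterate() interface). With whole_object=True and the
    fast-diff feature enabled on the image, librbd can answer from the
    object map instead of reading every object.
    """
    used = {"bytes": 0}

    def cb(offset, length, exists):
        # `exists` is truthy for allocated extents; sum only those.
        if exists:
            used["bytes"] += length

    # Diffing from no snapshot (from_snapshot=None) over the whole
    # provisioned size yields the total allocated bytes of the image.
    image.diff_iterate(0, image.size(), None, cb, whole_object=whole_object)
    return used["bytes"]
```

On a 100 TiB image the callback may fire for a very large number of extents, which is why the per-image cost grows with size even when no data is transferred.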
There's an RBD CLI command, `rbd du`, that provides both the provisioned and the "used" storage. Unfortunately, that command is not available from the Python librbd binding. Additionally, the help of that command shows an `--exact` option with the caveat "slow":
```
# rbd help du
usage: rbd du [--pool <pool>] [--namespace <namespace>] [--image <image>]
              [--snap <snap>] [--format <format>] [--pretty-format]
              [--from-snap <from-snap>] [--exact] [--merge-snapshots]
              <image-or-snap-spec>

Show disk usage stats for pool, image or snapshot.

Positional arguments
  <image-or-snap-spec>  image or snapshot specification
                        (example: [<pool-name>/[<namespace>/]]<image-name>[@<snap-name>])

Optional arguments
  -p [ --pool ] arg     pool name
  --namespace arg       namespace name
  --image arg           image name
  --snap arg            snapshot name
  --format arg          output format (plain, json, or xml) [default: plain]
  --pretty-format       pretty formatting (json and xml)
  --from-snap arg       snapshot starting point
  --exact               compute exact disk usage (slow)   <<<<<<<<<<<<<<<<<<<<
  --merge-snapshots     merge snapshot sizes with its image
```
It'd be great if Dashboard could have access to that "fast" usage from the Python librbd module. The "slow" perhaps can be calculated & displayed on-demand, in the "details" section of each RBD image.
As a reference these are the sample outputs of fast and exact disk usages:
```
# rbd du
NAME                                                                PROVISIONED  USED
demo_datastore                                                            1 GiB  136 MiB
demo_datastore_clone@demo_datastore_clone_30042024142554                  1 GiB  136 MiB
demo_datastore_clone                                                      1 GiB   92 MiB
demo_datastore_thin_clone@demo_datastore_thin_clone_30042024142606        1 GiB  124 MiB
demo_datastore_thin_clone                                               770 GiB  140 MiB
epuertat-2024-04-29@new_one                                               1 GiB  140 MiB
epuertat-2024-04-29                                                       1 GiB   32 MiB
epuertat-2024-04-29_240429201629                                          1 GiB  136 MiB
epuertat-2024-04-29_240429201934                                          1 GiB  124 MiB
test@test_snapshot_name                                                   1 GiB  140 MiB
test                                                                      1 GiB   68 MiB
test2                                                                     2 GiB      0 B
<TOTAL>                                                                 778 GiB  1.2 GiB
```
```
# rbd du --exact
NAME                                                                PROVISIONED  USED
demo_datastore                                                            1 GiB   99 MiB
demo_datastore_clone@demo_datastore_clone_30042024142554                  1 GiB  136 MiB
demo_datastore_clone                                                      1 GiB   17 MiB
demo_datastore_thin_clone@demo_datastore_thin_clone_30042024142606        1 GiB   93 MiB
demo_datastore_thin_clone                                               770 GiB   44 MiB
epuertat-2024-04-29@new_one                                               1 GiB  103 MiB
epuertat-2024-04-29                                                       1 GiB   43 KiB
epuertat-2024-04-29_240429201629                                          1 GiB  100 MiB
epuertat-2024-04-29_240429201934                                          1 GiB  124 MiB
test@test_snapshot_name                                                   1 GiB  103 MiB
test                                                                      1 GiB   10 MiB
test2                                                                     2 GiB      0 B
<TOTAL>                                                                 778 GiB  828 MiB
```
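Since `rbd du` has no librbd binding, one conceivable stopgap is shelling out to the CLI with `--format json`. This is a hypothetical workaround sketch, not what the Dashboard does; the helper names are ours, and the JSON field names (`images`, `provisioned_size`, `used_size`) are an assumption based on recent `rbd du` output and should be verified against the target Ceph release:

```python
import json
import subprocess

def rbd_du(pool, exact=False):
    """Run `rbd du` for a pool and parse its JSON output.

    Requires the `rbd` binary and cluster access. `--exact` switches to
    the slow, object-scanning mode shown above.
    """
    cmd = ["rbd", "du", "--pool", pool, "--format", "json"]
    if exact:
        cmd.append("--exact")
    proc = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return parse_rbd_du(proc.stdout)

def parse_rbd_du(raw):
    """Reduce the JSON output to {image_or_snap_name: (provisioned, used)}.

    Field names assumed; adjust if the installed rbd version differs.
    """
    data = json.loads(raw)
    return {img["name"]: (img["provisioned_size"], img["used_size"])
            for img in data.get("images", [])}
```

Invoking an external binary per listing has its own scalability and packaging costs, which is presumably why exposing the "fast" usage through the Python binding is the preferable fix.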
Updated by Ernesto Puerta about 1 month ago
@Pedro González Gómez : we need help from RBD team here (CC: @Ilya Dryomov).
Updated by Ilya Dryomov about 1 month ago
Ernesto Puerta wrote in #note-2:
It'd be great if Dashboard could have access to that "fast" usage from the Python librbd module. The "slow" perhaps can be calculated & displayed on-demand, in the "details" section of each RBD image.
As long as `whole_object=True` is passed to `diff_iterate()`, which seems to be the case here (the `_rbd_disk_usage()` parameter both defaults to True and the only caller passes True explicitly), you opt into the "fast" mode. The other condition for the "fast" mode to activate (the mode is chosen on a per-image basis) is that the image has the fast-diff feature enabled. fast-diff should be enabled by default, so the Dashboard should be all set. Try passing `whole_object=False` and see how long it takes in comparison.
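The per-image condition can be checked against the feature mask an image reports (in librbd Python this comes from `rbd.Image.features()`). A small sketch, with the helper name ours and the bit values taken from Ceph's `include/rbd/features.h`:

```python
# librbd feature bits (values from Ceph's include/rbd/features.h)
RBD_FEATURE_EXCLUSIVE_LOCK = 1 << 2  # 4
RBD_FEATURE_OBJECT_MAP = 1 << 3      # 8
RBD_FEATURE_FAST_DIFF = 1 << 4       # 16

def fast_mode_available(features):
    """True if an image's feature mask allows the "fast" diff path.

    fast-diff depends on object-map (and object-map on exclusive-lock),
    so images created with default features normally have all three set.
    """
    needed = RBD_FEATURE_OBJECT_MAP | RBD_FEATURE_FAST_DIFF
    return (features & needed) == needed
```

Images migrated from old clusters or created with a trimmed feature set may lack fast-diff, in which case those particular images fall back to the slow, per-object scan regardless of `whole_object`.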
Updated by Ilya Dryomov about 1 month ago
Pedro González Gómez wrote:
When accessing the `block/rbd` page and listing the images, if the RBD images are large (e.g. 25 TiB, 50 TiB, 100 TiB...), the load times to retrieve them will be significant. On a huge cluster with multiple such images, the listing might not complete at all.
It might take a considerable amount of time, but with "fast" mode there won't be any load on the cluster. I would expect it to bottleneck on CPU on the client side. Check the thread that calls diff_iterate() in the Dashboard -- it should be spinning at 100% with no network activity in the process except briefly when moving to the next image on the list.
Updated by Zack Perry 13 days ago
Good Morning all,
Appreciate the work being done on this! I was speaking with Pedro regarding this issue on two systems I'm dealing with, and I wanted to check in to see if there were any updates or potential next steps.
Thank you again!