Feature #61778

mgr/mds_partitioner: add MDS partitioner module in MGR

Added by Yongseok Oh 8 months ago. Updated 7 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS):
MDS
Labels (FS):
multimds
Pull request ID:

Description

This idea is based on our presentation at Cephalocon 2023. (Please refer to the slides: https://static.sched.com/hosted_files/ceph2023/d9/Cephalocon2023_LINE_Optimizing_CephFS.pdf)
We presented our in-house partitioner, written in Python, along with bal_rank_mask. Specifically, we employ a combined approach that selectively uses dynamic partitioning to handle heavy workloads (e.g., huge files and large working set sizes) and static partitioning to process light/moderate workloads. We also distribute subdirs based on workload characteristics. Compared to typical static pinning, we can balance metadata workloads while minimizing metadata movement across MDSs. Unfortunately, our in-house partitioner is unavailable as open source because it is optimized for our environment, so it needs to be revised and reimplemented as an MGR module for the Ceph community.

Here is a summary of our mds_partitioner module.

Enable mds_partitioner
$ ceph mgr module enable mds_partitioner

Analyze client workloads obtained from MDSs
$ ceph mds_partitioner analyze start

Report analysis results and recommend the optimal number of MDSs
$ ceph mds_partitioner analyze status

Start partitioning
$ ceph mds_partitioner partition start

Report partitioning status
$ ceph mds_partitioner partition status

The partitioner module is enabled through `ceph mgr module enable mds_partitioner`. Executing `ceph mds_partitioner analyze start` begins analyzing how to distribute subdirs across multiple MDSs according to their workloads. To calculate the optimal distribution, metrics such as perf counters, rentries, and wss are obtained from the MDS balancer; a bin packing algorithm then determines the MDS placement of each subdir. (The wss tracker will be implemented in the future as needed.) Through `ceph mds_partitioner analyze status`, we can confirm the analysis results and how subdirs will be distributed across MDSs. The actual partitioning is then executed through `ceph mds_partitioner partition start`. To move subdirs, the ceph.dir.pin and ceph.dir.bal.mask vxattrs are employed; ceph.dir.bal.mask still needs to be implemented (see tracker https://tracker.ceph.com/issues/61777). Finally, you can check the partitioning progress using `ceph mds_partitioner partition status`.
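The placement step described above can be sketched as a greedy bin-packing pass: sort subdirs by load and always assign the heaviest remaining subdir to the currently least loaded rank. This is an illustrative sketch only; the function name, input shape, and load values are assumptions, not the module's actual API.

```python
def place_subdirs(subdir_loads, num_ranks):
    """Greedy first-fit-decreasing placement: the heaviest subdir goes
    to the least loaded rank. Returns a {subdir: rank} mapping."""
    rank_load = [0.0] * num_ranks
    placement = {}
    # Visit subdirs heaviest-first so large items are spread out early.
    for subdir, load in sorted(subdir_loads.items(), key=lambda kv: -kv[1]):
        target = min(range(num_ranks), key=lambda r: rank_load[r])
        placement[subdir] = target
        rank_load[target] += load
    return placement

# Hypothetical example: four subvolume roots packed onto 2 ranks.
plan = place_subdirs({"/volumes/a": 120.0, "/volumes/b": 80.0,
                      "/volumes/c": 60.0, "/volumes/d": 40.0}, 2)
```

First-fit-decreasing is a classic heuristic that stays close to a balanced split while remaining cheap enough to rerun on every analyze pass.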

Please refer to additional slides for detailed information. https://github.com/yongseokoh/presentation/blob/main/A_New_Parititioning_for_CephFS.pdf


Subtasks

Bug #62158: mds: quick suspend or abort metadata migration (New)


Related issues

Related to CephFS - Tasks #62159: qa: evaluate mds_partitioner (In Progress)
Related to CephFS - Feature #62157: mds: working set size tracker (In Progress)

History

#1 Updated by Venky Shankar 8 months ago

Thanks for the feature proposal. CephFS team will go through the proposal asap.

#2 Updated by Venky Shankar 8 months ago

Venky Shankar wrote:

Thanks for the feature proposal. CephFS team will go through the proposal asap.

I'm going through the proposal today and tomorrow. Will update here.

#3 Updated by Venky Shankar 8 months ago

Hi Yongseok,

Yongseok Oh wrote:

This idea is based on our presentation in Cephalocon2023. (Please refer to the slides in https://static.sched.com/hosted_files/ceph2023/d9/Cephalocon2023_LINE_Optimizing_CephFS.pdf.)

This link is unavailable (at least for me). But that's fine - your presentation is available on the Cephalocon website.

We presented our in-house partitioner, written in Python, along with bal_rank_mask. Specifically, we employ a combined approach that selectively uses dynamic partitioning to handle heavy workloads (e.g., huge files and large working set sizes) and static partitioning to process light/moderate workloads. We also distribute subdirs based on workload characteristics. Compared to typical static pinning, we can balance metadata workloads while minimizing metadata movement across MDSs. Unfortunately, our in-house partitioner is unavailable as open source because it is optimized for our environment, so it needs to be revised and reimplemented as an MGR module for the Ceph community.

Here is a summary of our mds_partitioner module.

Enable mds_partitioner
$ ceph mgr module enable mds_partitioner

Analyze client workloads obtained from MDSs
$ ceph mds_partitioner analyze start

So this step will gather pieces of information from the MDS namely perf stats, number of files+dirs, etc., right? Sorry, but what's "wss"? I didn't find anything sounding/abbreviating like that under src/mds.

Also, the "bin packing algorithm" is the one detailed here, I guess?

https://en.wikipedia.org/wiki/Bin_packing_problem

Report analysis results and recommend the optimal number of MDSs
$ ceph mds_partitioner analyze status

Is there an option to override/tune the recommendation at this point, since this essentially recommends the number of active ranks and the distribution strategy? I assume this step details which directories should be statically pinned and which ones should use dynamic partitioning based on the bal rank mask.

Start partitioning
$ ceph mds_partitioner partition start

Report partitioning status
$ ceph mds_partitioner partition status

The partitioner module is enabled through `ceph mgr module enable mds_partitioner`. Executing `ceph mds_partitioner analyze start` begins analyzing how to distribute subdirs across multiple MDSs according to their workloads. To calculate the optimal distribution, metrics such as perf counters, rentries, and wss are obtained from the MDS balancer; a bin packing algorithm then determines the MDS placement of each subdir. (The wss tracker will be implemented in the future as needed.) Through `ceph mds_partitioner analyze status`, we can confirm the analysis results and how subdirs will be distributed across MDSs. The actual partitioning is then executed through `ceph mds_partitioner partition start`. To move subdirs, the ceph.dir.pin and ceph.dir.bal.mask vxattrs are employed; ceph.dir.bal.mask still needs to be implemented (see tracker https://tracker.ceph.com/issues/61777). Finally, you can check the partitioning progress using `ceph mds_partitioner partition status`.

I haven't seen your PR yet (https://github.com/ceph/ceph/pull/52373), but, I guess directories which have this xattr explicitly set override the config set in the mdsmap. Also, do subdirs inherit this xattr from the parent? I'll have a look at the change, but we can start discussing here.

Please refer to additional slides for detailed information. https://github.com/yongseokoh/presentation/blob/main/A_New_Parititioning_for_CephFS.pdf

Overall, I like the idea +1 :)

#4 Updated by Venky Shankar 8 months ago

Another suggestion/feedback - Should the module also persist (say) the last 10 partitioning strategies? I presume when this feature is put to use, users are going to analyze and re-partition the MDSs multiple times over time, and if and when we get reports on possible sub-optimal partitioning strategy or performance degradation due to balancer misbehaving, this (persisted) information might just be useful.

What do you think?

#5 Updated by Yongseok Oh 8 months ago

Hi Venky,

Venky Shankar wrote:

Hi Yongseok,

Yongseok Oh wrote:

This idea is based on our presentation in Cephalocon2023. (Please refer to the slides in https://static.sched.com/hosted_files/ceph2023/d9/Cephalocon2023_LINE_Optimizing_CephFS.pdf .)

This link is unavailable (at least for me). But that's fine - your presentation is available on the Cephalocon website.

Sorry for the inconvenience. You can access it by removing the trailing period from the link: https://static.sched.com/hosted_files/ceph2023/d9/Cephalocon2023_LINE_Optimizing_CephFS.pdf

We presented our in-house partitioner, written in Python, along with bal_rank_mask. Specifically, we employ a combined approach that selectively uses dynamic partitioning to handle heavy workloads (e.g., huge files and large working set sizes) and static partitioning to process light/moderate workloads. We also distribute subdirs based on workload characteristics. Compared to typical static pinning, we can balance metadata workloads while minimizing metadata movement across MDSs. Unfortunately, our in-house partitioner is unavailable as open source because it is optimized for our environment, so it needs to be revised and reimplemented as an MGR module for the Ceph community.

Here is a summary of our mds_partitioner module.

Enable mds_partitioner
$ ceph mgr module enable mds_partitioner

Analyze client workloads obtained from MDSs
$ ceph mds_partitioner analyze start

So this step will gather pieces of information from the MDS namely perf stats, number of files+dirs, etc., right? Sorry, but what's "wss"? I didn't find anything sounding/abbreviating like that under src/mds.

The load metric is calculated by collecting the request count, the dirs + files count, and the wss (working set size, https://en.wikipedia.org/wiki/Working_set_size) for each subdir. wss is not currently provided by the MDS, so it will need to be implemented in the future. Sometimes a subdir holds many actual dirs + files but only some of them are accessed, and vice versa, so it is good to consider wss as well as the number of files + dirs. If the wss is large, it consumes a lot of MDS cache and causes misses, resulting in performance degradation. Therefore, multiple factors such as performance, dirs + files, and wss must be considered. I would like to develop various policies and let users choose among them. Additionally, since collecting metrics for all directories is difficult, it would be nice to have an option to collect only the roots of subvolumes. In the case of OpenStack Manila, shares are allocated in the form /volumes/_nogroup/$subvol. How about collecting metrics in subvolume (or share) units and distributing to MDSs at that granularity?
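As a rough illustration of combining those per-subdir metrics into a single load score, one could take a weighted sum of normalized values. The weights, normalization constants, and field names below are purely hypothetical, not the module's actual policy.

```python
def load_score(stats, w_req=0.5, w_entries=0.25, w_wss=0.25):
    """Weighted sum of the three metrics discussed above: request count,
    dirs+files count (rentries), and working set size in bytes.
    Each metric is divided by a nominal capacity so the terms are
    dimensionless and comparable; the capacities here are made up."""
    return (w_req * stats["requests"] / 1000.0
            + w_entries * stats["rentries"] / 100000.0
            + w_wss * stats["wss_bytes"] / float(1 << 30))

# A subdir at twice each nominal capacity scores 2.0.
score = load_score({"requests": 2000, "rentries": 200000,
                    "wss_bytes": 2 << 30})
```

Making the weights configurable would be one way to expose the "various policies" mentioned above, e.g. a wss-heavy policy for cache-bound clusters.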

Also, the "bin packing algorithm" is the one detailed here, I guess?

https://en.wikipedia.org/wiki/Bin_packing_problem

This meant developing an allocation policy based on the current MDBalancer's load balancing. For example, suppose there are 2 MDSs, the average load is 100, and rank 0 is handling a load of 120. In that case, find a subdir on rank 0 carrying a load of about 20 and send it to rank 1.
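The example above (2 MDSs, average load 100, rank 0 at 120) amounts to picking the subdir whose load best matches the ~20-load excess. A minimal sketch, with hypothetical subdir names and loads:

```python
def pick_migration(subdir_loads, excess):
    """Return the (subdir, load) pair on the overloaded rank whose load
    is closest to the excess we want to shed."""
    return min(subdir_loads.items(), key=lambda kv: abs(kv[1] - excess))

# Subdirs currently pinned to the overloaded rank 0 (illustrative values).
loads = {"/volumes/x": 70.0, "/volumes/y": 28.0, "/volumes/z": 22.0}
victim, load = pick_migration(loads, excess=20.0)  # picks "/volumes/z"
```

Choosing the closest-to-excess subdir (rather than simply the largest) keeps the post-migration loads near the cluster average without overshooting.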

Report analysis results and recommend the optimal number of MDSs
$ ceph mds_partitioner analyze status

Is there an option to override/tune the recommendation at this point, since this essentially recommends the number of active ranks and the distribution strategy? I assume this step details which directories should be statically pinned and which ones should use dynamic partitioning based on the bal rank mask.

That's a very good point. It is better to allow the user to change some options before starting the actual partitioning. For example, suppose a strategy is created to move the directory /volumes/_nogroup/share_1 from rank1 to rank2, but we want to leave the subdir at rank1 for performance reasons rather than moving it. It also seems that some administrators may want to manually assign ranks to specific subdirs and set static/dynamic policies themselves. Let's see if there are any other use cases.

Start partitioning
$ ceph mds_partitioner partition start

Report partitioning status
$ ceph mds_partitioner partition status

The partitioner module is enabled through `ceph mgr module enable mds_partitioner`. Executing `ceph mds_partitioner analyze start` begins analyzing how to distribute subdirs across multiple MDSs according to their workloads. To calculate the optimal distribution, metrics such as perf counters, rentries, and wss are obtained from the MDS balancer; a bin packing algorithm then determines the MDS placement of each subdir. (The wss tracker will be implemented in the future as needed.) Through `ceph mds_partitioner analyze status`, we can confirm the analysis results and how subdirs will be distributed across MDSs. The actual partitioning is then executed through `ceph mds_partitioner partition start`. To move subdirs, the ceph.dir.pin and ceph.dir.bal.mask vxattrs are employed; ceph.dir.bal.mask still needs to be implemented (see tracker https://tracker.ceph.com/issues/61777). Finally, you can check the partitioning progress using `ceph mds_partitioner partition status`.

I haven't seen your PR yet (https://github.com/ceph/ceph/pull/52373), but, I guess directories which have this xattr explicitly set override the config set in the mdsmap. Also, do subdirs inherit this xattr from the parent? I'll have a look at the change, but we can start discussing here.

Yes, that's right. This PR is an extended version of bal_rank_mask, moving it from the existing mdsmap config to a per-directory vxattr. The xattr value is inherited from the parent: if the xattr of '/' is 0xf, the root directory is dynamically distributed across ranks 0 to 3. This PR is an early version and can be further optimized through reviews.
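For illustration, the mask-to-ranks mapping described above (bit i set means rank i participates in dynamic balancing) can be sketched as:

```python
def mask_to_ranks(mask):
    """Expand a bal_rank_mask bitmask into the list of MDS ranks it
    covers; e.g. 0xf covers ranks 0..3, as in the example above."""
    return [i for i in range(mask.bit_length()) if mask & (1 << i)]

ranks = mask_to_ranks(0xf)  # [0, 1, 2, 3]
```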

Please refer to additional slides for detailed information. https://github.com/yongseokoh/presentation/blob/main/A_New_Parititioning_for_CephFS.pdf

Overall, I like the idea +1 :)

#6 Updated by Yongseok Oh 8 months ago

Venky Shankar wrote:

Another suggestion/feedback - Should the module also persist (say) the last 10 partitioning strategies? I presume when this feature is put to use, users are going to analyze and re-partition the MDSs multiple times over time, and if and when we get reports on possible sub-optimal partitioning strategy or performance degradation due to balancer misbehaving, this (persisted) information might just be useful.

What do you think?

I think it's a very helpful and useful feature. Supporting a function to save the last N entries of history seems like it would be of great help to users for performance optimization and debugging.
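A minimal sketch of keeping the last N partitioning strategies, assuming an in-memory deque stands in for whatever persistence the MGR module would actually use (e.g., the module's key/value store); the class and field names are hypothetical:

```python
import json
import time
from collections import deque


class PlanHistory:
    """Keep the most recent N partitioning plans for later debugging."""

    def __init__(self, keep=10):
        # maxlen makes the deque drop the oldest plan automatically.
        self.plans = deque(maxlen=keep)

    def record(self, plan):
        self.plans.append({"ts": time.time(), "plan": plan})

    def dump(self):
        # Serialized form suitable for persisting in a KV store.
        return json.dumps(list(self.plans))


history = PlanHistory(keep=10)
for epoch in range(12):           # 12 analyze/partition cycles...
    history.record({"epoch": epoch})
# ...but only the last 10 plans are retained.
```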

#7 Updated by Venky Shankar 7 months ago

Just FYI - https://github.com/ceph/ceph/pull/52196 disables the balancer by default since it has been a source of performance issues lately (due to the inefficiencies we already know of). We should make sure the balancer is enabled when using the partitioner module.

#8 Updated by Venky Shankar 6 months ago

  • Related to Tasks #62159: qa: evaluate mds_partitioner added

#9 Updated by Venky Shankar 6 months ago
