Feature #41302

mds: add ephemeral random and distributed export pins

Added by Patrick Donnelly about 1 month ago. Updated about 1 month ago.

Status: New
Priority: Urgent
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Source: Development
Tags:
Backport: nautilus
Reviewed:
Affected Versions:
Component(FS): MDS
Labels (FS): multimds
Pull request ID:

Description

Background: export pins [1] are an effective way to distribute metadata load for large workloads without the metadata balancer interfering. We have found that you can achieve expected linear metadata throughput scaling (with MDS count). However, teaching users or applications how to effectively apply export pins is not always possible.

So, let's make it possible for the MDS to apply a pinning strategy to subtrees. The pinning does not need to be perfect; even a poor distribution is still generally better than single MDS performance. Caveat: some care must be taken to not create too many subtrees as this could degrade performance.

There are two parts to this ticket:

1. Create a new persistent (directory) inode field "export_ephemeral_distributed". This applies only to the directory and is not hierarchical. Any direct descendant directory (i.e. a child directory) has an ephemeral export pin applied to it according to a consistent hash [2] of the child directory inode number. This involves each directory knowing its immediate parent's "export_ephemeral_distributed" value. Any MDS rank can figure out where such a directory should be pinned by knowing the hash (modulo 360) and the number of ranks (which can be linearly distributed points on the circle).

2. Create a new persistent (directory) inode field "export_ephemeral_random". This is hierarchical like "export_pin". Any CDir (fragment!) loaded into the cache may be ephemerally pinned to a random rank. Like "export_ephemeral_distributed", the random rank is determined by a consistent hash. Notably, if another rank is added or removed, the ephemerally pinned subtrees should still be uniformly distributed across the ranks. A directory fragment (CDir) that is pinned in this way will remain pinned for as long as it is in the distributed MDS cache (i.e. some MDS has it in memory).
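
As an illustration of part 1, here is a minimal Python sketch of the mapping as literally described: the child directory's inode number is hashed to a degree on the circle (modulo 360) and the ranks split the circle into equal arcs. The function names and the use of SHA-256 are assumptions made for this example; they are not taken from the MDS code. Note that equal arcs shift whenever the rank count changes, so an actual implementation may need a different placement of rank points on the circle to keep subtree movement small.

```python
import hashlib


def circle_position(ino: int) -> int:
    """Hash an inode number to a degree on the circle, 0..359.

    SHA-256 is an assumption for illustration; any well-mixed hash would do.
    """
    digest = hashlib.sha256(ino.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:8], "little") % 360


def distributed_rank(ino: int, num_ranks: int) -> int:
    """Return the rank whose arc contains the inode's circle position.

    Ranks are treated as evenly spaced points, so rank r owns the arc
    [r * 360 / num_ranks, (r + 1) * 360 / num_ranks).
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    return circle_position(ino) * num_ranks // 360


if __name__ == "__main__":
    # Any rank can compute the same answer from the inode number alone.
    for ino in (0x10000000000, 0x10000000001, 0x10000000002):
        print(hex(ino), "->", distributed_rank(ino, 4))
```

Because the result depends only on the inode number and the rank count, no coordination between ranks is needed to agree on where a child directory belongs.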

Tests should verify

  • that "export_ephemeral_distributed" is approximately uniform and only applies to the direct descendants
  • that "export_ephemeral_random" is hierarchical
  • that changing max_mds redistributes approximately 1/N or less of the ephemerally pinned subtrees
  • that changing max_mds does not create or remove ephemerally pinned subtrees
  • that export_pin overrides an ephemeral pin on a parent directory
  • that ephemeral pins override a parent export_pin
  • that ephemeral pins can be disabled for a subtree by setting export_ephemeral_random=0.0
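
The max_mds redistribution check could be prototyped outside the MDS before writing the real test. The sketch below is an assumption-laden model, not the planned implementation: it uses a textbook hash ring with virtual nodes per rank (one common form of consistent hashing), SHA-256 as the hash, and hypothetical helper names. It measures what fraction of inodes change rank when the rank count grows.

```python
import bisect
import hashlib


def _hash64(label: str) -> int:
    """64-bit hash of a label; SHA-256 is an illustrative choice."""
    return int.from_bytes(hashlib.sha256(label.encode()).digest()[:8], "big")


def build_ring(num_ranks: int, vnodes: int = 64):
    """Place `vnodes` points per rank on the ring, sorted by position."""
    points = sorted(
        (_hash64(f"rank-{rank}-{v}"), rank)
        for rank in range(num_ranks)
        for v in range(vnodes)
    )
    positions = [pos for pos, _ in points]
    owners = [rank for _, rank in points]
    return positions, owners


def lookup(ring, ino: int) -> int:
    """Return the rank owning the first ring point after the inode's hash."""
    positions, owners = ring
    index = bisect.bisect_right(positions, _hash64(f"ino-{ino}"))
    return owners[index % len(positions)]  # wrap around the ring


def moved_fraction(inos, old_ranks: int, new_ranks: int) -> float:
    """Fraction of inodes whose rank differs between the two rank counts."""
    inos = list(inos)
    old_ring = build_ring(old_ranks)
    new_ring = build_ring(new_ranks)
    moved = sum(1 for ino in inos if lookup(old_ring, ino) != lookup(new_ring, ino))
    return moved / len(inos)


if __name__ == "__main__":
    print("moved when going from 4 to 5 ranks:",
          moved_fraction(range(20000), 4, 5))
```

With this ring layout, adding a rank only inserts new points, so an inode either keeps its rank or moves to the newly added one. That is the property the 1/N expectation relies on.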

Performance testing should

  • validate that large ephemeral pin changes (due to max_mds changes) do not destabilize the MDS cluster
  • identify any performance degradation caused by too many pinned subtrees

[1] https://docs.ceph.com/docs/master/cephfs/multimds/#manually-pinning-directory-trees-to-a-particular-rank
[2] https://en.wikipedia.org/wiki/Consistent_hashing


Related issues

Blocks fs - Bug #41541: mgr/volumes: ephemerally pin volumes New

History

#1 Updated by Patrick Donnelly about 1 month ago

It's worth noting the only difference between the two options: export_ephemeral_distributed is not hierarchical and is applied to direct descendant directories 100% of the time, whereas export_ephemeral_random is hierarchical and applied randomly (i.e. not 100% of the time). I'm open to names that better reflect how these two options are related in behavior.

#2 Updated by Patrick Donnelly about 1 month ago

  • Description updated (diff)

#3 Updated by Patrick Donnelly about 1 month ago

  • Description updated (diff)

#4 Updated by Patrick Donnelly about 1 month ago

  • Description updated (diff)

#5 Updated by Patrick Donnelly about 1 month ago

Here are some scripts shared by Dan from CERN that can be used to manually test random subtree pinning: https://github.com/cernceph/ceph-scripts/tree/master/tools/cephfs

#6 Updated by Patrick Donnelly about 1 month ago

  • Description updated (diff)

#7 Updated by Patrick Donnelly about 1 month ago

  • Description updated (diff)

#8 Updated by Sidharth Anupkrishnan about 1 month ago

Nice!
I have a doubt regarding how we could use consistent hashing for the 2nd case: "export_ephemeral_random" pinning. Since we are pinning entire subtrees, how should we hash such that every directory/file in the subtree gets pinned to a particular MDS rank (a particular segment of the circle, in consistent hashing terms), i.e. preserve hierarchical locality?

#9 Updated by Patrick Donnelly about 1 month ago

  • Assignee set to Sidharth Anupkrishnan

Sidharth, I've discussed this with Doug and we'll be assigning this to you.

Sidharth Anupkrishnan wrote:

Nice!
I have a doubt regarding how we could use consistent hashing for the 2nd case: "export_ephemeral_random" pinning. Since we are pinning entire subtrees, how should we hash such that every directory/file in the subtree gets pinned to a particular MDS rank (a particular segment of the circle, in consistent hashing terms), i.e. preserve hierarchical locality?

We would never want to set export_ephemeral_random=1.0 such that every CDir (directory fragment) is ephemerally pinned. That would indeed compromise any benefits from metadata locality. The percentage would probably be low, such as 1% or 5%.

Keep in mind that the random chance that a CDir is ephemerally pinned is determined when the CDir is created or loaded into memory. Once it has an ephemeral pin, it remains that way for the duration it is in an MDS cache.
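
To make the load-time behavior concrete, here is a toy Python model. It is a sketch under stated assumptions: the class and method names are hypothetical, fragments are identified by plain integers, and the rank choice is a placeholder (fragment id modulo rank count) rather than the consistent hash the ticket calls for. It shows only that the roll happens once when a fragment enters the cache and that the outcome sticks until eviction.

```python
import random


class EphemeralRandomCache:
    """Toy model of a cache that rolls for an ephemeral pin on load."""

    def __init__(self, probability: float, num_ranks: int, seed=None):
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must be within [0.0, 1.0]")
        self.probability = probability  # stands in for export_ephemeral_random
        self.num_ranks = num_ranks
        self.rng = random.Random(seed)
        self.pins = {}  # fragment id -> pinned rank, or None if not pinned

    def load(self, frag_id: int):
        """Load a fragment; roll for a pin only if it is not cached yet."""
        if frag_id not in self.pins:
            if self.rng.random() < self.probability:
                # Placeholder rank choice, NOT a consistent hash.
                self.pins[frag_id] = frag_id % self.num_ranks
            else:
                self.pins[frag_id] = None
        return self.pins[frag_id]

    def evict(self, frag_id: int) -> None:
        """Drop a fragment from the cache; a later load rolls again."""
        self.pins.pop(frag_id, None)


if __name__ == "__main__":
    cache = EphemeralRandomCache(probability=0.05, num_ranks=4, seed=1)
    results = [cache.load(frag) for frag in range(20000)]
    pinned = sum(rank is not None for rank in results)
    print("pinned fragments:", pinned, "of", len(results))
```

A probability of 0.0 never pins anything in this model, which matches the idea of disabling ephemeral pins for a subtree with export_ephemeral_random=0.0.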

#10 Updated by Sidharth Anupkrishnan about 1 month ago

Sounds good! I'll get on it.

#11 Updated by Patrick Donnelly 22 days ago

  • Blocks Bug #41541: mgr/volumes: ephemerally pin volumes added
