Feature #41302

Updated by Patrick Donnelly over 1 year ago

Background: export pins [1] are an effective way to distribute metadata load for large workloads without the metadata balancer interfering. We have found that they can achieve the expected linear scaling of metadata throughput with MDS count. However, teaching users or applications how to apply export pins effectively is not always possible.

So, let's make it possible for the MDS to apply a pinning strategy to subtrees. The pinning does not need to be perfect; even a poor distribution is still generally better than single MDS performance. Caveat: some care must be taken to not create too many subtrees as this could degrade performance.

There are two parts to this ticket:

1. Create a new persistent (directory) inode field "export_ephemeral_distributed". This applies only to the directory and is not hierarchical. Any _direct descendant directory_ (i.e. a child directory) has an ephemeral export pin applied to it according to a consistent hash [2] of the child directory inode number. This involves each directory knowing its immediate parent's "export_ephemeral_distributed" value. Any MDS rank can figure out where such a directory should be pinned by knowing the hash (modulo 360) and the number of ranks (which can be linearly distributed points on the circle).

2. Create a new persistent (directory) inode field "export_ephemeral_random". This is hierarchical like "export_pin". Any CDir (fragment!) loaded into the cache may be ephemerally pinned to a random rank. Like "export_ephemeral_distributed", the random rank is determined by a consistent hash. Notably, if another rank is added or removed then the ephemerally pinned subtrees should be uniformly distributed across the ranks. A directory fragment (CDir) that is pinned in this way will remain pinned for as long as it is in the distributed MDS cache (i.e. some MDS has it in memory).
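
A minimal sketch of the circle mapping described in part 1, in Python for illustration only. The names `inode_angle` and `distributed_rank`, the use of SHA-256 as the hash, and the choice to give each rank an equal arc of the 360-point circle are all assumptions; the ticket does not specify them, and the MDS implementation may differ.

```python
import hashlib

CIRCLE = 360  # points on the hash circle, per the "modulo 360" in the ticket


def inode_angle(ino: int) -> int:
    """Map an inode number to a point on the circle (hypothetical hash choice)."""
    digest = hashlib.sha256(ino.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:8], "little") % CIRCLE


def distributed_rank(ino: int, num_ranks: int) -> int:
    """Pick the rank whose arc contains the inode's point.

    Assumption: ranks are evenly spaced, so rank r owns the arc
    [r * 360 / num_ranks, (r + 1) * 360 / num_ranks).
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    return inode_angle(ino) * num_ranks // CIRCLE


if __name__ == "__main__":
    # Hypothetical child directory inode numbers under one distributed parent.
    children = range(0x10000000000, 0x10000000000 + 8)
    for ino in children:
        print(hex(ino), "->", distributed_rank(ino, 3))
```

Because the result depends only on the inode number and the rank count, every rank computes the same answer without coordination. Note that 360 points is a coarse circle, and that with evenly spaced arcs a change in rank count shifts every arc boundary, so a real implementation would need a placement that limits how many directories move.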

Tests should verify

* that "export_ephemeral_distributed" is approximately uniform and only applies to the direct descendants
* that "export_ephemeral_random" is hierarchical
* that changing max_mds redistributes approximately 1/N or less of the ephemerally pinned subtrees
* that changing max_mds does not create or remove ephemerally pinned subtrees
* that export_pin overrides all ephemeral pins
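
The "approximately 1/N or less" check could be prototyped outside the cluster before writing the real test. The sketch below uses rendezvous (highest-random-weight) hashing as a stand-in placement function; that is an assumption made for illustration, not the scheme this ticket prescribes, and `score`, `rendezvous_rank`, and `moved_fraction` are hypothetical names.

```python
import hashlib


def score(ino: int, rank: int) -> int:
    """Deterministic pseudo-random weight for an (inode, rank) pair."""
    data = ino.to_bytes(8, "little") + rank.to_bytes(4, "little")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "little")


def rendezvous_rank(ino: int, num_ranks: int) -> int:
    """Stand-in placement: the rank with the highest weight wins."""
    return max(range(num_ranks), key=lambda r: score(ino, r))


def moved_fraction(inodes: list, old_ranks: int, new_ranks: int) -> float:
    """Fraction of inodes whose rank differs between the two rank counts."""
    moved = sum(
        1
        for ino in inodes
        if rendezvous_rank(ino, old_ranks) != rendezvous_rank(ino, new_ranks)
    )
    return moved / len(inodes)


if __name__ == "__main__":
    # Hypothetical inode numbers standing in for ephemerally pinned subtrees.
    inodes = list(range(1, 10001))
    print("fraction moved, 4 -> 5 ranks:", moved_fraction(inodes, 4, 5))
```

With this stand-in, growing from 4 to 5 ranks should move roughly one fifth of the inodes, all of them to the new rank. A real test would make the same comparison against the subtree map reported by the MDS before and after changing max_mds.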

Performance tests should also

* validate that large ephemeral pin changes (due to max_mds changes) do not destabilize the MDS cluster
* identify any performance degradation caused by too many pinned subtrees