Feature #9410

Crush optimization for unbalanced data/pg distribution

Added by jian zhang over 9 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Hi all,
Several months ago we hit a read performance issue (17% degradation) while evaluating Ceph object storage performance with 10M objects (scaling from 10K objects to 1 million objects), and found the root cause to be unbalanced PG distribution across all OSD disks, leading to unbalanced data distribution. Further investigation identified that CRUSH failed to map PGs evenly to each OSD. Please refer to the attached pdf (crush_proposals) for details.

Key Message:
As mentioned in the attached pdf, we described possible optimization proposals for CRUSH and got feedback from the community (http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/18979). Sage suggested we take the idea of "changing the placement strategy only for the step of selecting devices from hosts" by adding a new bucket type called "linear" and applying a modulo-like hash function to this kind of bucket to achieve a balanced distribution. We followed this suggestion and designed an optimized CRUSH algorithm with new hash methods and an adaptive module; please refer to the Design and Implementation part for details. We also wrote a POC for it (see the attached patch), and as a result we got more than 10% read performance improvement with the optimized CRUSH algorithm.

Design and Implementation:
1. Problem Identification
1.1 Input key (pps) space of CRUSH is not uniform
Since the PG count on the devices nested under a host is not uniform even when the device is selected with a simple modulo operation, we decided to change the algorithm that hashes the raw pg to a pps.
1.2 Algorithm of selecting items from buckets is not uniform
Once we have a uniform input key space, the procedure of selecting devices from a host must also be uniform. The current CRUSH algorithm uses Jenkins-hash-based strategies and fails to reach this goal, so we decided to add a new bucket type with a new (modulo-based) hash algorithm.
2. Design
2.1 New pps hash algorithm
We designed the new pps hash algorithm based on a congruential pseudo-random number generator (http://comjnl.oxfordjournals.org/content/10/1/74.full.pdf). It defines a bijection between the original sequence {0,...,2^N-1} and some permutation of it. In other words, given different keys between 0 and 2^N-1, the generator produces different integers within the same range {0,...,2^N-1}.
Assume there are np PGs in a pool. We regard pgid (0 ≤ pgid < 2^n, where np ≤ 2^n < 2*np) as the key, and it is hashed to a pps value between 0 and 2^n-1. Since the PG count of a pool is usually a power of two, in that case the generator simply shuffles the original pgid sequence, so the key space is a permutation of {0,...,2^n-1}, which achieves the best uniformity. Moreover, poolid can be used as a seed in the generator, producing different pps values even for the same pgid in different pools. Therefore, the pgid sequences of different pools are mapped to distinct pps sequences, avoiding PG overlap.
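Below is a minimal sketch of such a generator, assuming a plain linear congruential permutation pps = (a*pgid + b) mod 2^n, which is a bijection on {0,...,2^n-1} whenever the multiplier a is odd; the constants and the name pps_hash are illustrative only, not taken from the attached patch:

    #include <stdint.h>

    /* Congruential pps hash sketch: with an odd multiplier a, the map
     * pgid -> (a * pgid + b) mod 2^n permutes {0, ..., 2^n - 1}.  The
     * pool id is folded into the additive constant so that different
     * pools get different permutations of the same pgid range. */
    static uint32_t pps_hash(uint32_t pgid, uint32_t poolid, uint32_t pg_num_mask)
    {
        /* pg_num_mask = 2^n - 1, with np <= 2^n < 2 * np */
        const uint32_t a = 2654435761u;          /* any odd multiplier works */
        uint32_t b = poolid * 0x9e3779b9u + 1u;  /* pool-dependent seed/offset */
        return (a * pgid + b) & pg_num_mask;     /* result in 0 .. 2^n - 1 */
    }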
2.2 New bucket type, Linear
We introduce a new bucket type called "linear" and apply a new modulo-based hash algorithm to it. The pps values assigned to a given host are a pseudo-random subset of the original permutation and may not be uniform, so applying the modulo operation directly to the integers in this subset does not produce a balanced distribution among the disks in the host. To decrease the deviation of the subset, we scale the key by 1/balance_param before applying the modulo operation.
For OSD failure and recovery, the linear bucket assumes that nested items are neither removed nor added, the same assumption as the uniform bucket, so a linear host does not introduce more data movement than a uniform bucket.
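A minimal sketch of item selection in a linear bucket, assuming the pps value and balance_param are already known for this host; bucket_linear_choose is an illustrative name, not the function from the attached patch:

    #include <stdint.h>

    /* Linear bucket sketch: scale the key by 1/balance_param to smooth
     * out non-uniformity in the pps subset seen by this host, then pick
     * a nested item (disk) by modulo. */
    static int bucket_linear_choose(uint32_t pps, uint32_t balance_param,
                                    uint32_t num_items)
    {
        uint32_t key = pps / balance_param;   /* apply 1/balance_param */
        return (int)(key % num_items);        /* select a nested item */
    }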
2.3 Adaptive Strategy
No single constant balance parameter yields the best PG distribution in all cases, so we make the choice adaptive: the balance parameter is adjusted automatically during the preparation for creating a new pool, according to the cluster topology, PG count and replica count, in order to obtain the most uniform distribution.
1) Try different balance_param values when preparing a new pool
  - Iteratively call CRUSH to get the PG distribution under each candidate balance_param
  - Calculate the stdev of the PG count across all OSDs
  - Choose the balance_param with the minimal stdev
2) Add a member variable to the pool struct pg_pool_t to save the best balance_param value
The adaptive procedure can be described as follows (a C sketch is given after the pseudocode):
Input: cluster map, total PG number m, adaptive retry times n
Output: local optimal balance parameter balance_param

min_pg_stdev = MAX;
balance_param = a;  // initial value
for trial from 0 to n {
    for pgid from 0 to m {
        calculate pps using the new generator in 2.1;
        for bucket b in cluster map  // apply CRUSH algorithm
            apply the corresponding bucket hashing algorithm and get an osd list for pgid
    }
    calculate pg_stdev_a by gathering all osd lists;  // stdev of PG distribution among all osds
    if pg_stdev_a < min_pg_stdev {
        min_pg_stdev = pg_stdev_a;
        balance_param = a;
    }
    adjust a to a new value;
}
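A minimal C sketch of this search, assuming a hypothetical hook map_fn that runs CRUSH for one pgid under a given balance_param; the helper type, names, and candidate list are illustrative, while the real patch performs this iteration during pool preparation:

    #include <math.h>
    #include <stdint.h>

    #define MAX_OSDS     1024
    #define MAX_REPLICAS 16

    /* Hypothetical hook: maps one pgid with the given balance_param,
     * fills 'osds' with the chosen OSD ids, returns replica count. */
    typedef int (*pg_to_osds_fn)(uint32_t pgid, double balance_param,
                                 int *osds, int max_osds);

    /* Standard deviation of the per-OSD PG counts. */
    static double pg_count_stdev(const int *pg_per_osd, int num_osds)
    {
        double sum = 0.0, var = 0.0;
        for (int i = 0; i < num_osds; i++)
            sum += pg_per_osd[i];
        double mean = sum / num_osds;
        for (int i = 0; i < num_osds; i++)
            var += (pg_per_osd[i] - mean) * (pg_per_osd[i] - mean);
        return sqrt(var / num_osds);
    }

    /* Try each candidate balance_param, map every PG of the new pool,
     * and keep the parameter that gives the smallest PG-count stdev. */
    double choose_balance_param(pg_to_osds_fn map_fn, uint32_t pg_num,
                                int num_osds, const double *candidates,
                                int num_candidates)
    {
        double best_param = candidates[0];
        double min_stdev = INFINITY;

        for (int c = 0; c < num_candidates; c++) {
            int pg_per_osd[MAX_OSDS] = { 0 };
            for (uint32_t pgid = 0; pgid < pg_num; pgid++) {
                int osds[MAX_REPLICAS];
                int n = map_fn(pgid, candidates[c], osds, MAX_REPLICAS);
                for (int r = 0; r < n; r++)
                    pg_per_osd[osds[r]]++;
            }
            double stdev = pg_count_stdev(pg_per_osd, num_osds);
            if (stdev < min_stdev) {
                min_stdev = stdev;
                best_param = candidates[c];
            }
        }
        return best_param;
    }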

Evaluation:
We evaluated the performance of the optimized and the current CRUSH in a cluster of 4 hosts, each with 10x1TB disks. We designed two test cases: the first creates a pool with 2048 PGs and 2 replicas, prepares 100 million 128KB objects, and then measures read performance on these objects; the second creates a pool with 2048 PGs and 3 replicas, prepares 1 million 10MB objects, and then measures read performance on these objects.
We compared the PG & data distribution and read performance of the two CRUSH algorithms, and got results as follows:
1. PG and data distribution is more balanced using optimized CRUSH algorithm
  a) For 2048 PGs with 3 replicas, the stdev of the PG count across 40 OSDs decreases from 12.13 to 5.17; for 2048 PGs with 2 replicas, the stdev of the PG count across 40 OSDs decreases from 10.09 to 6.50
  b) For 1 million 10MB objects with 3 replicas, the stdev of disk use% across 40 OSDs decreases from 0.068 to 0.026; for 100 million 128KB objects with 2 replicas, the stdev of disk use% across 40 OSDs decreases from 0.067 to 0.042
2. Large-scale performance is improved since data distribution is more balanced
  a) More than 10% read performance improvement for both 128KB and 10MB objects
  b) Write performance is not impacted
Detailed performance data can be found in the attached pdf (crush_optimization).

We also created a pull request: https://github.com/ceph/ceph/pull/2402

adaptive-crush-modify.patch - Patch for CRUSH optimization (33.7 KB) jian zhang, 09/09/2014 06:25 PM

crush_proposals.pdf - Previous Performance Data and Optimization Proposals (926 KB) jian zhang, 09/09/2014 06:25 PM

crush_optimization.pdf - CRUSH optimization design and implementation (594 KB) jian zhang, 09/09/2014 06:25 PM
