h1. Crush extension for more flexible object placement

Version 1, Jessica Mack, 06/23/2015 02:03 AM

h3. Summary

Extend CRUSH to allow more flexible object placement.

h3. Owners

* Li Wang (liwang@ubuntukylin.com)
* Lianghao Shen (lianghaoshen@ubuntukylin.com)

h3. Interested Parties

* Name (Affiliation)
* Name (Affiliation)
* Name

h3. Current Status

h3. Detailed Description

This blueprint was originally proposed by Sage at https://wiki.ceph.com/Planning/Blueprints/Dumpling/extend_crush_rule_language.

CRUSH is a deterministic, rule-based, consistent-hash-like algorithm for determining object placement in distributed storage systems. Its rule language, while already very useful, cannot express some useful placement strategies:
- The current _choose_ step iterates over the working set and recursively selects new items for each item in it. Because it always applies to every item in the working set, it precludes strategies such as "pick 2 racks, choose N from the first and M from the second".
- The hierarchy is assumed to be a single uniform tree. You cannot have two parallel trees of devices (say, SSDs and HDDs) over the same nodes and pick 1 SSD and 1 HDD while ensuring that they reside on different hosts.

Algorithm:
The following new features are proposed to address the scenarios above:
- Asymmetric choose option, e.g., _assign(N, TYPE1, TYPE2, a1, a2, ..., aN)_, which chooses _N_ buckets of type _TYPE1_ and then chooses _ai_ (i = 1, 2, ..., N) buckets of type _TYPE2_ from the i-th of those _N_ buckets.

<pre>
assign(n, type1, type2, a1, a2, ..., an) {
    counts = [a1, a2, ..., an];             // how many type2 items to take from each type1 bucket
    out1 = choose_firstn(n, type1, ...);    // choose n items of type1
    out2 = [];
    for i in 1..n {
        // choose counts[i] items of type2 from the i-th chosen bucket, out1[i]
        sub_items = choose_firstn(counts[i], type2, ...);
        out2 = out2 + sub_items;
    }
    return out2;
}
</pre>

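The _assign_ pseudocode above can be sketched in Python as follows. This is a minimal illustration rather than CRUSH itself: the toy rack/host hierarchy, the hash-ranking stand-in for _choose_firstn_, and the names @HIERARCHY@, @score@, and the @assign(x, hierarchy, counts)@ signature are all assumptions made for the example. Here _N_ is implied by the length of @counts@.

```python
import hashlib

# Hypothetical two-level hierarchy (rack -> hosts); all names are illustrative.
HIERARCHY = {
    "rack1": ["host1", "host2", "host3"],
    "rack2": ["host4", "host5", "host6"],
    "rack3": ["host7", "host8", "host9"],
}


def score(x, item):
    """Deterministic pseudo-random score for input x and a candidate item."""
    digest = hashlib.sha256(f"{x}:{item}".encode()).hexdigest()
    return int(digest, 16)


def choose_firstn(x, candidates, n):
    """Stand-in for CRUSH's choose_firstn: rank candidates by hash, keep n.

    Assumption: plain hash ranking, not the real CRUSH bucket algorithms.
    """
    return sorted(candidates, key=lambda item: score(x, item))[:n]


def assign(x, hierarchy, counts):
    """Pick len(counts) parents, then counts[i] children from the i-th parent."""
    parents = choose_firstn(x, list(hierarchy), len(counts))
    out = []
    for parent, count in zip(parents, counts):
        out.extend(choose_firstn(x, hierarchy[parent], count))
    return out


# "Pick 2 racks, choose 2 hosts from the first and 1 from the second."
placement = assign(x=1234, hierarchy=HIERARCHY, counts=[2, 1])
print(placement)
```

The same input @x@ always yields the same placement, and the per-parent counts are what the existing symmetric _choose_ step cannot express.
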
- OSD grouping, _group {devicetype|network}_, which tags the OSDs with group IDs according to the given strategy, namely by device type or by network.
- Modified chooseleaf option, _chooseleaf firstn {num} type {bucket-type} [group]_, which extends the original option so that replicas can be placed in different groups.

<pre>
group {devicetype|network...}
    for osd in pool:
        if osd.device == ssd
            osd.gid = 1;    // SSD-backed OSDs form group 1
        else
            osd.gid = 2;    // all other OSDs form group 2

choose firstn n type osd [group]
    item = crush_bucket_choose(in, x, r);
    if (group) {
        // reject the item if its group already holds a replica
        if (is_gid_selected(out, item)) {
            fgroup++;           // count the group collision
            retry_group = 1;    // retry with another item
        }
    }
</pre>

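The grouping step and the group-aware selection above can be sketched together in Python. This is a hedged illustration under stated assumptions, not the proposed implementation: the OSD inventory, the hash ranking used in place of _crush_bucket_choose_, and the names @group_by_devicetype@ and @chooseleaf_firstn_grouped@ are hypothetical. The sketch also enforces distinct hosts, which is an assumption drawn from the "1 SSD and 1 HDD on different hosts" scenario.

```python
import hashlib

# Hypothetical inventory: every host carries one SSD-backed and one HDD-backed OSD.
OSDS = {
    "osd.0": {"host": "host1", "device": "ssd"},
    "osd.1": {"host": "host1", "device": "hdd"},
    "osd.2": {"host": "host2", "device": "ssd"},
    "osd.3": {"host": "host2", "device": "hdd"},
    "osd.4": {"host": "host3", "device": "ssd"},
    "osd.5": {"host": "host3", "device": "hdd"},
}


def group_by_devicetype(osds):
    """Tag each OSD with a group id: 1 for SSDs, 2 for everything else."""
    return {name: 1 if info["device"] == "ssd" else 2 for name, info in osds.items()}


def score(x, item):
    """Deterministic pseudo-random score for input x and a candidate item."""
    digest = hashlib.sha256(f"{x}:{item}".encode()).hexdigest()
    return int(digest, 16)


def chooseleaf_firstn_grouped(x, osds, gids, n):
    """Pick up to n OSDs on distinct hosts whose group ids are all different."""
    out, used_hosts, used_gids = [], set(), set()
    # Walking the hash-ranked candidates plays the role of CRUSH's retries.
    for osd in sorted(osds, key=lambda name: score(x, name)):
        if len(out) == n:
            break
        if osds[osd]["host"] in used_hosts or gids[osd] in used_gids:
            continue  # collision: reject this candidate and try the next one
        out.append(osd)
        used_hosts.add(osds[osd]["host"])
        used_gids.add(gids[osd])
    return out


gids = group_by_devicetype(OSDS)
replicas = chooseleaf_firstn_grouped(x=1234, osds=OSDS, gids=gids, n=2)
print(replicas)
```

With only two groups defined, the sketch returns at most two OSDs however large @n@ is, since a third replica would have to reuse a group.
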
h3. Work items

h4. Coding tasks

# Task 1
# Task 2
# Task 3

h4. Build / release tasks

# Task 1
# Task 2
# Task 3

h4. Documentation tasks

# Task 1
# Task 2
# Task 3

h4. Deprecation tasks

# Task 1
# Task 2
# Task 3