CRUSH extension for more flexible object placement


Extend CRUSH to allow more flexible object placement


Interested Parties

  • Name (Affiliation)
  • Name (Affiliation)
  • Name

Current Status

Detailed Description

This blueprint was originally proposed by Sage at

CRUSH is a deterministic, rule-based, consistent-hash-like algorithm (with some very nice properties) for determining object placement in distributed storage systems. Its rule language is already very useful, but it cannot express some useful placement rules:
- The current choose step iterates over the 'working set' and recursively selects new items for each item in it. It always applies to every item in the working set, which precludes strategies like "pick 2 racks, choose N from the first, and M from the second".
- The hierarchy is assumed to be a single uniform tree. You cannot have two parallel trees of devices (say, SSDs and HDDs) in the same nodes and pick 1 SSD and 1 HDD while ensuring that they reside on different hosts.

New features should be developed to cover these scenarios:
- Asymmetric choose option, e.g., assign(N, TYPE1, TYPE2, a1, a2, ..., aN), which chooses N buckets of TYPE1 and then chooses ai (i = 1, 2, ..., N) buckets of TYPE2 from the i-th TYPE1 bucket.

assign(n, type1, type2, a1, a2, ..., an) {
    map = [a1, a2, ..., an];
    out1 = choose_firstn(n, type1, ...);  // choose n items of type1
    out2 = [];
    for i in 1..n {
        // choose map[i] items of type2 from the i-th type1 bucket
        sub_items = choose_firstn(map[i], type2, ...);
        out2 = out2 + sub_items;
    }
    return out2;
}
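
The assign step above can be sketched in Python. This is only a toy model under stated assumptions: stable_hash is a SHA-256 stand-in for the real CRUSH hash, choose_firstn here simply ranks candidates by that hash instead of walking weighted buckets, and the hierarchy is a hypothetical two-level dict of racks and hosts.

```python
import hashlib

def stable_hash(*parts):
    """Deterministic stand-in for the CRUSH hash (assumption: not the real hash)."""
    data = ":".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def choose_firstn(x, candidates, n):
    """Toy choose: rank candidates by a per-input hash and keep the first n."""
    ranked = sorted(candidates, key=lambda c: stable_hash(x, c), reverse=True)
    return ranked[:n]

def assign(x, hierarchy, counts):
    """Choose len(counts) parent buckets, then counts[i] children from the i-th one."""
    parents = choose_firstn(x, sorted(hierarchy), len(counts))
    out = []
    for parent, want in zip(parents, counts):
        out.extend(choose_firstn(x, hierarchy[parent], want))
    return out

# Hypothetical hierarchy: three racks with three hosts each.
racks = {
    "rack1": ["host1", "host2", "host3"],
    "rack2": ["host4", "host5", "host6"],
    "rack3": ["host7", "host8", "host9"],
}

# "Pick 2 racks, choose 2 hosts from the first and 1 from the second."
placement = assign(42, racks, [2, 1])
```

The per-bucket counts are passed as a list, so the asymmetric case (N from the first rack, M from the second) needs no special handling beyond pairing each chosen parent with its own count.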

- OSD grouping, group {devicetype|network}, which tags each OSD with a group id according to the given strategy, namely by device type or by network.
- Modified chooseleaf option, chooseleaf firstn {num} type {bucket-type} [group], which extends the original chooseleaf so that replicas can be placed in different groups.

group {devicetype|network...}
    for each osd in pool?
        if osd.device == ssd
            osd.gid = 1;
        else
            osd.gid = 2;
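
As a rough illustration of the grouping step, the sketch below tags each OSD with a group id. The Osd record and its fields are hypothetical, and instead of hard-coding ssd = 1 and everything else = 2, it hands out ids in order of first appearance. That is a generalisation assumed here so the same function also covers the network strategy.

```python
from dataclasses import dataclass

@dataclass
class Osd:
    id: int
    device: str    # e.g. "ssd" or "hdd"
    network: str   # e.g. a subnet label
    gid: int = 0   # 0 means "not grouped yet"

def group_osds(osds, strategy):
    """Tag each OSD with a group id; one id per distinct attribute value."""
    if strategy not in ("devicetype", "network"):
        raise ValueError(f"unknown grouping strategy: {strategy}")
    attr = "device" if strategy == "devicetype" else "network"
    gids = {}
    for osd in osds:
        key = getattr(osd, attr)
        # First time a value is seen it gets the next free id (1, 2, ...).
        osd.gid = gids.setdefault(key, len(gids) + 1)
    return gids

osds = [
    Osd(0, "ssd", "net-a"),
    Osd(1, "hdd", "net-a"),
    Osd(2, "ssd", "net-b"),
    Osd(3, "hdd", "net-b"),
]
mapping = group_osds(osds, "devicetype")
```

With the sample list, the SSD OSDs end up in group 1 and the HDD OSDs in group 2, matching the pseudocode above because an SSD happens to appear first.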

choose firstn n type osd [group]
    item = crush_bucket_choose(in, x, r);
    if (is_gid_selected(out, item)) {
        // an item from this group is already in out: retry
        retry_group = 1;
    }
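
The group-aware retry can be sketched as follows. This is a simplified model, not the CRUSH implementation: stable_hash stands in for the real hash, the draw is a plain modulo over a flat OSD list rather than crush_bucket_choose over weighted buckets, and max_tries is a hypothetical retry bound.

```python
import hashlib

def stable_hash(*parts):
    """Deterministic stand-in for the CRUSH hash (assumption: not the real hash)."""
    data = ":".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def choose_firstn_grouped(x, osd_gids, n, max_tries=50):
    """Pick up to n OSDs whose group ids are pairwise distinct.

    osd_gids maps an OSD id to its group id.
    """
    ids = sorted(osd_gids)
    out = []
    used_gids = set()
    r = 0
    while len(out) < n and r < max_tries:
        item = ids[stable_hash(x, r) % len(ids)]  # toy replacement for the bucket draw
        r += 1
        if item in out:
            continue  # collision: this OSD was already selected
        if osd_gids[item] in used_gids:
            continue  # group already represented: retry with the next r
        out.append(item)
        used_gids.add(osd_gids[item])
    return out

# Hypothetical cluster: OSDs 0 and 2 are in group 1, OSDs 1 and 3 in group 2.
osd_gids = {0: 1, 1: 2, 2: 1, 3: 2}
replicas = choose_firstn_grouped(7, osd_gids, 2)
```

If more replicas are requested than there are groups, this sketch returns a short result once the retry bound is reached. How the real rule should behave in that case is left open here.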

Work items

Coding tasks

  1. Task 1
  2. Task 2
  3. Task 3

Build / release tasks

  1. Task 1
  2. Task 2
  3. Task 3

Documentation tasks

  1. Task 1
  2. Task 2
  3. Task 3

Deprecation tasks

  1. Task 1
  2. Task 2
  3. Task 3