Feature #18943

crush: add devices class that rules can use as a filter

Added by Loïc Dachary about 7 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

Problem

1. We want to have different types of devices (SSD, HDD, NVMe) backing different OSDs within the same host. For example,

-1 24.78857 root default                                              
-4 24.78857     host cpach                                            
# these are 2 TB HDDs
 2  1.81898             osd.2            up  1.00000          1.00000 
 3  1.81898             osd.3            up  0.98000          1.00000 
 4  1.81898             osd.4            up  0.98000          1.00000 
 5  1.81898             osd.5            up  0.92999          1.00000 
 6  1.81898             osd.6            up  1.00000          1.00000 
 7  1.81898             osd.7            up  1.00000          1.00000 
 8  1.81898             osd.8            up  1.00000          1.00000 
15  1.81929             osd.15           up  1.00000          1.00000 
# these are 1 TB SSDs
 0  0.93100             osd.0            up  1.00000          1.00000 
 1  0.92599             osd.1            up  1.00000          1.00000 
 9  0.93100             osd.9            up  0.89999          1.00000 
10  0.93100             osd.10           up  1.00000          1.00000 
11  0.93100             osd.11           up  1.00000          1.00000 
12  0.93100             osd.12           up  0.89999          1.00000 
13  0.93100             osd.13           up  0.79999          1.00000 
14  0.93100             osd.14           up  0.70000          1.00000 
16  0.93100             osd.16           up  1.00000          1.00000 
17  0.93100             osd.17           up  0.79999          1.00000 
18  0.93140             osd.18           up  0.89999          1.00000 

2. We want to preserve a hierarchical description of the cluster that is easy to view and understand by the user. For example, all OSDs on host cpach are under that node in the tree.

3. We want to make rules that can apply to a specific type of device. For example, a cephfs metadata pool that uses SSDs only.

CRUSH has a "type" that is used to describe non-device nodes (host, rack, row,
root), but all devices share the same type (type==0), and the weights in the
hierarchy are summed over all devices regardless of their kind. For an ssd-only
rule to traverse the tree and do a placement, it would need a set of weights
that only includes ssd devices.
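
To make the weight problem concrete, here is a minimal sketch that sums device weights per class instead of over all devices. The `Device` struct and `weight_by_class` function are hypothetical illustrations and are not part of CRUSH.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a CRUSH leaf device.
struct Device {
    int id;
    double weight;
    std::string device_class;  // e.g. "hdd" or "ssd"
};

// Sum device weights separately for each class. A single per-host weight
// mixes all classes, which is why an ssd-only rule needs its own totals.
std::map<std::string, double> weight_by_class(const std::vector<Device>& devices) {
    std::map<std::string, double> totals;
    for (const Device& d : devices) {
        assert(d.weight >= 0.0);  // weights are never negative
        totals[d.device_class] += d.weight;
    }
    return totals;
}
```

With the host above, the "hdd" total would cover osd.2 through osd.8 and osd.15, while the "ssd" total would cover the remaining OSDs.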

Proposal

Extend the device declarations in the CRUSH map source like so:

    # devices
    device 0 osd.0 class ssd
    device 1 osd.1 class ssd
    device 2 osd.2 class hdd
    device 3 osd.3 class hdd

    [...]

    device 18 osd.18 class ssd

That way we have the information, at least. We store it in two new maps at the end
of the CRUSH map. They will be used for management but not directly for mapping:

    map<int32_t,string> class_names;       // class id -> friendly name
    map<int32_t,int32_t> device_class;     // device id -> class id
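
As a rough sketch of how these two maps might be maintained, the wrapper below assigns a class to a device and looks it up again. The `DeviceClasses` struct and its methods are assumptions made for illustration, as is the choice of sequential class ids starting at 0.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical holder for the two maps proposed above.
struct DeviceClasses {
    std::map<int32_t, std::string> class_names;  // class id -> friendly name
    std::map<int32_t, int32_t> device_class;     // device id -> class id

    // Return the id of `name`, allocating the next free id if it is new.
    // (Sequential ids are an assumption of this sketch.)
    int32_t get_or_create_class_id(const std::string& name) {
        for (const auto& kv : class_names) {
            if (kv.second == name) return kv.first;
        }
        int32_t id = class_names.empty() ? 0 : class_names.rbegin()->first + 1;
        class_names[id] = name;
        return id;
    }

    // Record "device <id> osd.<id> class <name>".
    void set_device_class(int32_t device, const std::string& name) {
        device_class[device] = get_or_create_class_id(name);
    }

    // Friendly class name of a device, or "" if it has no class.
    std::string class_of(int32_t device) const {
        auto it = device_class.find(device);
        if (it == device_class.end()) return "";
        auto name = class_names.find(it->second);
        assert(name != class_names.end());  // every stored class id has a name
        return name->second;
    }
};
```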

Then, we have a build_class_trees() method that runs after compilation. It will produce
a secondary set of hierarchies, like so:

    # this is the primary, user-managed hierarchy:
    root default
    host cpach
    osd.2
    osd.3
    osd.4
    ...
    # these are generated automatically
    root default~hdd
    host cpach~hdd
    osd.2
    osd.3
    ...

    root default~ssd
    host cpach~ssd
    osd.0
    osd.1
    osd.18
    ...

Note that the ~ separator (or whatever character we choose) should be an otherwise illegal
character so that these names cannot collide with buckets the user manually defines
themselves.
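
The generation of the secondary hierarchies could be sketched as follows. The `Bucket` struct is a simplified, hypothetical model (no weights, ids or algorithms), the class map uses class names directly instead of class ids, and dropping buckets that end up empty is an assumption of this sketch, not something the proposal specifies.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified bucket: a name, nested buckets and leaf device ids.
struct Bucket {
    std::string name;
    std::vector<Bucket> children;
    std::vector<int> devices;
};

// Copy `src`, keeping only devices of class `cls`. Every bucket in the copy
// is renamed to "<name>~<cls>". Buckets left empty are dropped (assumption).
Bucket shadow_tree(const Bucket& src, const std::string& cls,
                   const std::map<int, std::string>& device_class) {
    assert(!cls.empty());
    Bucket out;
    out.name = src.name + "~" + cls;
    for (int dev : src.devices) {
        auto it = device_class.find(dev);
        if (it != device_class.end() && it->second == cls) {
            out.devices.push_back(dev);
        }
    }
    for (const Bucket& child : src.children) {
        Bucket copy = shadow_tree(child, cls, device_class);
        if (!copy.devices.empty() || !copy.children.empty()) {
            out.children.push_back(copy);
        }
    }
    return out;
}

// Build one secondary tree per device class in use, ordered by class name.
std::vector<Bucket> build_class_trees(const Bucket& root,
                                      const std::map<int, std::string>& device_class) {
    std::set<std::string> classes;
    for (const auto& kv : device_class) classes.insert(kv.second);
    std::vector<Bucket> trees;
    for (const std::string& cls : classes) {
        trees.push_back(shadow_tree(root, cls, device_class));
    }
    return trees;
}
```

The user-managed tree is left untouched; only copies are produced, so the primary hierarchy stays easy to read.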

Any rule can then use one of the secondary trees, like so:

    rule ssd {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default class ssd 
            step chooseleaf firstn 0 type host
            step emit
    }
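
One way the compiler could resolve such a step is to look up the generated bucket by its derived name and emit an ordinary TAKE with that id. The sketch below assumes a hypothetical name-to-id table that already contains the generated `name~class` buckets; `resolve_take` is not an existing CrushCompiler function.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Resolve "step take <bucket> [class <cls>]" to a bucket id.
// `bucket_ids` is a hypothetical table mapping bucket names, including the
// generated "<bucket>~<cls>" names, to their ids.
int resolve_take(const std::map<std::string, int>& bucket_ids,
                 const std::string& bucket,
                 const std::string& cls = "") {
    // Without a class the step refers to the primary bucket itself.
    const std::string name = cls.empty() ? bucket : bucket + "~" + cls;
    auto it = bucket_ids.find(name);
    if (it == bucket_ids.end()) {
        throw std::invalid_argument("unknown bucket: " + name);
    }
    return it->second;
}
```

The resulting TAKE operation is unchanged in form; only the id it carries differs.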

Include the derivative bucket ids for each class in the decompiled bucket. Like the bucket id, they are
selected automatically if not specified at compile time, but they are included on decompile so that a decompile -> compile cycle generates the same tree with the same bucket ids.

    host cpach {
            id -3           # do not change unnecessarily
            class-id hdd -5 # do not change unnecessarily
            class-id ssd -6 # do not change unnecessarily
            # weight 16.409
            alg straw2
            hash 0  # rjenkins1
            item osd.2 weight 1.819
    }
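
The id selection described above might look like the following sketch: ids given by `class-id` lines are honored first, and the remaining classes receive the first unused negative id. The function name and the allocation order are assumptions for illustration; the real compiler may choose ids differently.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Choose ids for the derivative buckets of one source bucket.
// `specified` holds ids taken from "class-id <class> <id>" lines.
// `used` holds every bucket id already taken and is updated in place.
std::map<std::string, int> assign_class_ids(const std::set<std::string>& classes,
                                            const std::map<std::string, int>& specified,
                                            std::set<int>& used) {
    std::map<std::string, int> out;
    // First pass: honor explicit ids so that recompiling keeps them stable.
    for (const std::string& cls : classes) {
        auto it = specified.find(cls);
        if (it != specified.end()) {
            assert(it->second < 0);  // CRUSH bucket ids are negative
            out[cls] = it->second;
            used.insert(it->second);
        }
    }
    // Second pass: give every other class the first unused negative id.
    int next = -1;
    for (const std::string& cls : classes) {
        if (out.count(cls)) continue;
        while (used.count(next)) --next;
        out[cls] = next;
        used.insert(next);
    }
    return out;
}
```

Feeding the returned ids back in as `specified` yields the same assignment, which is the property the decompile -> compile cycle relies on.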

Things to update

  • CrushCompiler (as shown above)
    • step take syntax has a new argument for the device class, forcing us to use a derivative of the named bucket
    • the compiler will simply map this to the derivative bucket id; the generated TAKE op in the crush map is the same--just a different id.
    • use the class-id specified id for derivative buckets (if present)
  • CrushWrapper gets a generate_device_type_trees() post-processing step
    • generate derivative trees for all device classes in use
    • use an otherwise illegal name (i.e. $bucketname~$class where ~ is a character that is not legal for bucket names).
  • CrushCompiler decompile
    • skip derivative trees (as identified by the magic ~ or similar character)
    • include 'class-id $class $id' lines in source bucket
  • ceph osd crush tree
    • show both canonical tree and derivative trees? (always, or via a flag?)
  • ceph osd df tree and ceph osd tree
    • show both trees? (always, or via a flag?)
    • add new column to show device type? or prefix "osd.123" with type, e.g. "ssd osd.123"

History

#1 Updated by Loïc Dachary about 7 years ago

Instead of

    host cpach {
            id -3           # do not change unnecessarily
            class-id hdd -5 # do not change unnecessarily
            class-id ssd -6 # do not change unnecessarily
            # weight 16.409
            alg straw2
            hash 0  # rjenkins1
            item osd.2 weight 1.819
    }

why not just

    host cpach {
            id -3           # do not change unnecessarily
            # weight 16.409
            alg straw2
            hash 0  # rjenkins1
            item osd.2 weight 1.819
    }

Is there a reason to expose the generated bucket id to the user? Also, the weight associated with the ssd bucket is unlikely to be the same as the weight of the hdd bucket. Either we need to show them entirely separately:

    host cpach~ssd {
            id hdd -5 # do not change unnecessarily
            alg straw2
            hash 0  # rjenkins1
            item osd.2 weight 0.5
    }

    host cpach~hdd {
            id hdd -5 # do not change unnecessarily
            alg straw2
            hash 0  # rjenkins1
            item osd.4 weight 2.4
    }

or we need to hide them completely.

#2 Updated by Loïc Dachary about 7 years ago

<loicd> sage: I'm confused by how we should handle the weights with the device classes. The weight of the generated buckets will have to be updated based on the weight of the leaf devices they contain. There is no way to guess what the user intended if the weights in the input crushmap have been manually updated. It also means that if we expose the generated buckets and allow the user to modify their weight via ceph osd ..., we cannot decompile the crushmap extracted from the mon without exposing each bucket explicitly.
<sage> i suggest we don't allow the user to decompile the generated buckets
<sage> and for the weights, stick with a strict hierarchical weighting.
<sage> if the user specifies weights in the main tree that don't sum up, issue a warning during compile
<sage> (they really should be doing that anyway)
<sage> *shouldn't
<loicd> ok, that makes sense to me, thanks for the input
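
The compile-time check suggested in this exchange could be sketched as below: walk the main tree and report every bucket whose declared weight differs from the sum of its children's weights. The `Node` struct, the function name and the tolerance parameter are hypothetical and only illustrate the idea of strict hierarchical weighting.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical tree node: a bucket, or a leaf device when `children` is empty.
struct Node {
    std::string name;
    double weight;
    std::vector<Node> children;
};

// Append to `out` the name of every bucket whose weight does not match the
// sum of its direct children's weights within `tol`. Children are checked
// before their parent, so deeper mismatches are reported first.
void find_weight_mismatches(const Node& node, double tol,
                            std::vector<std::string>& out) {
    assert(tol >= 0.0);
    if (node.children.empty()) return;  // leaf device, nothing to sum
    double sum = 0.0;
    for (const Node& child : node.children) {
        sum += child.weight;
        find_weight_mismatches(child, tol, out);
    }
    if (std::fabs(sum - node.weight) > tol) {
        out.push_back(node.name);
    }
}
```

A compiler following the suggestion above would turn each reported name into a warning instead of failing the compile.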

#4 Updated by Loïc Dachary about 7 years ago

  • Subject changed from crush device classes to crush: add devices class that rules can use as a filte

#5 Updated by Loïc Dachary about 7 years ago

  • Subject changed from crush: add devices class that rules can use as a filte to crush: add devices class that rules can use as a filter

#6 Updated by Loïc Dachary about 7 years ago

  • Status changed from In Progress to Resolved

#7 Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (10)
