Feature #18943
crush: add devices class that rules can use as a filter
Description
Problem
1. We want to have different types of devices (SSD, HDD, NVMe) backing different OSDs within the same host. For example,
    -1 24.78857 root default
    -4 24.78857     host cpach
                        # these are 2 TB HDDs
     2  1.81898         osd.2   up 1.00000 1.00000
     3  1.81898         osd.3   up 0.98000 1.00000
     4  1.81898         osd.4   up 0.98000 1.00000
     5  1.81898         osd.5   up 0.92999 1.00000
     6  1.81898         osd.6   up 1.00000 1.00000
     7  1.81898         osd.7   up 1.00000 1.00000
     8  1.81898         osd.8   up 1.00000 1.00000
    15  1.81929         osd.15  up 1.00000 1.00000
                        # these are 1 TB SSDs
     0  0.93100         osd.0   up 1.00000 1.00000
     1  0.92599         osd.1   up 1.00000 1.00000
     9  0.93100         osd.9   up 0.89999 1.00000
    10  0.93100         osd.10  up 1.00000 1.00000
    11  0.93100         osd.11  up 1.00000 1.00000
    12  0.93100         osd.12  up 0.89999 1.00000
    13  0.93100         osd.13  up 0.79999 1.00000
    14  0.93100         osd.14  up 0.70000 1.00000
    16  0.93100         osd.16  up 1.00000 1.00000
    17  0.93100         osd.17  up 0.79999 1.00000
    18  0.93140         osd.18  up 0.89999 1.00000
2. We want to preserve a hierarchical description of the cluster that is easy for the user to view and understand. For example, all OSDs on host cpach are under that node in the tree.
3. We want to make rules that apply only to a specific type of device. For example, a CephFS metadata pool that uses SSDs only.
CRUSH has a "type" that is used to describe non-device nodes (host, rack, row,
root), but all devices share the same type (type==0), and the weights in the
hierarchy are summed over all devices. For an ssd-only rule to traverse the
tree and do a placement, it would need a set of weights that includes only
ssd devices.
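To make the weight problem concrete, here is a minimal Python sketch (not Ceph code) that sums the weights under host cpach from the listing above, once per device class and once over all devices. The `OSDS` table and the `class_weights` helper are hypothetical names introduced for illustration; the weights and the hdd/ssd split are copied from the listing and its comments.

```python
from collections import defaultdict

# (osd id, crush weight, device class), copied from the host cpach listing.
OSDS = [
    (2, 1.81898, "hdd"), (3, 1.81898, "hdd"), (4, 1.81898, "hdd"),
    (5, 1.81898, "hdd"), (6, 1.81898, "hdd"), (7, 1.81898, "hdd"),
    (8, 1.81898, "hdd"), (15, 1.81929, "hdd"),
    (0, 0.93100, "ssd"), (1, 0.92599, "ssd"), (9, 0.93100, "ssd"),
    (10, 0.93100, "ssd"), (11, 0.93100, "ssd"), (12, 0.93100, "ssd"),
    (13, 0.93100, "ssd"), (14, 0.93100, "ssd"), (16, 0.93100, "ssd"),
    (17, 0.93100, "ssd"), (18, 0.93140, "ssd"),
]


def class_weights(osds):
    """Sum the crush weights separately for each device class."""
    totals = defaultdict(float)
    for _osd_id, weight, dev_class in osds:
        totals[dev_class] += weight
    return dict(totals)


weights = class_weights(OSDS)
host_weight = sum(weights.values())  # what the single shared hierarchy stores
```

The host weight in the shared tree is the sum over both classes, so an ssd-only rule that used it would see more capacity under cpach than the SSDs alone provide.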
Proposal
Extend the device command like so:
    # devices
    device 0 osd.0 class ssd
    device 1 osd.1 class ssd
    device 2 osd.2 class hdd
    device 3 osd.3 class hdd
    [...]
    device 18 osd.18 class ssd
That means we have the information, at least. We store it in two new maps at the end
of the CRUSH map. They will be used for management but not directly for mapping:
    map<int32_t,string> class_names;     // class id -> friendly name
    map<int32_t,int32_t> device_class;   // device id -> class id
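As an illustration of how the extended device lines could populate these two maps, here is a small Python sketch. It is not the CrushCompiler code: `parse_devices` is a hypothetical helper, and assigning class ids in order of first appearance is an assumption of this sketch, not something the proposal specifies.

```python
def parse_devices(lines):
    """Build (class_names, device_class) from 'device N osd.N class C' lines."""
    class_names = {}   # class id -> friendly name
    device_class = {}  # device id -> class id
    name_to_id = {}    # reverse lookup, used only while parsing
    for line in lines:
        fields = line.split()
        # Expected shape: device <id> <name> class <class>; skip anything else.
        if len(fields) != 5 or fields[0] != "device" or fields[3] != "class":
            continue
        dev_id, class_name = int(fields[1]), fields[4]
        if class_name not in name_to_id:
            # Assumption: ids are handed out in order of first appearance.
            class_id = len(name_to_id)
            name_to_id[class_name] = class_id
            class_names[class_id] = class_name
        device_class[dev_id] = name_to_id[class_name]
    return class_names, device_class


class_names, device_class = parse_devices([
    "# devices",
    "device 0 osd.0 class ssd",
    "device 1 osd.1 class ssd",
    "device 2 osd.2 class hdd",
    "device 3 osd.3 class hdd",
    "device 18 osd.18 class ssd",
])
```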
Then, we have a build_class_trees() method that runs after compilation. It will produce
a secondary set of hierarchies, like so:
    # this is the primary, user-managed hierarchy:
    root default
      host cpach
        osd.2
        osd.3
        osd.4
        ...
    # these are generated automatically
    root default~hdd
      host cpach~hdd
        osd.2
        osd.3
        ...
    root default~ssd
      host cpach~ssd
        osd.0
        osd.1
        osd.18
        ...
Note that the ~ separator (or whatever character we choose) should be an otherwise illegal
character so that these names cannot collide with buckets that the user defines
manually.
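The generation of the secondary hierarchies can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the proposed build_class_trees() implementation: the tree is modelled as a dict from bucket name to child names, `build_class_trees_sketch` is a hypothetical name, and a bucket with no devices of a class is kept as an empty clone rather than pruned.

```python
SEPARATOR = "~"  # assumed to be illegal in user-defined bucket names


def build_class_trees_sketch(buckets, device_class):
    """Clone every bucket once per device class, keeping only matching devices.

    buckets: bucket name -> list of child names (buckets or devices)
    device_class: device name -> class name
    Returns the derivative buckets as name -> list of child names.
    """
    shadow = {}

    def clone(name, cls):
        shadow_name = f"{name}{SEPARATOR}{cls}"
        children = []
        for child in buckets[name]:
            if child in buckets:
                children.append(clone(child, cls))    # nested bucket
            elif device_class.get(child) == cls:
                children.append(child)                # device of this class
        shadow[shadow_name] = children
        return shadow_name

    # A root is any bucket that is not a child of another bucket.
    all_children = {c for kids in buckets.values() for c in kids}
    roots = [name for name in buckets if name not in all_children]
    for cls in sorted(set(device_class.values())):
        for root in roots:
            clone(root, cls)
    return shadow


tree = {"default": ["cpach"], "cpach": ["osd.0", "osd.1", "osd.2", "osd.3"]}
classes = {"osd.0": "ssd", "osd.1": "ssd", "osd.2": "hdd", "osd.3": "hdd"}
derived = build_class_trees_sketch(tree, classes)
```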
Any rule can then use one of the secondary trees, like so:
    rule ssd {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 0 type host
        step emit
    }
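One way to read the `step take default class ssd` line is as a lookup of the derivative bucket `default~ssd`. The Python sketch below illustrates that reading only; `resolve_take` is a hypothetical helper and the bucket ids in `bucket_ids` are made up for the example.

```python
def resolve_take(step, bucket_ids, separator="~"):
    """Return the bucket id that a 'step take' line refers to.

    Accepted shapes: 'step take <bucket>' and 'step take <bucket> class <class>'.
    """
    fields = step.split()
    if fields[:2] != ["step", "take"] or len(fields) not in (3, 5):
        raise ValueError(f"not a take step: {step!r}")
    name = fields[2]
    if len(fields) == 5:
        if fields[3] != "class":
            raise ValueError(f"expected 'class' in: {step!r}")
        # With a class, the step takes the derivative bucket instead.
        name = f"{name}{separator}{fields[4]}"
    return bucket_ids[name]


# Hypothetical ids; only the name -> id lookup matters here.
bucket_ids = {"default": -1, "default~hdd": -7, "default~ssd": -8}
plain_id = resolve_take("step take default", bucket_ids)
ssd_id = resolve_take("step take default class ssd", bucket_ids)
```

In this reading the rule itself stays an ordinary take of a bucket id; only the id differs.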
Include the derivative bucket ids for each class in the decompiled bucket. Like id, they are
selected automatically if not specified during compile, but they are included on decompile so that a decompile -> compile cycle generates the same tree with the same bucket ids.
    host cpach {
        id -3              # do not change unnecessarily
        class-id hdd -5    # do not change unnecessarily
        class-id ssd -6    # do not change unnecessarily
        # weight 16.409
        alg straw2
        hash 0             # rjenkins1
        item osd.2 weight 1.819
    }
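The round-trip property can be illustrated with a short Python sketch. It assumes a simple allocation policy (take the highest unused negative id), which is an assumption of this example and not necessarily what the compiler does; `assign_class_ids` and the ids used are hypothetical.

```python
def assign_class_ids(bucket_ids, classes, specified):
    """Pick an id for every (bucket, class) pair.

    bucket_ids: bucket name -> id of the user-managed bucket
    specified:  (bucket name, class) -> id taken from class-id lines, if any
    Ids given in `specified` are kept; missing ones get the next unused id.
    """
    used = set(bucket_ids.values()) | set(specified.values())
    result = dict(specified)
    next_id = -1
    for name in bucket_ids:
        for cls in classes:
            if (name, cls) in result:
                continue  # already pinned by a class-id line
            while next_id in used:
                next_id -= 1
            result[(name, cls)] = next_id
            used.add(next_id)
    return result


bucket_ids = {"default": -1, "cpach": -3}
first = assign_class_ids(bucket_ids, ["hdd", "ssd"], {})
# Decompile writes `first` out as class-id lines; compiling again reuses them.
second = assign_class_ids(bucket_ids, ["hdd", "ssd"], first)
```

Because the second pass finds every pair already specified, it allocates nothing new and the ids are unchanged.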
Things to update
- CrushCompiler (as shown above)
  - step take syntax has a new argument for the device class, forcing us to use a derivative of the named bucket
    - the compiler will simply map this to the derivative bucket id; the generated TAKE op in the crush map is the same, just with a different id.
  - use the id specified by class-id for derivative buckets (if present)
- CrushWrapper gets a generate_device_type_trees() post-processing step
  - generate derivative trees for all device classes in use
  - use an otherwise illegal name (i.e. $bucketname~$class where ~ is a character that is not legal in bucket names).
- CrushCompiler decompile
  - skip derivative trees (as identified by the magic ~ or similar character)
  - include 'class-id $class $id' lines in source bucket
- ceph osd crush tree
  - show both canonical tree and derivative trees? (always, or via a flag?)
- ceph osd df tree and ceph osd tree
  - show both trees? (always, or via a flag?)
  - add new column to show device type? or prefix "osd.123" with type, e.g. "ssd osd.123"
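The two decompile items above can be sketched together in Python. This is an illustration only, not the CrushCompiler decompiler: `decompile_bucket` is a hypothetical helper that emits just the id lines, and it identifies derivative buckets purely by the ~ character in the name.

```python
def decompile_bucket(name, bucket_type, bucket_id, class_ids, separator="~"):
    """Return source text for a bucket, or None for a derivative bucket.

    class_ids: class name -> id of the derivative bucket for that class
    """
    if separator in name:
        return None  # derivative tree: skipped on decompile
    lines = [f"{bucket_type} {name} {{", f"    id {bucket_id}"]
    for cls in sorted(class_ids):
        # Keep the derivative ids so a recompile reproduces them.
        lines.append(f"    class-id {cls} {class_ids[cls]}")
    lines.append("}")
    return "\n".join(lines)


source = decompile_bucket("cpach", "host", -3, {"hdd": -5, "ssd": -6})
skipped = decompile_bucket("cpach~ssd", "host", -6, {})
```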
History
#1 Updated by Loïc Dachary about 7 years ago
Instead of
    host cpach {
        id -3              # do not change unnecessarily
        class-id hdd -5    # do not change unnecessarily
        class-id ssd -6    # do not change unnecessarily
        # weight 16.409
        alg straw2
        hash 0             # rjenkins1
        item osd.2 weight 1.819
    }
why not just
    host cpach {
        id -3              # do not change unnecessarily
        # weight 16.409
        alg straw2
        hash 0             # rjenkins1
        item osd.2 weight 1.819
    }
Is there a reason to expose the generated bucket id to the user? Also, the weight associated with the ssd bucket is unlikely to be the same as the weight of the hdd bucket. Either we need to show them entirely separately:
    host cpach~ssd {
        id hdd -5    # do not change unnecessarily
        alg straw2
        hash 0       # rjenkins1
        item osd.2 weight 0.5
    }
    host cpach~hdd {
        id hdd -5    # do not change unnecessarily
        alg straw2
        hash 0       # rjenkins1
        item osd.4 weight 2.4
    }
or we need to hide them completely.
#2 Updated by Loïc Dachary about 7 years ago
<loicd> sage: I'm confused by how we should handle the weights with the device classes. The weight of the generated buckets will have to be updated based on the weight of the leaf devices they contain. There is no way to guess what the user intended if the weights in the input crushmap have been manually updated. It also means that if we expose the generated buckets and allow the user to modify their weight via ceph osd ..., we cannot decompile the crushmap extracted from the mon without exposing each bucket explicitly.
<sage> i suggest we don't allow the user to decompile the generated buckets
<sage> and for the weights, stick with a strict hierarchical weighting.
<sage> if the user specifies weights in the main tree that don't sum up, issue a warning during compile
<sage> (they really should be doing that anyway)
<sage> *shouldn't
<loicd> ok, that makes sense to me, thanks for the input
#3 Updated by Loïc Dachary about 7 years ago
#4 Updated by Loïc Dachary about 7 years ago
- Subject changed from crush device classes to crush: add devices class that rules can use as a filte
#5 Updated by Loïc Dachary about 7 years ago
- Subject changed from crush: add devices class that rules can use as a filte to crush: add devices class that rules can use as a filter
#6 Updated by Loïc Dachary about 7 years ago
- Status changed from In Progress to Resolved
#7 Updated by Greg Farnum almost 7 years ago
- Project changed from Ceph to RADOS
- Category deleted (10)