OSDMap - primary role affinity


Allow a tunable "primary affinity" in the OSDMap to shift "primaryness" away from overloaded or struggling OSDs


  • Sage Weil (Inktank)

Interested Parties

Current Status

Currently we have two levels of weighting in the OSDMap:
  • crush weight: controls the proportional amount of data (PGs) an OSD gets. Normally measured in TB of capacity.
  • osd weight: a value from 0 to 1 that shifts data away from an OSD (0 is out, 1 is in, .5 remaps 50% of the PGs away from this OSD)

The problem is that the resulting mapping determines where both read and write traffic goes. It is generally very expensive to adjust this mapping because actual data has to be moved between devices.
However, although write traffic always goes to all OSDs that a PG maps to, read traffic normally only goes to the primary.

Detailed Description

By choosing a different primary (simply reordering the pg mapping), we can move the read workload around with minimal cost. The idea here is to add a new OSD property to the OSDMap:
  • primary_affinity -- value between 0 and 1, defined for each osd in the map
Normally this value is 1. If it is less than one, we prefer a different OSD in the crush result set with appropriate probability. For example:
  1. for PG x, CRUSH returns [a, b, c]
  2. a has primary_affinity of .5, b and c have 1
  3. with 50% probability, we will choose b or c instead of a. That is,
    1. 50%: [a, b, c]
    2. 25%: [b, a, c]
    3. 25%: [c, a, b]
    4. (this is of course deterministic, based on hash(x); it will always be one of the above)
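
The example above can be sketched in a few lines of Python. This is only an illustration of the idea, not the algorithm Ceph implements: the function names, the use of SHA-256 as the hash, and the rule for picking a replacement (weighted by the other OSDs' affinities) are all assumptions made for this sketch.

```python
import hashlib

def unit_hash(*parts):
    """Deterministically map the given parts to a float in [0, 1)."""
    data = ":".join(str(p) for p in parts).encode()
    digest = hashlib.sha256(data).digest()
    # 48 bits fit exactly in a float, so the result is always below 1.0.
    return int.from_bytes(digest[:6], "big") / 2**48

def apply_primary_affinity(pg, osds, affinity):
    """Return the CRUSH result reordered so the chosen primary is first.

    affinity maps an OSD to a value in [0, 1]; missing OSDs default to 1.
    (Hypothetical helper for illustration only.)
    """
    if not osds:
        return []
    first = osds[0]
    # Keep the CRUSH-chosen primary with probability equal to its affinity.
    if unit_hash(pg, "keep") < affinity.get(first, 1.0):
        return list(osds)
    # Otherwise pick a replacement among the rest, weighted by affinity.
    others = osds[1:]
    total = sum(affinity.get(o, 1.0) for o in others)
    if total <= 0:
        return list(osds)  # nobody else is eligible; keep the original order
    point = unit_hash(pg, "pick") * total
    for o in others:
        point -= affinity.get(o, 1.0)
        if point < 0:
            return [o] + [x for x in osds if x != o]
    return list(osds)  # floating-point fallback; keep the original order

# Same PG id always yields the same ordering.
aff = {"a": 0.5, "b": 1.0, "c": 1.0}
print(apply_primary_affinity(42, ["a", "b", "c"], aff))
```

Because every choice is derived from a hash of the PG id, all clients and OSDs that share the same map compute the same ordering without any coordination. Over many PGs the orderings approach the 50% / 25% / 25% split described above.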

There will be cli commands to adjust this value:

ceph osd primary-affinity osd.23 .5

Work items

Coding tasks

  1. create a feature bit for this feature
  2. osdmap: add the field, add it to the encoding
    1. this will be somewhat tricky: we need to encode the old format if the target encoding does not include the feature
    2. note that if the feature is present, we may consider doing a completely fresh encoding strategy that is more easily maintainable (the current approach kind of sucks!)
    3. make sure the required_features() method/helper (whatever it is called) indicates when there exists a non-1 primary_affinity
  3. osdmap: adjust the mapping function to reorder the output of the crush mapping based on the affinity
    1. note that this should happen close to the crush output, before the pg_temp potentially overrides this value. If there is a pg_temp entry, it should be used as-is regardless of what the primary-affinity is.
  4. osdmap: add a few simple unit tests that verify that a primary-affinity of 0 means that (in the absence of down/out nodes) an osd is never chosen as the primary
  5. mon: add cli methods to adjust the primary-affinity
  6. linux kernel: add support for the new osdmap encoding
  7. linux kernel: add support for the mapping primary-affinity logic
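
The feature-gated encoding in task 2 could look roughly like the following Python sketch. Everything concrete here is an assumption made for illustration: the feature bit value, the function names, and the byte layout (a version byte, a count, then float arrays) are hypothetical and do not reflect the real OSDMap wire format.

```python
import struct

# Hypothetical feature bit; the real name and value are not given here.
FEATURE_PRIMARY_AFFINITY = 1 << 0

def required_features(affinity):
    """Return the feature bits a peer needs in order to honor this map."""
    if any(a != 1.0 for a in affinity):
        return FEATURE_PRIMARY_AFFINITY
    return 0

def encode_osd_fields(weight, affinity, peer_features):
    """Encode per-OSD weights; append affinities only for capable peers."""
    new_format = bool(peer_features & FEATURE_PRIMARY_AFFINITY)
    version = 2 if new_format else 1
    out = struct.pack("<BI", version, len(weight))
    out += struct.pack("<%df" % len(weight), *weight)
    if new_format:
        out += struct.pack("<%df" % len(affinity), *affinity)
    return out

def decode_osd_fields(buf):
    """Decode either format; old-format maps get the default affinity of 1."""
    version, n = struct.unpack_from("<BI", buf, 0)
    off = struct.calcsize("<BI")
    weight = list(struct.unpack_from("<%df" % n, buf, off))
    off += 4 * n
    if version >= 2:
        affinity = list(struct.unpack_from("<%df" % n, buf, off))
    else:
        affinity = [1.0] * n
    return weight, affinity

weight = [1.0, 0.5]
affinity = [0.5, 1.0]
old = encode_osd_fields(weight, affinity, peer_features=0)
new = encode_osd_fields(weight, affinity, FEATURE_PRIMARY_AFFINITY)
print(decode_osd_fields(old))  # affinities fall back to 1.0
print(decode_osd_fields(new))  # affinities round-trip
```

An old peer would silently treat every OSD as having affinity 1 and could therefore disagree about which OSD is primary, which is why required_features() needs to report the bit whenever any affinity differs from 1.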

Documentation tasks

  1. document the feature!