Jessica Mack, 06/22/2015 03:17 AM

h1. Osdmap - primary role affinity

h3. Summary

Allow a tunable "primary affinity" in the OSDMap to shift "primaryness" away from overloaded or struggling OSDs.

h3. Owners

* Sage Weil (Inktank)

h3. Interested Parties

* Danny Al-Gaaf <danny.al-gaaf@bisect.de>
* Anip Patel (Arizona State University, student)

h3. Current Status

Please describe the current status of Ceph as it relates to this blueprint.  Is there something that this replaces?  Are there current features that are related?

Currently we have two levels of weighting in the OSDMap:
* crush weight: controls the proportional amount of data (PGs) an OSD gets.  Normally measured in TB of capacity.
* osd weight: a value from 0 to 1 that shifts data away from an OSD (0 == out, 1 == in, .5 == remap 50% of the PGs away from this OSD)
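One way to picture the osd weight is as a deterministic, per-PG rejection test: each (PG, OSD) pair hashes to a fixed draw, and the PG is remapped away when the draw falls above the weight. The sketch below only illustrates that idea. The function name @is_rejected@ and the use of SHA-256 are assumptions made for the example, not Ceph's actual code or hash.

```python
import hashlib


def is_rejected(pg, osd, osd_weight):
    """Return True if this PG should be remapped away from `osd`.

    Illustrative only: `is_rejected` and the SHA-256 draw are assumptions.
    """
    if osd_weight >= 1.0:
        return False  # fully "in": never reject
    if osd_weight <= 0.0:
        return True   # "out": always reject
    # Deterministic draw in [0, 1) derived from the (pg, osd) pair.
    digest = hashlib.sha256(f"{pg}:{osd}".encode()).digest()
    draw = int.from_bytes(digest[:8], "big") / 2**64
    return draw >= osd_weight
```

Because the draw depends only on the PG and the OSD, the same PGs are rejected every time the map is evaluated, and a weight of .5 rejects roughly half of them.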

The problem is that the resulting mapping is where both read and write traffic goes.  It is generally very expensive to adjust this mapping because actual data has to be moved between devices.
However, although write traffic always goes to all OSDs that a PG maps to, read traffic normally only goes to the primary.

h3. Detailed Description

By choosing a different primary (simply reordering the pg mapping), we can move the read workload around with minimal cost.  The idea here is to add a new OSD property to the OSDMap:
* primary_affinity -- value between 0 and 1, defined for each osd in the map

Normally this value is 1.  If it is less than one, we prefer a different OSD in the crush result set with probability (1 - primary_affinity).  For example:
# for PG x, CRUSH returns [a, b, c]
# a has primary_affinity of .5, b and c have 1
# with 50% probability, we will choose b or c instead of a.  that is,
## 50%: [a, b, c]
## 25%: [b, a, c]
## 25%: [c, a, b]
## (this is of course deterministic, based on hash(x); it will always be one of the above)
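The example above can be sketched as follows. This is one possible reading of the example, not the implementation Ceph uses: the function names, the SHA-256 hash, and the rule of picking a replacement primary in proportion to the other OSDs' affinities are all assumptions made for illustration.

```python
import hashlib


def _unit_hash(*parts):
    """Map the given parts to a deterministic float in [0, 1)."""
    digest = hashlib.sha256(":".join(map(str, parts)).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def apply_primary_affinity(pg, osds, affinity):
    """Return `osds` reordered so that the chosen primary comes first.

    `affinity` maps OSD name -> primary_affinity; missing entries mean 1.
    Hypothetical helper, written only to mirror the example in the text.
    """
    if not osds:
        return []
    first = osds[0]
    # Keep the CRUSH-chosen primary with probability equal to its affinity.
    if _unit_hash(pg, "keep", first) < affinity.get(first, 1.0):
        return list(osds)
    # Otherwise pick one of the other OSDs, weighted by their affinities.
    others = [o for o in osds[1:] if affinity.get(o, 1.0) > 0]
    total = sum(affinity.get(o, 1.0) for o in others)
    if total == 0:
        return list(osds)  # nobody else is eligible; keep the original order
    point = _unit_hash(pg, "pick") * total
    chosen = others[-1]  # fallback guards against floating-point rounding
    for o in others:
        point -= affinity.get(o, 1.0)
        if point < 0:
            chosen = o
            break
    # Move the chosen OSD to the front; the rest keep their CRUSH order.
    return [chosen] + [o for o in osds if o != chosen]
```

Calling @apply_primary_affinity(x, ["a", "b", "c"], {"a": 0.5})@ always returns the same ordering for a given @x@, and the set of OSDs is unchanged, so only the primary role moves and no data has to.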

There will be CLI commands to adjust this value:

p((. ceph osd primary-affinity osd.23 .5

h3. Work items

h4. Coding tasks

# create a feature bit for this feature
# osdmap: add the field, add it to the encoding
## this will be somewhat tricky: we need to encode the old format if the target encoding does not include the feature
## note that if the feature *is* present, we may consider doing a completely fresh encoding strategy that is more easily maintainable (the current approach kind of sucks!)
## make sure the required_features() method/helper (whatever it is called) indicates when there exists a non-1 primary_affinity
# osdmap: adjust the mapping function to reorder the output of the crush mapping based on the affinity
## note that this should happen close to the crush output, *before* the pg_temp potentially overrides this value.  if there is a pg_temp entry, it should be used as-is regardless of what the primary-affinity is.
# osdmap: add a few simple unit tests that verify that a primary-affinity of 0 means that (in the absence of down/out nodes) an osd is never chosen as the primary
# mon: add cli methods to adjust the primary-affinity
# linux kernel: add support for the new osdmap encoding
# linux kernel: add support for the primary-affinity mapping logic
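The feature-gated encoding in the tasks above could look roughly like the sketch below. It is a toy layout, not the real OSDMap encoding: the bit value of @FEATURE_PRIMARY_AFFINITY@, the version byte, the float packing, and all function names are hypothetical and chosen only to show the old-format fallback and the required-features check.

```python
import struct

FEATURE_PRIMARY_AFFINITY = 1 << 0  # hypothetical bit position


def required_features(primary_affinity):
    """Return the feature bits a peer needs to interpret this map."""
    if any(value != 1.0 for value in primary_affinity):
        return FEATURE_PRIMARY_AFFINITY
    return 0


def encode(osd_weights, primary_affinity, peer_features):
    """Encode per-OSD values; fall back to the legacy layout for old peers."""
    count = len(osd_weights)
    if peer_features & FEATURE_PRIMARY_AFFINITY:
        # New layout (version 2): weights followed by affinities.
        return struct.pack(f"<BI{count}f{count}f", 2, count,
                           *osd_weights, *primary_affinity)
    # Legacy layout (version 1): weights only; affinities are not sent.
    return struct.pack(f"<BI{count}f", 1, count, *osd_weights)


def decode(blob):
    """Decode either layout; a legacy blob implies an affinity of 1 everywhere."""
    version, count = struct.unpack_from("<BI", blob, 0)
    offset = struct.calcsize("<BI")
    weights = list(struct.unpack_from(f"<{count}f", blob, offset))
    if version >= 2:
        offset += struct.calcsize(f"<{count}f")
        affinity = list(struct.unpack_from(f"<{count}f", blob, offset))
    else:
        affinity = [1.0] * count
    return weights, affinity
```

A peer without the feature bit still receives a map it can parse, which is why @required_features@ has to report the bit whenever any affinity differs from 1: such a peer would otherwise compute a different primary.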

h4. Documentation tasks

# document the feature!