shishir gowda, 06/10/2015 02:26 AM

h1. Tiering-enhancement

*Summary*

Ceph Tiering provides fine-grained administrative control over the placement of data within Ceph cached pools.

Currently, when Ceph joins pools (typically an upper and a lower pool), it forms a cache, i.e., data moves between the pools "on demand". Data is demoted from the upper pool into the lower pool based solely upon the need to free space in the upper pool (e.g., when it is "full"). Conversely, data is promoted from the lower pool into the upper pool when required to service a particular RADOS operation (e.g., read, write, append).

This tiering proposal permits the identification and tagging of data so as to optimize the operation of the cache. In this context, optimization means promoting or demoting data based on criteria other than need by the system (i.e., the caching behavior described above). More specifically, tiering allows optimizations such as creating objects directly in the lower tier (rather than creating them in the upper tier and having the system eventually demote them when the upper tier fills up), or demoting data at a specific time.

At the implementation level, each RADOS object has an associated policy that provides specific action hints for that object. RADOS uses that policy to help optimize the movement of objects between the pools. However, the policy is only a hint and does not in any way suppress or modify the on-demand caching behavior described above. This means that if the upper pool is full, RADOS will demote objects to the lower pool as it sees fit, i.e., without regard to any tiering policy. Naturally, a more sophisticated implementation of the caching algorithms could look at the tiering policy to make better decisions.

*Owners*

Allen Samuels <Allen.Samuels@sandisk.com>
Chaitanya Huilgol <Chaitanya.Huilgol@sandisk.com>
Shishir Gowda <Shishir.Gowda@sandisk.com>

*Interested Parties*
If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.
Name (Affiliation)
Name (Affiliation)
Name

*Current Status*
Please describe the current status of Ceph as it relates to this blueprint.  Is there something that this replaces?  Are there current features that are related?

*Detailed Description*

h4. Policy Specification

RADOS searches three places for an object's policy: per-object, per-pool and global, in that order. Once a policy is found, the search stops and policies "higher up" in the search order have no effect. First, RADOS looks for a per-object policy (stored as a well-known xattr). If no per-object policy is found, RADOS looks at the pool which contains the object. If no per-pool policy is found, RADOS uses the global policy, which is always present because a default policy that cannot be erased is compiled into RADOS.
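
The search order can be illustrated with a minimal Python sketch. This is not OSD code: the function name @resolve_policy@, the @Policy@ alias and the contents of @DEFAULT_GLOBAL_POLICY@ are assumptions made for illustration only.

```python
from typing import Dict, Optional

# A policy is modeled as a map of strings to strings.
Policy = Dict[str, str]

# Hypothetical stand-in for the default policy compiled into RADOS;
# its real contents are not specified by this proposal.
DEFAULT_GLOBAL_POLICY: Policy = {"Name": "Standard"}


def resolve_policy(
    object_policy: Optional[Policy],
    pool_policy: Optional[Policy],
    global_policy: Policy = DEFAULT_GLOBAL_POLICY,
) -> Policy:
    """Return the first policy found: per-object, then per-pool, then global."""
    if object_policy is not None:
        return object_policy  # per-object policy found, search stops here
    if pool_policy is not None:
        return pool_policy    # no per-object policy, fall back to the pool
    return global_policy      # always present, so the search cannot fail
```

Passing @None@ models "no policy set at this level"; an object without its own policy in a pool without a policy therefore ends up with the global default.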

Modification of the per-pool and global policies is not synchronized and need not trigger a policy event, though a sweep could optionally be initiated.

A policy is a set of name-value pairs. Internally, it is a map of strings to strings. Hereinafter, we refer to this map as the policy object.

h4. Policy Function

A policy function is a function which takes a set of inputs and returns an action. Policy functions are "pure" in the computer science sense: they do not rely on any external state beyond what is passed into them, and thus always return the same result for the same inputs.

PolicyObject["Name"] specifies the policy function. Some values of the policy name are compiled into the OSD code. If the policy name is not one of the pre-compiled set of functions, it is treated as a RADOS class name.

Each policy function takes three sets of inputs. The first set is the policy object itself. The second set is metadata about the object (TBD). The third set is metadata about the pool in which the object currently resides. [??? How hard is it to have a fourth set of inputs which is metadata about the OTHER pool(s)?]

The result of the selected policy function is either the empty string, meaning that no movement is requested, or the name of the pool into which the object should be moved. If the name does not match any reachable pool, it is an error which can be ignored.
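
The selection and result handling described above might look like the following Python sketch. Everything named here is hypothetical: the @PRECOMPILED@ table, the @never_move@ function, and the @rados_classes@ dictionary that stands in for RADOS class lookup. The metadata inputs are modeled as plain dictionaries because their exact contents are still TBD.

```python
from typing import Callable, Dict, Optional, Set

Policy = Dict[str, str]
Metadata = Dict[str, str]
# (policy object, object metadata, pool metadata) -> "" or a pool name
PolicyFunction = Callable[[Policy, Metadata, Metadata], str]


def never_move(policy: Policy, obj: Metadata, pool: Metadata) -> str:
    """Hypothetical trivial policy function: never request a move."""
    return ""


# Hypothetical table standing in for the functions compiled into the OSD.
PRECOMPILED: Dict[str, PolicyFunction] = {"NeverMove": never_move}


def evaluate(
    policy: Policy,
    obj: Metadata,
    pool: Metadata,
    reachable_pools: Set[str],
    rados_classes: Optional[Dict[str, PolicyFunction]] = None,
) -> str:
    """Run the function named by policy["Name"] and validate its result."""
    name = policy["Name"]
    if name in PRECOMPILED:
        func = PRECOMPILED[name]
    else:
        # Not pre-compiled: treat the name as a RADOS class name.
        func = (rados_classes or {})[name]
    target = func(policy, obj, pool)
    if target and target not in reachable_pools:
        return ""  # unknown pool: an error that can be ignored
    return target
```

Because the function receives all of its inputs as arguments and reads no other state, calling it twice with the same arguments gives the same answer, which is the "pure" property required above.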

h4. Pool Relative Names

Pool names can be absolute, e.g., fred, foo, barney, etc. Absolute pool names correspond directly to RADOS pool names. Pool names can also be relative, e.g., $cache and $base, which refer to a pool's location in the cache hierarchy. Generally, policies are specified using relative names so as to maximize configuration independence.

If pool name resolution fails, it is currently handled as an error. Similarly, if the specified pool does not belong to the cache hierarchy, an error is triggered. A stub redirection implementation could later be used to support such configurations.
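
A possible shape for name resolution is sketched below in Python. The @hierarchy@ mapping (relative name to absolute pool name), the @PoolResolutionError@ exception and the function name are assumptions for illustration; the proposal does not define how the hierarchy is represented.

```python
from typing import Dict, Set


class PoolResolutionError(Exception):
    """Raised when a pool name cannot be resolved (hypothetical error type)."""


def resolve_pool_name(
    name: str, hierarchy: Dict[str, str], existing_pools: Set[str]
) -> str:
    """Turn a relative ("$cache") or absolute ("fred") name into a pool name."""
    if name.startswith("$"):
        if name not in hierarchy:
            raise PoolResolutionError(f"unknown relative pool name: {name}")
        resolved = hierarchy[name]
    else:
        resolved = name  # absolute names map directly to RADOS pool names
    if resolved not in existing_pools:
        raise PoolResolutionError(f"no such pool: {resolved}")
    if resolved not in hierarchy.values():
        # The pool exists but is not part of this cache hierarchy.
        raise PoolResolutionError(f"pool outside cache hierarchy: {resolved}")
    return resolved
```

With @hierarchy = {"$cache": "fred", "$base": "barney"}@, the name @$base@ resolves to @barney@, while an existing but unrelated pool such as @foo@ is rejected.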

The second set of inputs, the per-object information, should include the following:

pool  -- Name of the pool the object is currently in.
mtime -- Most recent time the contents of this object were modified by a RADOS client. Only data and omap are considered. Motion between tiers explicitly doesn't affect this value.
btime -- Birth time of this object.
ttime -- Time at which this object was moved into this pool (~= ctime for objects that have never been moved).
size  -- Size of this object (in bytes).
reason -- Reason that this object is in this tier. Reasons include: REASON_READ, REASON_WRITE, REASON_CREATE and REASON_POLICY. REASON_READ and REASON_WRITE indicate movement into this tier on demand by RADOS itself. REASON_POLICY indicates that movement was initiated by a policy. REASON_CREATE is the special case of creating an object.

The third set of inputs describes the pool in which this object resides:

name -- Name of the pool.
full_ratio -- How "full" this pool is.

h4. Pre-compiled Policy Functions

PolicyObject["Name"] == "Standard"

This is the main policy and is expected to cover the vast majority of use cases.

This policy operates by applying one or more timeout values based on the current pool and state of the object. If the timeout values are satisfied, the policy directs that the object be moved to another tier.

Timeout values are specified as name-value pairs in the policy object. The name indicates the state to which the timeout applies, and the value gives the duration of the timeout. If the timeout is satisfied, the value of a similarly named key is returned as the result of the policy function (i.e., it indicates which tier to move the object to).

Generally, the keys are named by concatenating a few fields, e.g., the current pool, the "reason" field, etc.

Here is a listing:

<pool>.Read.Duration		Maximum time that an object can live in the named pool after it was placed there due to an on-demand read operation.
<pool>.Read.EvictPool		Pool to move the object to if the <pool>.Read.Duration timeout is triggered.
<pool>.Write.Duration		Maximum time that an object can live in the named pool after it was placed there due to an on-demand write operation (mtime).
<pool>.Write.EvictPool		Pool to move the object to if the <pool>.Write.Duration timeout is triggered.
<pool>.Create.Duration		Maximum time that an object can live in the named pool after it was placed there due to creation.
<pool>.Create.EvictPool		Pool to move the object to if the <pool>.Create.Duration timeout is triggered.
<pool>.Policy.Duration		Maximum time that an object can live in the named pool after it was placed there due to a policy-driven operation.
<pool>.Policy.EvictPool		Pool to move the object to if the <pool>.Policy.Duration timeout is triggered.

//
// Here is an example of a policy that directs objects to be created in the base tier
//
$cache.Policy.Duration 0		// Don't allow creation in this tier
$cache.Policy.EvictPool $base

//
// Here is an example of a policy that directs objects to be created in the base tier and evicts them after 10 units if promoted due to a read operation
//
$cache.Read.Duration 10			// If promoted due to a read, evict after 10 units.
$cache.Read.EvictPool  $base		// base is the destination pool
$cache.Policy.Duration 0		// Don't allow creation in this tier
$cache.Policy.EvictPool $base
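
To make the timeout evaluation concrete, here is a rough Python sketch of how a "Standard" policy function could be evaluated. Several details are assumptions not settled by this proposal: the current time is passed in as an explicit input (so the function stays pure), the object's pool is given by its relative name, the timer starts at mtime for writes and at ttime otherwise, and each reason maps to the key segment shown in @REASON_TO_KEY@. The pool metadata input is omitted because this sketch does not use it.

```python
from dataclasses import dataclass
from typing import Dict

# Assumed mapping from the "reason" field to the key segment in the listing.
REASON_TO_KEY = {
    "REASON_READ": "Read",
    "REASON_WRITE": "Write",
    "REASON_CREATE": "Create",
    "REASON_POLICY": "Policy",
}


@dataclass(frozen=True)
class ObjectInfo:
    pool: str     # relative name of the current pool, e.g. "$cache" (assumption)
    mtime: float  # last modification by a RADOS client
    btime: float  # birth time
    ttime: float  # time the object was moved into this pool
    size: int     # bytes
    reason: str   # one of the REASON_* values


def standard_policy(policy: Dict[str, str], obj: ObjectInfo, now: float) -> str:
    """Return the pool to move the object to, or "" if no move is requested."""
    prefix = f"{obj.pool}.{REASON_TO_KEY[obj.reason]}"
    duration = policy.get(f"{prefix}.Duration")
    if duration is None:
        return ""  # no timeout configured for this pool and reason
    # Assumption: writes are timed from mtime, everything else from ttime.
    start = obj.mtime if obj.reason == "REASON_WRITE" else obj.ttime
    if now - start >= float(duration):
        return policy.get(f"{prefix}.EvictPool", "")
    return ""


policy = {"$cache.Read.Duration": "10", "$cache.Read.EvictPool": "$base"}
promoted = ObjectInfo("$cache", mtime=0, btime=0, ttime=100, size=4096,
                      reason="REASON_READ")
early = standard_policy(policy, promoted, now=105)  # timeout not yet reached
late = standard_policy(policy, promoted, now=110)   # 10 units have elapsed
```

In this run @early@ is the empty string and @late@ is @$base@. A duration of 0 makes the timeout fire on the first evaluation, which is how the examples above keep objects out of a tier.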

h4. OSD Implementation of Policies

The global policy should be configurable through the normal parameter/config file mechanism.

The per-pool policy should be configurable. [??? Presumably it can be attached to some object inside the pool itself, e.g. pg_pool_t ??]

h4. RADOS modification

RADOS operations:

Read and write a policy for a pool.

Read and write a policy for an object.

There should be a special RADOS mechanism for attaching a policy to an object as it is being created.

h4. RGW IMPLEMENTATION

In the RGW world, the RADOS policy is expressed as a well-known HTTP header. The policy for an object may be read, written and/or modified using any of the standard metadata access mechanisms for an RGW object. Note that the policy is conceptually at the RGW level, meaning that when an RGW object is deconstructed into multiple RADOS objects, all of those RADOS objects will have the same policy. [Do we want directory listings to have policies in them??]

If the request that creates an object does not carry a policy header, RGW will use a set of per-bucket policies as described below. If none of the per-bucket policies can be applied to the object, RGW will leave the RADOS policy attribute unset, meaning that RADOS will apply the default per-pool and/or global policy to that object.

h4. BUCKET Level Policies

Associated with each bucket is a vector of possible policies. Each element of the vector has a matching section and a policy section. When an object is created, RGW searches the vector (in priority order from 0 to n-1) looking for an element that matches the object being created. When a matching element is found, RGW places the corresponding policy section on that RGW object (just as if the policy had been provided in the metadata headers which created the object) and terminates the policy search.

A matching section consists of a regex (details TBD) and a range of sizes. If the URL for the object (without the bucket ??) matches AND the size of the object is within the specified range, the matching section is considered matched. The policy section consists of the RADOS policy for this RGW object.
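
The vector search can be sketched in Python as follows. The @BucketPolicyRule@ type and @match_bucket_policy@ function are hypothetical names. Since the regex details are TBD, this sketch assumes Python regex syntax, a full match against the object name without the bucket, and an inclusive size range.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class BucketPolicyRule:
    pattern: str            # matching section: regex (syntax assumed here)
    min_size: int           # matching section: inclusive lower bound, bytes
    max_size: int           # matching section: inclusive upper bound, bytes
    policy: Dict[str, str]  # policy section: the RADOS policy to attach


def match_bucket_policy(
    rules: List[BucketPolicyRule], object_name: str, size: int
) -> Optional[Dict[str, str]]:
    """Return the policy of the first matching rule, searching from index 0."""
    for rule in rules:
        name_ok = re.fullmatch(rule.pattern, object_name) is not None
        size_ok = rule.min_size <= size <= rule.max_size
        if name_ok and size_ok:
            return rule.policy  # first match wins, search terminates
    return None  # no match: the RADOS policy attribute is left unset
```

Returning @None@ corresponds to the case where RGW leaves the policy attribute unset, so that RADOS falls back to the per-pool or global policy.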

Bucket-level policies are directly modifiable by user programs, just like any other bucket-level metadata. Changes to bucket-level matching rules only affect the future creation of objects. Objects which have previously been created are unaffected by the change. Perhaps in the future we can create some kind of re-scan process that updates all of the policies for the objects in a bucket more efficiently.

RGW will create specially named objects to hold per-bucket policies. These will be XML-style blob objects which can be accessed either through the standard PUT/GET interface or through bucket ops similar to GetBucketAccessControlPolicy.

h4. To be investigated

1. Investigate race effects of promotion of objects due to policy rather than by the cache tier.
2. How do we handle updates of objects written directly to the lower/base tier?

*Work items*

*Coding tasks*
Task 1
Task 2
Task 3

*Build / release tasks*
Task 1
Task 2
Task 3

*Documentation tasks*
Task 1
Task 2
Task 3

*Deprecation tasks*
Task 1
Task 2
Task 3