Ceph Tiering provides fine-grained administrative control over the placement of data within Ceph cached pools.
Currently, when Ceph joins pools (typically an upper and a lower) it forms a cache, i.e., data moves between the pools "on demand"; data is demoted from an upper pool into a lower pool based solely upon the need to create unused space in the upper pool (e.g., when it's "full"). Conversely, data is promoted from the lower into the upper pool when required to service a particular RADOS operation (e.g., read, write, append, etc.).
This tiering proposal permits the identification and tagging of data so as to optimize the operation of the cache. In this context, optimization means promoting or demoting data based on criteria other than need by the system (i.e., the caching behavior described above). More specifically, tiering allows optimizations such as creating objects directly in the lower tier (rather than creating them in the upper tier and having the system eventually demote them when the upper tier fills up), or demoting data at a specific time.
At the implementation level, each RADOS object has associated with it a policy that provides specific action hints for that object. RADOS uses that policy to assist it in optimizing the movement of objects between the pools. However, the policy is only a hint and does not in any way suppress or modify the on-demand caching behavior described above. This means that if the upper pool is full, RADOS will demote objects to the lower pool as it sees fit, i.e., without regard to any tiering policy. Naturally, a more sophisticated implementation of the caching algorithms could very well look at the tiering policy to make better decisions.
For each object, RADOS has three places where it searches for a policy for that object. Once a policy is found the search stops and other "higher up" policies in the search order have no effect. The three places are: global, per-pool and per-object. First RADOS looks for a per-object policy (stored as a well-known xattr). If no per-object policy is found then RADOS looks at the pool which contains the object. If no per-pool policy is found then RADOS uses the global policy which is always present, because there is a default policy compiled into RADOS which can't be erased.
Modification of the per-pool and global policy is not synchronized and need not trigger a policy event, though a sweep could optionally be initiated.
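The three-level lookup order described above can be sketched as follows. This is only an illustration under stated assumptions: a policy is modeled as a string-to-string map, and `find_policy` and `COMPILED_DEFAULT_POLICY` are hypothetical names that do not exist in RADOS.

```python
from typing import Dict, Optional

Policy = Dict[str, str]  # a policy is a map of strings to strings

# Hypothetical stand-in for the default policy compiled into RADOS.
COMPILED_DEFAULT_POLICY: Policy = {"Name": "Standard"}


def find_policy(
    object_policy: Optional[Policy],
    pool_policy: Optional[Policy],
    global_policy: Optional[Policy] = None,
) -> Policy:
    """Return the first policy found: per-object, then per-pool, then global."""
    if object_policy is not None:
        return object_policy  # per-object policy wins; the search stops here
    if pool_policy is not None:
        return pool_policy  # otherwise use the policy of the containing pool
    # The global policy is always present because of the compiled-in default.
    return global_policy if global_policy is not None else COMPILED_DEFAULT_POLICY
```

Because the search stops at the first hit, a per-object policy completely shadows the per-pool and global policies rather than being merged with them.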
A policy is a set of name-value pairs. Internally, it's a map of strings to strings. Hereinafter, we will refer to this map as the policy object.
A policy function is a function which takes a set of inputs and returns an action. Policy functions are "pure" in the computer science sense in that they don't rely on any external state beyond what is passed into them and thus will always return the same result for the same inputs.
PolicyObject["Name"] specifies the policy function. Some values of the policy name are compiled into the OSD code. If the policy name is not one of the pre-compiled set of functions then it is treated as a RADOS class name.
Each policy function takes three sets of inputs. The first set is the policy object itself. The second set of inputs is metadata about the object itself (TBD). The third set of inputs is metadata about the pool in which the object currently resides. [??? How hard is it to have a fourth set of inputs which is metadata about the OTHER pool(s)].
The result of the selected policy function is either the empty string, meaning that no movement is requested, OR the name of the pool into which this object should be moved. If the name doesn't match any reachable pool, then it's an error which can be ignored.
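A minimal sketch of how a policy function might be selected and its result interpreted, assuming the three input sets are plain dictionaries. The registry `PRECOMPILED`, the sample policies `NeverMove` and `MoveToTarget`, and the `Target` key are all hypothetical and exist only for illustration; loading a RADOS class for an unknown name is not modeled.

```python
from typing import Any, Callable, Dict, Optional, Set

Policy = Dict[str, str]
Metadata = Dict[str, Any]
PolicyFunction = Callable[[Policy, Metadata, Metadata], str]


def never_move(policy: Policy, obj: Metadata, pool: Metadata) -> str:
    """Hypothetical policy: always request no movement (empty string)."""
    return ""


def move_to_target(policy: Policy, obj: Metadata, pool: Metadata) -> str:
    """Hypothetical policy: request a move to the pool named by 'Target'."""
    return policy.get("Target", "")


# Hypothetical registry standing in for the functions compiled into the OSD.
PRECOMPILED: Dict[str, PolicyFunction] = {
    "NeverMove": never_move,
    "MoveToTarget": move_to_target,
}


def evaluate(
    policy: Policy, obj: Metadata, pool: Metadata, reachable_pools: Set[str]
) -> Optional[str]:
    """Return the destination pool, or None if no movement should happen."""
    fn = PRECOMPILED.get(policy.get("Name", ""))
    if fn is None:
        # The text treats unknown names as RADOS class names; not modeled here.
        return None
    target = fn(policy, obj, pool)
    if target == "":
        return None  # empty string: no movement requested
    if target not in reachable_pools:
        return None  # unreachable pool: an error that can be ignored
    return target
```

Each policy function depends only on its arguments, which keeps it pure in the sense described above.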
Pool Relative Names
Pool names can be absolute, e.g., fred, foo, barney, etc. Absolute pool names correspond directly to RADOS pool names. Pool names can also be relative, e.g., $cache and $base, which refer to a pool's location in the cache hierarchy. Generally, policies are specified using relative names so as to maximize configuration independence.
If pool name resolution fails, it is currently handled as an error. Similarly, if the specified pool does not belong to the cache pool, an error is triggered. We could later use a stub redirection implementation to support such configurations.
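The resolution rules above could look roughly like this. It is a sketch only: the `roles` mapping from relative names to absolute pool names, the `members` list of pools in the cache hierarchy, and the `PoolResolutionError` type are assumptions introduced for the example.

```python
from typing import Dict, List


class PoolResolutionError(Exception):
    """Raised when a pool name cannot be resolved within the cache hierarchy."""


def resolve_pool(name: str, roles: Dict[str, str], members: List[str]) -> str:
    """Map a relative ($cache, $base) or absolute name to a RADOS pool name."""
    if name.startswith("$"):
        if name not in roles:
            raise PoolResolutionError(f"unknown relative pool name: {name}")
        name = roles[name]  # relative name: translate to the absolute pool name
    if name not in members:
        # Pools outside the cache hierarchy are rejected for now.
        raise PoolResolutionError(f"pool is not part of the cache: {name}")
    return name
```

With `roles = {"$cache": "fred", "$base": "barney"}`, a policy written against `$cache` and `$base` keeps working if the underlying pools are renamed, which is the configuration independence mentioned above.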
The second set of per-object information should include the following:
pool -- name of pool object is currently in.
mtime -- Most recent time the contents of this object were modified by a RADOS client. Only data and omap are considered. Motion between tiers explicitly doesn't affect this value.
btime -- birth time of this object
ttime -- time at which this object was moved into this pool (~= ctime for objects that have never been moved)
size -- Size of this object (in bytes)
reason -- Reason that this object is in this tier. Reasons include: REASON_READ, REASON_WRITE, REASON_CREATE and REASON_POLICY. REASON_READ and REASON_WRITE indicate movement into this tier on-demand by RADOS itself. REASON_POLICY indicates that movement was initiated by a policy. REASON_CREATE is the special case of creating an object.
The third set of inputs is information about the pool in which this object resides:
name -- name of the pool
full_ratio -- How "full" is this pool.
Pre-compiled Policy Functions
PolicyObject["Name"] == "Standard"
This is the main policy that ought to cover 99% of the use-cases.
This policy operates by applying one or more timeout values based on the current pool and state of the object. If the timeout values are satisfied, then the policy directs that the object be moved to another tier.
Timeout values are specified as name-value pairs in the policy object. The name of a pair indicates the state to which the timeout applies, and its value gives the actual duration of the timeout. If the timeout is satisfied, then the value of a similarly named key is returned as the result of the policy function (i.e., it indicates which tier to move the object to).
Generally, the keys are named by concatenating a few fields, e.g. the current pool, the "reason" field, etc.
Here is a listing:
<pool>.Read.Duration Maximum time that an object can live in the named pool after it was placed there due to an on-demand read operation.
<pool>.Read.EvictPool Pool to move object to if <pool>.Read.Duration timeout is triggered
<pool>.Write.Duration Maximum time that an object can live in the named pool after it was placed there due to an on-demand write operation (mtime)
<pool>.Write.EvictPool Pool to move object to if <pool>.Write.Duration timeout is triggered
<pool>.Create.Duration Maximum time that an object can live in the named pool after it was placed there due to creation.
<pool>.Create.EvictPool Pool to move object to if <pool>.Create.Duration timeout is triggered
<pool>.Policy.Duration Maximum time that an object can live in the named pool after it was placed there due to a policy driven operation
<pool>.Policy.EvictPool Pool to move object to if <pool>.Policy.Duration timeout is triggered
// Here is an example of a policy that directs objects to be created in the base tier
$cache.Policy.Duration 0 // Don't allow creation in this tier
// Here is an example of a policy that directs objects to be created in the base tier and evicts them after 10 units if promoted due to a read operation
$cache.Read.Duration 10 // If promoted due to a read, evict after 10 units.
$cache.Read.EvictPool $base // base is the destination pool
$cache.Policy.Duration 0 // Don't allow creation in this tier
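One possible reading of the "Standard" policy, sketched under several assumptions: the object's `pool` field uses the same name form as the keys (e.g., `$cache`), every timeout is measured from `ttime` (the `(mtime)` note on the Write key is not modeled), the current time is passed in explicitly so the function stays pure, and a missing EvictPool key yields no movement. The name `standard_policy` and the `REASON_KEY` table are hypothetical.

```python
from typing import Any, Dict

# Maps the per-object "reason" field to the key component used in the policy.
REASON_KEY = {
    "REASON_READ": "Read",
    "REASON_WRITE": "Write",
    "REASON_CREATE": "Create",
    "REASON_POLICY": "Policy",
}


def standard_policy(policy: Dict[str, str], obj: Dict[str, Any], now: float) -> str:
    """Return the pool to move the object to, or "" for no movement."""
    prefix = f"{obj['pool']}.{REASON_KEY[obj['reason']]}"
    duration = policy.get(f"{prefix}.Duration")
    if duration is None:
        return ""  # no timeout configured for this pool and reason
    if now - float(obj["ttime"]) < float(duration):
        return ""  # timeout not yet satisfied
    # Timeout satisfied: the similarly named EvictPool key is the result.
    return policy.get(f"{prefix}.EvictPool", "")


policy = {
    "$cache.Read.Duration": "10",
    "$cache.Read.EvictPool": "$base",
}
promoted = {"pool": "$cache", "reason": "REASON_READ", "ttime": 100.0}
early = standard_policy(policy, promoted, now=105.0)  # still within 10 units
late = standard_policy(policy, promoted, now=110.0)   # timeout reached
```

Here `early` is the empty string and `late` is `$base`, matching the second example above for an object promoted by a read.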
OSD Implementation of Policies
The global policy should be configurable through the normal parameter/config file mechanism.
The per-pool policy should be configurable ??? I assume we can stick this on some object somewhere inside of the pool itself. (pg_pool_t ??)
Read, write a policy for a pool
Read, write a policy for an object.
There should be special RADOS verbiage for attaching a policy to an object that's being created.
In the RGW world, the RADOS policy is expressed as a well-known HTTP header. The policy for an object may be read, written and/or modified using any of the standard metadata access mechanisms for an RGW object. Note that the policy is conceptually at the RGW level, meaning that when an RGW object is deconstructed into multiple RADOS objects, all of those RADOS objects will have the same policy. [Do we want directory listings to have policies in them??]
If the request that creates an object doesn't carry a per-object policy header, then RGW will use a set of per-bucket policies as described below. If none of the per-bucket policies can be applied to the object, then RGW will leave the RADOS policy attribute unset, meaning that RADOS will apply the default per-pool and/or global policy to that object.
BUCKET Level Policies
Associated with each bucket is a vector of possible policies. Each element of the vector has a matching section and a policy section. When an object is created, RGW searches the vector (in priority order from 0 to n-1) looking for an element that matches the object which is being created. When a matching element is found, RGW places the corresponding policy section on that RGW object (just as if the policy had been provided in the metadata headers which created the object) and terminates the policy search.
A matching section consists of a regex (details TBD) and a range of sizes. If the URL for the object (without the bucket ??) matches AND the size of the object is within the specified range, then this matching section is considered as matched. The policy section consists of the RADOS policy for this RGW object.
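The priority-ordered search can be illustrated with a short sketch. Since the regex details are still TBD, this example assumes Python `re.search` semantics against the object name without the bucket, and an inclusive size range; `BucketRule` and `select_policy` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class BucketRule:
    pattern: str             # matching section: regex (semantics assumed)
    min_size: int            # matching section: inclusive lower size bound
    max_size: int            # matching section: inclusive upper size bound
    policy: Dict[str, str]   # policy section: RADOS policy to attach


def select_policy(
    rules: List[BucketRule], object_name: str, size: int
) -> Optional[Dict[str, str]]:
    """Return the policy of the first matching rule, searching 0 to n-1."""
    for rule in rules:
        name_matches = re.search(rule.pattern, object_name) is not None
        size_matches = rule.min_size <= size <= rule.max_size
        if name_matches and size_matches:
            return rule.policy  # first match wins; the search terminates
    return None  # no match: leave the RADOS policy attribute unset
```

Returning `None` corresponds to the case where RGW leaves the policy attribute unset and RADOS falls back to the per-pool or global policy.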
Bucket-level policies are directly modifiable by user programs just like any other bucket-level metadata. Changes to bucket-level matching rules only affect the future creation of objects. Objects which have previously been created are unaffected by the change in bucket-level matching rules. Perhaps in the future we can create some kind of re-scan process that will update all of the policies for some bucket of objects more efficiently.
RGW will create specially named objects to hold per-bucket policies. These will be XML-style blob objects which can be accessed either through the standard PUT/GET interface or through bucket ops similar to GetBucketAccessControlPolicy.
To be investigated
1. Investigate race effects of promotion of objects due to Policy, rather than from cache tier
2. How do we handle updates of objects written directly to lower/base tier