Project

General

Profile

Osd - Tiering II (Warm->Cold)

Summary

Cache/tiering can be useful in cases where placing the slower storage on another RADOS tier in the same cluster is beneficial. However, there are use cases where even that would be too expensive. For those cases, it might be useful for Ceph to be able to demote sufficiently frosty objects to some other service through a plugin abstraction.

Owners

  • Sam Just (Red Hat)
  • Name (Affiliation)
  • Name

Interested Parties

  • Loic Dachary (Red Hat)
  • Name (Affiliation)
  • Name

Current Status

Detailed Description

The idea is to extend the current cache/tiering infrastructure in two ways:
1) Add a new flavor of hot tier which, instead of using the absence of an object to indicate that the next lower tier needs to be checked, uses explicit link objects to redirect to the lower tier. The hot tier thus acts as a metadata cache of sorts (a rough sketch of what such a link object might hold follows this list).
2) Create an abstraction for the lower-tier machinery into which an implementation can be plugged.
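
As a rough illustration of point 1, a redirect (link) object in the hot tier would need to carry little more than the backend identifier plus whatever metadata lets the OSD answer stat and snapshot-trim operations without fetching the demoted object. The structure and field names below are hypothetical -- nothing like this exists in Ceph today:

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical shape of a redirect ("link") object kept in the hot tier.
struct tier_redirect_t {
  std::string backend_type;        // which TierImplementation plugin holds the data
  std::string backend_object_id;   // opaque JSON identifier returned by the plugin
  uint64_t object_size = 0;        // basic metadata served directly from the hot tier
  uint64_t mtime_sec = 0;          // last modification time of the demoted object
  std::vector<uint64_t> snaps;     // relevant snapshot ids, if snapshots are allowed
};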

A first, zero-thought crack at an interface (any actual version would be in C):


// Sketch only; bufferlist and Context are the usual Ceph types
// (include/buffer.h, include/Context.h).
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
using std::map;
using std::string;

class TierImplementation {
public:
  class ObjectStream {
  public:
    virtual ~ObjectStream() {}

    /**
     * get more bytes from the object
     *
     * @param max [in] max bytes to get
     * @param out [out] next bytes of the object
     * @return 0 on success, -error on error, -ERANGE when complete
     */
    virtual int get_bytes(size_t max, bufferlist *out) = 0;

    /**
     * get more bytes from the omap
     *
     * @param max [in] max bytes to get
     * @param out [out] map<string, bufferlist> encoding of next keys
     * @return 0 on success, -error on error, -ERANGE when complete
     */
    virtual int get_omap(size_t max, bufferlist *out) = 0;

    /**
     * get xattrs
     *
     * @param out [out] attrs
     * @return 0 on success, -error on error
     */
    virtual int get_xattrs(map<string, bufferlist> *out) = 0;
  };

  typedef uint64_t write_handle_t;
  typedef uint64_t read_handle_t;
  typedef string backend_object_id_t;

  virtual ~TierImplementation() {}

  /// plugin type name (static and pure virtual cannot be combined, so plain virtual here)
  virtual const char *get_implementation_type() const = 0;

  /**
   * Begin object creation operation
   *
   * @param out [out] backend id -- json -- globally unique
   * @return demotion operation handle
   */
  virtual write_handle_t open_object(backend_object_id_t *out) = 0;

  /**
   * Append bytes to open object
   *
   * @param id [in] write operation handle
   * @param bytes [in] bytes to append
   * @param c [in] transfers ownership, callback on completion
   * @return -error, 0 on success
   */
  virtual int append_bytes(write_handle_t id, bufferlist bytes, Context *c) = 0;

  /**
   * Append omap keys
   *
   * @param id [in] write operation handle
   * @param bytes [in] map<string, bufferlist> encoding keys to add
   * @param c [in] transfers ownership, callback on completion
   * @return -error, 0 on success
   */
  virtual int append_omap(write_handle_t id, bufferlist bytes, Context *c) = 0;

  /**
   * Add xattrs and close object
   *
   * @param id [in] write handle to close
   * @param attrs [in] xattrs for object
   * @param c [in] transfers ownership, callback on completion
   * @return -error, 0 on success
   */
  virtual int close_object(write_handle_t id, const map<string, bufferlist> &attrs,
                           Context *c) = 0;

  /**
   * remove specified object
   *
   * @param id [in] object to remove
   * @param c [in] transfers ownership, callback on completion
   * @return -error, 0 on success
   */
  virtual int remove_object(const backend_object_id_t &id, Context *c) = 0;

  /**
   * open object for reading
   *
   * @param id [in] backend id
   * @return read operation id
   */
  virtual read_handle_t open_read_operation(backend_object_id_t id) = 0;

  /**
   * read bytes
   *
   * @param id [in] read operation id
   * @param amount [in] max to read
   * @param out [out] target for read bytes
   * @param c [in] transfers ownership, callback on completion
   * @return -error, 0 on success, -ERANGE when complete
   */
  virtual int read_bytes(
      read_handle_t id, size_t amount, bufferlist *out, Context *c) = 0;

  /**
   * read omap
   *
   * @param id [in] read operation id
   * @param amount [in] max to read
   * @param out [out] target for read keys, map<string, bufferlist> encoding
   * @param c [in] transfers ownership, callback on completion
   * @return -error, 0 on success, -ERANGE when complete
   */
  virtual int read_omap(
      read_handle_t id, size_t amount, bufferlist *out, Context *c) = 0;

  /**
   * read xattrs
   *
   * @param id [in] read operation id
   * @param out [out] target for xattrs
   * @param c [in] transfers ownership, callback on completion
   * @return -error, 0 on success
   */
  virtual int read_xattrs(
      read_handle_t id, map<string, bufferlist> *out, Context *c) = 0;

  /**
   * close read operation
   *
   * @param id [in] read operation id
   */
  virtual void close_read_operation(read_handle_t id) = 0;
};

The main features of the above interface are:
1) The backend neither knows nor cares what the object is. The interface user gets an uninterpreted JSON identifier, which is the only way to get at the object in the backend. The rationale for an arbitrary JSON blob is that the backend might have placement information it needs to include.
2) Writes are append-only; there is no way to rewrite an existing id.
3) Reads are stream-based (no seeking) and stateful -- open, read, close. The rationale is that some backends might benefit from knowing when we are and are not done with an object.
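
To make points 2 and 3 concrete, here is a rough usage sketch of the write and read flows against the interface above. It pretends the calls complete synchronously and uses a no-op completion; the helper names (demote_object, promote_object, C_Noop) are illustrative only:

// Trivial completion used only for this sketch; a real caller would track
// and wait on these callbacks.
struct C_Noop : public Context {
  void finish(int) override {}
};

// Demotion: writes are append-only -- open, append, close with xattrs.
void demote_object(TierImplementation *backend,
                   const bufferlist &data,
                   const std::map<std::string, bufferlist> &xattrs,
                   TierImplementation::backend_object_id_t *out_id) {
  TierImplementation::write_handle_t h = backend->open_object(out_id);
  backend->append_bytes(h, data, new C_Noop);
  backend->close_object(h, xattrs, new C_Noop);
}

// Promotion: reads are stateful and stream-based -- open, read until
// -ERANGE, close.
void promote_object(TierImplementation *backend,
                    const TierImplementation::backend_object_id_t &id,
                    bufferlist *out) {
  TierImplementation::read_handle_t h = backend->open_read_operation(id);
  int r = 0;
  while (r == 0) {
    bufferlist chunk;
    r = backend->read_bytes(h, 4 << 20, &chunk, new C_Noop);
    if (r == 0)
      out->append(chunk);
  }
  backend->close_read_operation(h);
}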

Questions:
1) How do we (or do we want to?) deal with backends like Amazon Glacier (or some other tape robot) which might take minutes or hours to fetch an object?
2) How does this interact with snapshots? We could make sure the redirect object contains all relevant snapshot information and still be able to perform trims -- or we could disallow snapshots.
3) What do we do with orphaned puts? Probably we need to write ahead into the pg log an intent to write the object to a particular backend id, and clean up on OSD restart.
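
For question 3, one possibility (sketched below with purely illustrative names -- nothing like this exists in the pg log today) is a small write-ahead intent record: log it before the backend put starts, clear it once the redirect object is durably installed, and have a restarting OSD scan for uncleared intents and call remove_object() on the orphaned backend ids.

#include <cstdint>
#include <string>

// Purely illustrative write-ahead intent for cleaning up orphaned puts.
struct tier_put_intent_t {
  std::string hoid;               // object being demoted (an hobject_t in real code)
  std::string backend_object_id;  // id handed out by open_object()
  uint64_t epoch = 0;             // map epoch when the demotion began
  bool completed = false;         // set once the redirect object is durably installed
};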

Work items

Coding tasks

  1. Task 1
  2. Task 2
  3. Task 3

Build / release tasks

  1. Task 1
  2. Task 2
  3. Task 3

Documentation tasks

  1. Task 1
  2. Task 2
  3. Task 3

Deprecation tasks

  1. Task 1
  2. Task 2
  3. Task 3