OSD - tiering - object redirects


Create a RADOS redirect primitive and methods for making use of it. A redirect should function analogously to a symlink, allowing an object to be moved to a different pool while remaining transparently accessible to clients looking in the old location. This would be the underlying infrastructure to support tiering.


  • Sage Weil (Inktank)

Interested Parties

Current Status

Detailed Description

--- data types ---

origin: original object in original location
target: alternative location of object

new fields for object_info_t:

enum redir_state; ///< [origin, target]
object_locator_t redir_oloc; ///< [origin] locator for target object
eversion_t redir_version; ///< [origin, target] when this redirect was set to this target
u8 flags; ///< [origin]
object_locator_t owner_oloc; ///< [target] locator for the origin
eversion_t owner_user_version; ///< [target] user_version, not version!

where the origin states are:

REDIRECT we are pointing to another object
PROMOTING we are copying the target object back to the origin location
DEMOTING we are copying the origin object to the target location
CLEANUP we have the object, but need to delete the demoted object
DELETING local object is logically non-existent, but we need to clean up target location.
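
As a rough illustration, the fields and states above could be grouped as follows. This is a minimal sketch, not Ceph's actual object_info_t: ObjectLocator and EVersion are simplified stand-ins for object_locator_t and eversion_t, and the names RedirState, RedirectInfo and is_origin_state are assumptions made for this example. The NONE value is not listed above but is implied by the state transitions later in this document.

```cpp
#include <cstdint>
#include <string>

// Simplified stand-ins for object_locator_t and eversion_t (assumptions).
struct ObjectLocator {
  int64_t pool = -1;
  std::string key;
};

struct EVersion {
  uint64_t epoch = 0;
  uint64_t version = 0;
  bool operator==(const EVersion& o) const {
    return epoch == o.epoch && version == o.version;
  }
};

// Redirect states; NONE is assumed to be the default "no redirect" state.
enum class RedirState : uint8_t {
  NONE,
  REDIRECT,   // [origin] pointing to another object
  PROMOTING,  // [origin] copying the target object back to the origin
  DEMOTING,   // [origin] copying the origin object to the target
  CLEANUP,    // [origin] we have the object; demoted copy must be deleted
  DELETING,   // [origin] logically non-existent; target must be cleaned up
  TARGET      // [target] pointed to by an origin
};

// Hypothetical grouping of the proposed object_info_t additions.
struct RedirectInfo {
  RedirState redir_state = RedirState::NONE;
  ObjectLocator redir_oloc;     // [origin] locator for the target object
  EVersion redir_version;       // [origin, target] when the redirect was set
  uint8_t flags = 0;            // [origin]
  ObjectLocator owner_oloc;     // [target] locator for the origin
  EVersion owner_user_version;  // [target] user_version, not version
};

// True for every state that only makes sense on the origin object.
inline bool is_origin_state(RedirState s) {
  return s != RedirState::NONE && s != RedirState::TARGET;
}
```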

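flags are:

PROMOTE_ON_READ promote the object back to the origin location on read
PROMOTE_ON_WRITE promote the object back to the origin location on write
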
- we may want to make PROMOTE_ON_WRITE the only behavior for the initial implementation.

- the demoted object has only 2 states:

TARGET we are pointed to by primary

- primary osd will handle object promote, demote operations (copying to/from alternate location)
- use backend cluster interface to avoid deadlock from throttling ( loic : how can it deadlock from throttling ? sage: hmm, might not be a problem, as long as no recovery operations can block on the redirect state. )

- objecter can also do a SET_REDIRECT operation:
- will erase local object and set redirect metadata

- return redirect metadata with GET_REDIRECT ( loic : without GET_REDIRECT it would transparently try again when receiving an EAGAIN, in the same way an http client would on a 302 ? sage: yeah this is like lstat().. we want to find out if we are a redirect origin or target )

--- osd behavior ---

on read (no flags):
REDIRECT: send EAGAIN with redirect metadata to client
PROMOTING: block or forward. ( loic : what does "forward" mean in this context ? I would understand "block then do the read" )
DELETING: ENOENT

on read (PROMOTE_ON_READ):
NONE, CLEANUP: do the read
DEMOTING: abort the demotion, move to CLEANUP, and do the read
REDIRECT: move to PROMOTING, block then do the read
PROMOTING: block then do the read
DELETING: ENOENT

on write (no flags):
REDIRECT: forward

on write (PROMOTE_ON_WRITE):
DEMOTING: abort the demotion, move to CLEANUP
REDIRECT: move to PROMOTING, block

on delete:
DEMOTING, REDIRECT, PROMOTING, CLEANUP: move to DELETING and queue target object for deletion (as with CLEANUP)
DELETING: no change.

on any op:
TARGET: verify the redir_version matches, or EAGAIN

- if we are doing the redirect request and the target does not exist or the version does not match what the redirect/primary had, retry

- the CLEANUP and DELETING states mean the osd needs to remove the redirect and then transition to NONE or delete (respectively)
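
The read rules above can be written as a small decision function. This is only a sketch of the table in this section, not OSD code: ReadAction, ReadDecision and on_read are hypothetical names, "block or forward" for PROMOTING without flags is arbitrarily resolved to blocking, and any state/flag combination the text does not list is reported as UNSPECIFIED rather than guessed.

```cpp
#include <cstdint>

enum class RedirState : uint8_t {
  NONE, REDIRECT, PROMOTING, DEMOTING, CLEANUP, DELETING, TARGET
};

// What the OSD does with the read (hypothetical names).
enum class ReadAction {
  DO_READ,
  REPLY_EAGAIN_WITH_REDIRECT,  // EAGAIN plus redirect metadata
  BLOCK_THEN_READ,
  REPLY_ENOENT,
  UNSPECIFIED                  // combination not covered by the text
};

struct ReadDecision {
  RedirState next_state;
  ReadAction action;
};

ReadDecision on_read(RedirState s, bool promote_on_read) {
  // DELETING is logically non-existent regardless of flags.
  if (s == RedirState::DELETING)
    return {s, ReadAction::REPLY_ENOENT};

  if (!promote_on_read) {
    switch (s) {
      case RedirState::REDIRECT:
        return {s, ReadAction::REPLY_EAGAIN_WITH_REDIRECT};
      case RedirState::PROMOTING:
        // The text says "block or forward"; this sketch picks block.
        return {s, ReadAction::BLOCK_THEN_READ};
      default:
        return {s, ReadAction::UNSPECIFIED};
    }
  }

  switch (s) {
    case RedirState::NONE:
    case RedirState::CLEANUP:
      return {s, ReadAction::DO_READ};
    case RedirState::DEMOTING:
      // Abort the demotion; the half-written target still needs cleanup.
      return {RedirState::CLEANUP, ReadAction::DO_READ};
    case RedirState::REDIRECT:
      return {RedirState::PROMOTING, ReadAction::BLOCK_THEN_READ};
    case RedirState::PROMOTING:
      return {s, ReadAction::BLOCK_THEN_READ};
    default:
      return {s, ReadAction::UNSPECIFIED};
  }
}
```

Returning the next state together with the action keeps the state change and the reply decision in one place, which is convenient if each transition also has to be recorded as a pg log event.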

--- objecter behavior ---

- send op to normal location
- on EAGAIN with redirect metadata,

- note redirect version
- if this is a retry and version hasn't changed, return error to caller.
- resend op to alternate location, including the primary's eversion_t
- if we get an error (ENOENT on read), retry from the top
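
The retry loop described above might look roughly like this. It is a sketch under several assumptions: Reply, SendOrigin, SendTarget and submit_with_redirect are hypothetical names rather than Objecter APIs, the redirect version is reduced to a single integer, the error returned when no progress is made is assumed to be -EAGAIN, and the max_attempts bound is an extra safety net that the text does not mention.

```cpp
#include <cerrno>
#include <cstdint>
#include <functional>

// Minimal reply model: result is 0 or a negative errno.
struct Reply {
  int result;
  bool has_redirect;          // true when -EAGAIN carries redirect metadata
  uint64_t redirect_version;  // stand-in for redir_version
};

using SendOrigin = std::function<Reply()>;          // op to the normal location
using SendTarget = std::function<Reply(uint64_t)>;  // op to the alternate location

int submit_with_redirect(const SendOrigin& send_origin,
                         const SendTarget& send_target,
                         int max_attempts = 8) {
  bool retried = false;
  uint64_t last_version = 0;

  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    Reply r = send_origin();
    if (!(r.result == -EAGAIN && r.has_redirect))
      return r.result;  // served (or failed) at the origin

    // Same redirect version as on the previous pass: no progress, give up.
    if (retried && r.redirect_version == last_version)
      return -EAGAIN;
    retried = true;
    last_version = r.redirect_version;

    // Resend to the alternate location, passing the origin's version.
    Reply t = send_target(r.redirect_version);
    if (t.result == -ENOENT || t.result == -EAGAIN)
      continue;  // target gone or version mismatch: retry from the top
    return t.result;
  }
  return -EAGAIN;
}
```

Passing the two send paths in as callbacks keeps the loop testable without a cluster; a real objecter would instead resubmit the same op with a different object locator.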

--- pg log events ---

redir_demote_start -- we are now allowed to start copying to target pool. move to DEMOTING
redir_demote_finish -- target is in place; delete local data and set redirect metadata. move to REDIRECT
redir_promote_cleanup -- did copy from target back to origin; still need to clean up old target. move to CLEANUP
redir_cleanup_finish -- old target is cleaned up. move to NONE
redir_delete_start -- can remove target, move to DELETING
remove (existing event) -- finished removing target, delete object.
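
To make the event-to-state mapping concrete, here is a small replay sketch. It is an illustration only: RedirEvent, ReplayResult and replay are hypothetical names, the function applies events without validating that a transition is legal, and it covers only the events listed above (no event for entering PROMOTING is listed, so none is modelled).

```cpp
#include <cstdint>
#include <vector>

enum class RedirState : uint8_t {
  NONE, REDIRECT, PROMOTING, DEMOTING, CLEANUP, DELETING, TARGET
};

// One enumerator per pg log event named in the list above.
enum class RedirEvent {
  DEMOTE_START,     // redir_demote_start
  DEMOTE_FINISH,    // redir_demote_finish
  PROMOTE_CLEANUP,  // redir_promote_cleanup
  CLEANUP_FINISH,   // redir_cleanup_finish
  DELETE_START,     // redir_delete_start
  REMOVE            // existing remove event
};

struct ReplayResult {
  bool exists;       // false once the object has been removed
  RedirState state;
};

// Replays events in order, starting from an existing object in state NONE.
ReplayResult replay(const std::vector<RedirEvent>& log) {
  ReplayResult r{true, RedirState::NONE};
  for (RedirEvent ev : log) {
    switch (ev) {
      case RedirEvent::DEMOTE_START:    r.state = RedirState::DEMOTING; break;
      case RedirEvent::DEMOTE_FINISH:   r.state = RedirState::REDIRECT; break;
      case RedirEvent::PROMOTE_CLEANUP: r.state = RedirState::CLEANUP;  break;
      case RedirEvent::CLEANUP_FINISH:  r.state = RedirState::NONE;     break;
      case RedirEvent::DELETE_START:    r.state = RedirState::DELETING; break;
      case RedirEvent::REMOVE:
        r.exists = false;
        r.state = RedirState::NONE;
        break;
    }
  }
  return r;
}
```

A replay like this is roughly what the per-PG pending sets would need after a restart: any object whose last event leaves it in DEMOTING, CLEANUP or DELETING still has outstanding work.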

--- common races ---

- read vs demote

- if we hit primary while DEMOTING, we get the result
- if we get EAGAIN, we read from the demoted copy

- read vs promote (or read vs demote+promote)

- try primary
- EAGAIN, try alternate location
- result, or ENOENT and back to primary (and block->success or success)

- if PROMOTING, block, then success

--- in-memory osd state ---

For each PG, we maintain:
  • set<Demotion*> redir_demoting; ///< all pending demotions
  • set<Promotion*> redir_promotion; ///< all pending promotions
  • set<Cleanup*> redir_cleanup; ///< all pending cleanups/deletions.

These structs will have a ref to the ObjectContext and will need to orchestrate the push/pull to do the promotion/demotion. They will reuse all of the push/pull helpers used by recovery.

--- snapshots ---
We can start with a simple approach, and add more complex behavior from there.
  1. Force promote-on-write if a non-empty SnapContext is specified. This ensures that all the snap metadata lives in the main pool and makes sense. Similarly, we refuse to demote anything that is snapped.
  2. Allow snaps to be demoted. For the primary pool, recovery needs to be adjusted so that the clone_range stuff falls back to a full copy when the snap is a redirect. In the target pool, recovery needs to behave when we have a subset of the snapset, i.e. just the snapped object. It may be simplest if it is not a snap at all: foo @12 -> foo_$version @nosnap with key foo. And writes/cow never happen in the cold pool.

--- clonerange ---
If a source item for a clonerange is a redirect, block and promote.

Work items

Coding tasks

  1. osd: add object_info_t fields for redirects
  2. add redirect metadata to MOSDOp, MOSDOpReply.
  3. add a feature bit.
  4. osd, objecter, librados, api tests: SET_REDIRECT, GET_REDIRECT operations
  5. osd: basic redirect logic: reply with EAGAIN on primary, verify or EAGAIN on target.
  6. osd: EINVAL or similar if client lacks feature.
  7. objecter: handle EAGAIN redirects
  8. osd: pg log entries to indicate state changes (none -> demoting -> redirect -> promoting -> cleanup, deleting, etc.)
  9. osd: per-PG map of pending redirect states (demoting, promoting, cleanup, tombstone)
  10. osd: log replay to update pending redirect states
  11. osd: support deletion. refactoring to support tombstones.
  12. osd: promote
  13. osd: demote
  14. osd: allow snap

Build / release tasks

  1. add promote/demote to RadosModel

Documentation tasks
