Osd - tiering - object redirects » History » Version 1
Jessica Mack, 06/09/2015 07:25 PM
1 | 1 | Jessica Mack | h1. Osd - tiering - object redirects |
---|---|---|---|
2 | |||
3 | h3. Summary |
||
4 | |||
5 | Create a RADOS redirect primitive and methods for making use of it. A redirect should function analogously to a symlink, allowing an object to be moved to a different pool while remaining transparently accessible to clients looking in the old location. This would be the underlying infrastructure to support tiering. |
||
6 | |||
7 | h3. Owners |
||
8 | |||
9 | * Sage Weil (Inktank) |
||
10 | |||
11 | h3. Interested Parties |
||
12 | |||
13 | * Loic Dachary <loic@dachary.org> |
||
14 | * Sam Just (Inktank) |
||
15 | |||
16 | h3. Current Status |
||
17 | |||
18 | h3. Detailed Description |
||
19 | |||
20 | *--- data types ---* |
||
21 | terminology |
||
22 | |||
23 | p((. origin: original object in original location |
||
24 | target: alternative location of object |
||
25 | |||
26 | new fields for object_info_t: |
||
27 | |||
28 | p((. @enum redir_state; ///< [origin, target] |
||
29 | object_locator_t redir_oloc; ///< [origin] locator for target object |
||
30 | eversion_t redir_version; ///< [origin, target] when this redirect was set to this target |
||
31 | u8 flags; ///< [origin] |
||
32 | object_locator_t owner_oloc; ///< [target] locator for the origin |
||
33 | eversion_t owner_user_version; ///< [target] user_version, not version!@ |
||
34 | |||
35 | where the origin states are: |
||
36 | |||
37 | p((. *NONE* |
||
38 | *REDIRECT* we are pointing to another object |
||
39 | *PROMOTING* we are copying the target object back to the origin location |
||
40 | *DEMOTING* we are copying the origin object to the target location |
||
41 | *CLEANUP* we have the object, but need to delete the demoted object |
||
42 | *DELETING* local object is logically non-existent, but we need to clean up target location. |
||
43 | |||
44 | flags are: |
||
45 | |||
46 | p((. PROMOTE_ON_READ |
||
47 | PROMOTE_ON_WRITE |
||
48 | |||
49 | - we may want to make PROMOTE_ON_WRITE the only behavior for the initial implementation. |
||
50 | |||
51 | - the demoted object has only 2 states: |
||
52 | |||
53 | p((. *NONE* |
||
54 | *TARGET* we are pointed to by primary |
||
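The fields, states, and flags above could be grouped roughly as follows. This is a minimal sketch, not the actual Ceph definition: the stand-in layouts for @object_locator_t@ and @eversion_t@, the @redirect_info_t@ wrapper, the numeric flag values, and the helper methods are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <string>

// Simplified stand-ins for the real Ceph types (hypothetical layouts).
struct object_locator_t {
  int64_t pool = -1;
  std::string key;
};
struct eversion_t {
  uint64_t epoch = 0;
  uint64_t version = 0;
};

// Origin objects use NONE..DELETING; target objects use only NONE or TARGET.
enum class redir_state_t : uint8_t {
  NONE, REDIRECT, PROMOTING, DEMOTING, CLEANUP, DELETING, TARGET
};

// Flag bits; the numeric values are assumptions.
constexpr uint8_t PROMOTE_ON_READ = 1 << 0;
constexpr uint8_t PROMOTE_ON_WRITE = 1 << 1;

// The proposed new object_info_t fields, grouped into one struct for clarity.
struct redirect_info_t {
  redir_state_t redir_state = redir_state_t::NONE;  // [origin, target]
  object_locator_t redir_oloc;    // [origin] locator for target object
  eversion_t redir_version;       // [origin, target] when the redirect was set
  uint8_t flags = 0;              // [origin]
  object_locator_t owner_oloc;    // [target] locator for the origin
  eversion_t owner_user_version;  // [target] user_version, not version

  // Hypothetical convenience helpers.
  bool promote_on_read() const { return (flags & PROMOTE_ON_READ) != 0; }
  bool promote_on_write() const { return (flags & PROMOTE_ON_WRITE) != 0; }
  bool is_target() const { return redir_state == redir_state_t::TARGET; }
};
```

Keeping a single state enum for both sides is only one option; separate origin and target enums would make invalid combinations harder to express.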
55 | |||
56 | - primary osd will handle object promote, demote operations (copying to/from alternate location) |
||
57 | - use backend cluster interface to avoid deadlock from throttling ( loic : how can it deadlock from throttling ? sage: hmm, might not be a problem, as long as no recovery operations can block on the redirect state. ) |
||
58 | |||
59 | - objecter can also do a SET_REDIRECT operation: |
||
60 | - will erase local object and set redirect metadata |
||
61 | |||
62 | - return redirect metadata with GET_REDIRECT ( loic : without GET_REDIRECT it would transparently try again when receiving an EAGAIN, in the same way an http client would on a 302 ? sage: yeah this is like lstat().. we want to find out if we are a redirect origin or target ) |
||
63 | |||
64 | *--- osd behavior ---* |
||
65 | |||
66 | on read (no flags): |
||
67 | NONE, DEMOTING, CLEANUP: do the read |
||
68 | REDIRECT: send EAGAIN with redirect metadata to client |
||
69 | PROMOTING: block or forward. ( loic : what does "forward" mean in this context ? I would understand "block then do the read" ) |
||
70 | DELETING: ENOENT |
||
71 | |||
72 | on read (PROMOTE_ON_READ): |
||
73 | NONE, CLEANUP: do the read |
||
74 | DEMOTING: abort the demotion, move to CLEANUP, and do the read |
||
75 | REDIRECT: move to PROMOTING, block then do the read |
||
76 | PROMOTING: block then do the read |
||
77 | DELETING: ENOENT |
||
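The two read tables above can be read as one decision function over the origin state. The sketch below is illustrative only: the type and function names are hypothetical, and for PROMOTING without flags it assumes "block" rather than "forward", since the text leaves that choice open.

```cpp
// Origin-side redirect states, as listed in the data types section.
enum class RedirState { NONE, REDIRECT, PROMOTING, DEMOTING, CLEANUP, DELETING };

// How the OSD answers the read (hypothetical names).
enum class ReadAction { DO_READ, REPLY_EAGAIN, BLOCK_THEN_READ, REPLY_ENOENT };

struct ReadDecision {
  RedirState next;    // state after handling the op
  ReadAction action;  // how the read is answered
};

ReadDecision on_read(RedirState s, bool promote_on_read) {
  switch (s) {
    case RedirState::NONE:
    case RedirState::CLEANUP:
      return {s, ReadAction::DO_READ};
    case RedirState::DEMOTING:
      // With PROMOTE_ON_READ the demotion is aborted and the target cleaned up.
      if (promote_on_read) return {RedirState::CLEANUP, ReadAction::DO_READ};
      return {s, ReadAction::DO_READ};
    case RedirState::REDIRECT:
      if (promote_on_read) return {RedirState::PROMOTING, ReadAction::BLOCK_THEN_READ};
      return {s, ReadAction::REPLY_EAGAIN};  // reply carries the redirect metadata
    case RedirState::PROMOTING:
      // Without flags the text allows "block or forward"; blocking is assumed here.
      return {s, ReadAction::BLOCK_THEN_READ};
    case RedirState::DELETING:
      return {s, ReadAction::REPLY_ENOENT};
  }
  return {s, ReadAction::REPLY_ENOENT};  // unreachable; keeps compilers quiet
}
```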
78 | |||
79 | on write (no flags): |
||
80 | DEMOTING: block |
||
81 | REDIRECT: forward |
||
82 | PROMOTING: block |
||
83 | DELETING: CLEANUP, proceed. |
||
84 | |||
85 | on write (PROMOTE_ON_WRITE): |
||
86 | DEMOTING: |
||
87 | move to CLEANUP |
||
88 | REDIRECT: |
||
89 | move to PROMOTING, block |
||
90 | PROMOTING: block |
||
91 | DELETING: CLEANUP, proceed. |
||
92 | |||
93 | on delete: |
||
94 | DEMOTING, REDIRECT, PROMOTING, CLEANUP: move to DELETING and queue target object for deletion (as with CLEANUP) |
||
95 | DELETING: no change. |
||
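The write and delete rules above can be sketched the same way. The names are hypothetical, and a few cases the tables do not spell out are filled with labeled assumptions: writes in NONE and CLEANUP simply proceed, a PROMOTE_ON_WRITE write in DEMOTING proceeds after moving to CLEANUP, and a delete in NONE is an ordinary delete with no target to clean up.

```cpp
enum class RedirState { NONE, REDIRECT, PROMOTING, DEMOTING, CLEANUP, DELETING };

enum class OpAction { PROCEED, BLOCK, FORWARD };

struct Decision {
  RedirState next;           // state after handling the op
  OpAction action;           // what happens to the op itself
  bool queue_target_delete;  // whether the target object is queued for deletion
};

Decision on_write(RedirState s, bool promote_on_write) {
  switch (s) {
    case RedirState::NONE:
    case RedirState::CLEANUP:
      return {s, OpAction::PROCEED, false};  // not listed in the text; assumed
    case RedirState::DEMOTING:
      // The text says only "move to CLEANUP"; proceeding afterwards is assumed.
      if (promote_on_write) return {RedirState::CLEANUP, OpAction::PROCEED, false};
      return {s, OpAction::BLOCK, false};
    case RedirState::REDIRECT:
      if (promote_on_write) return {RedirState::PROMOTING, OpAction::BLOCK, false};
      return {s, OpAction::FORWARD, false};
    case RedirState::PROMOTING:
      return {s, OpAction::BLOCK, false};
    case RedirState::DELETING:
      return {RedirState::CLEANUP, OpAction::PROCEED, false};
  }
  return {s, OpAction::BLOCK, false};  // unreachable
}

Decision on_delete(RedirState s) {
  switch (s) {
    case RedirState::DEMOTING:
    case RedirState::REDIRECT:
    case RedirState::PROMOTING:
    case RedirState::CLEANUP:
      return {RedirState::DELETING, OpAction::PROCEED, true};
    case RedirState::DELETING:
      return {s, OpAction::PROCEED, false};  // no change
    case RedirState::NONE:
      return {s, OpAction::PROCEED, false};  // plain delete; assumed
  }
  return {s, OpAction::PROCEED, false};  // unreachable
}
```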
96 | |||
97 | on any op: |
||
98 | TARGET: verify the redir_version matches, or EAGAIN |
||
99 | |||
100 | - if we are doing the redirect request and the target does not exist or the version does not match what the redirect/primary had, retry |
||
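A minimal sketch of the target-side check described above, assuming the redirected op carries the @redir_version@ the origin handed out. The @Version@, @TargetObject@, and @check_target@ names are hypothetical, and a single RETRY result is used for both the missing-target and the version-mismatch case.

```cpp
#include <cstdint>

// Simplified stand-in for eversion_t (hypothetical layout).
struct Version {
  uint64_t epoch;
  uint64_t version;
};

inline bool same_version(const Version& a, const Version& b) {
  return a.epoch == b.epoch && a.version == b.version;
}

// What the target pool knows about the object the op was redirected to.
struct TargetObject {
  bool exists;            // object is present in the target pool
  bool is_target;         // its redir_state is TARGET
  Version redir_version;  // version recorded when the redirect was set
};

enum class TargetCheck { SERVE, RETRY };

// Serve the op only if the target exists and matches the version the origin
// handed to the client; otherwise the client retries from the origin.
TargetCheck check_target(const TargetObject& t, const Version& from_origin) {
  if (!t.exists || !t.is_target) return TargetCheck::RETRY;
  if (!same_version(t.redir_version, from_origin)) return TargetCheck::RETRY;
  return TargetCheck::SERVE;
}
```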
101 | |||
102 | - the CLEANUP and DELETING states mean the osd needs to remove the redirect and then transition to NONE or delete (respectively) |
||
103 | |||
104 | *--- objecter behavior ---* |
||
105 | |||
106 | - send op to normal location |
||
107 | - on EAGAIN with redirect metadata, |
||
108 | |||
109 | p((. - note redirect version |
||
110 | - if this is a retry and version hasn't changed, return error to caller. |
||
111 | - resend op to alternate location, *including* the primary's eversion_t |
||
112 | - if we get an error (ENOENT on read), retry from the top |
||
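The objecter steps above amount to a small retry loop. The following sketch models the two sends as callbacks; @Status@, @Reply@, @SendToOrigin@, @SendToTarget@, and @submit@ are hypothetical names, and the @max_attempts@ bound is an added assumption so the loop always terminates even if the redirect version keeps changing.

```cpp
#include <cstdint>
#include <functional>

// Simplified stand-in for eversion_t (hypothetical layout).
struct Version {
  uint64_t epoch;
  uint64_t version;
};

enum class Status { OK, REDIRECTED, NOT_FOUND, FAILED };

struct Reply {
  Status status;
  Version redir_version;  // meaningful only when status == REDIRECTED
};

// Hypothetical transport hooks: send to the normal location, or to the
// alternate location together with the origin's redirect version.
using SendToOrigin = std::function<Reply()>;
using SendToTarget = std::function<Reply(const Version&)>;

Status submit(const SendToOrigin& origin, const SendToTarget& target,
              int max_attempts = 8) {
  bool have_last = false;
  Version last = {0, 0};
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    Reply r = origin();  // send op to normal location
    if (r.status != Status::REDIRECTED) return r.status;

    // A retry that sees the same redirect version cannot make progress.
    if (have_last && last.epoch == r.redir_version.epoch &&
        last.version == r.redir_version.version) {
      return Status::FAILED;
    }
    have_last = true;
    last = r.redir_version;  // note redirect version

    Reply t = target(r.redir_version);  // resend to alternate location
    if (t.status == Status::OK) return Status::OK;
    // Any error from the target (for example ENOENT on read): retry from the top.
  }
  return Status::FAILED;
}
```

The version comparison is what turns a stale redirect into an error for the caller instead of an endless bounce between the two pools.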
113 | |||
114 | *--- pg log events ---* |
||
115 | |||
116 | redir_demote_start -- we are now allowed to start copying to target pool. move to DEMOTING |
||
117 | redir_demote_finish -- target is in place; delete local data and set redirect metadata. move to REDIRECT |
||
118 | redir_promote_cleanup -- did copy from target back to origin; still need to clean up old target. move to CLEANUP |
||
119 | redir_cleanup_finish -- old target is cleaned up. move to NONE |
||
120 | redir_delete_start -- can remove target, move to DELETING |
||
121 | remove (existing event) -- finished removing target, delete object. |
||
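Replaying these events should rebuild the redirect state of an object. The sketch below covers only the six events listed above; the @LogEvent@, @ReplayedObject@, @apply@, and @replay@ names are hypothetical, and it assumes each object starts in NONE.

```cpp
#include <vector>

enum class RedirState { NONE, REDIRECT, PROMOTING, DEMOTING, CLEANUP, DELETING };

// The redirect-related pg log events (enumerator names are hypothetical).
enum class LogEvent {
  REDIR_DEMOTE_START,
  REDIR_DEMOTE_FINISH,
  REDIR_PROMOTE_CLEANUP,
  REDIR_CLEANUP_FINISH,
  REDIR_DELETE_START,
  REMOVE
};

struct ReplayedObject {
  bool exists;
  RedirState state;
};

// Apply one log event to an object's redirect state.
ReplayedObject apply(ReplayedObject o, LogEvent e) {
  switch (e) {
    case LogEvent::REDIR_DEMOTE_START:    o.state = RedirState::DEMOTING; break;
    case LogEvent::REDIR_DEMOTE_FINISH:   o.state = RedirState::REDIRECT; break;
    case LogEvent::REDIR_PROMOTE_CLEANUP: o.state = RedirState::CLEANUP;  break;
    case LogEvent::REDIR_CLEANUP_FINISH:  o.state = RedirState::NONE;     break;
    case LogEvent::REDIR_DELETE_START:    o.state = RedirState::DELETING; break;
    case LogEvent::REMOVE:
      o.exists = false;  // target is gone, so the object itself is deleted
      o.state = RedirState::NONE;
      break;
  }
  return o;
}

// Replay a log from the start; assumes the object begins as a plain object.
ReplayedObject replay(const std::vector<LogEvent>& log) {
  ReplayedObject o = {true, RedirState::NONE};
  for (LogEvent e : log) o = apply(o, e);
  return o;
}
```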
122 | |||
123 | *--- common races ---* |
||
124 | |||
125 | - read vs demote |
||
126 | |||
127 | p((. - if we hit primary while DEMOTING, we get the result |
||
128 | - if we get EAGAIN, we read from the demoted copy |
||
129 | |||
130 | - read vs promote (or read vs demote+promote) |
||
131 | |||
132 | p((. - try primary |
||
133 | - if REDIRECT: |
||
134 | - EAGAIN, try alternate location |
||
135 | - result, or ENOENT and back to primary (and block->success or success) |
||
136 | |||
137 | - if PROMOTING, block, then success |
||
138 | |||
139 | *--- in-memory osd state ---* |
||
140 | |||
141 | For each PG, we maintain: |
||
142 | * set<Demotion*> redir_demoting; ///< all pending demotions |
||
143 | * set<Promotion*> redir_promotion; ///< all pending promotions |
||
144 | * set<Cleanup*> redir_cleanup; ///< all pending cleanups/deletions. |
||
145 | |||
146 | These structs will have a ref to the ObjectContext and will need to orchestrate the push/pull to do the promotion/demotion. They will reuse all of the push/pull helpers used by recovery. |
||
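A simplified sketch of the per-PG bookkeeping, assuming the pending sets can be rebuilt from each object's redirect state. For brevity it tracks object names instead of the @Demotion*@, @Promotion*@, and @Cleanup*@ structs described above, and the @PendingRedirects@, @track@, and @forget@ names are hypothetical.

```cpp
#include <set>
#include <string>

enum class RedirState { NONE, REDIRECT, PROMOTING, DEMOTING, CLEANUP, DELETING };

// Per-PG view of redirect work still in flight, keyed by object name.
struct PendingRedirects {
  std::set<std::string> redir_demoting;   // all pending demotions
  std::set<std::string> redir_promotion;  // all pending promotions
  std::set<std::string> redir_cleanup;    // all pending cleanups/deletions

  // Drop the object from every pending set.
  void forget(const std::string& oid) {
    redir_demoting.erase(oid);
    redir_promotion.erase(oid);
    redir_cleanup.erase(oid);
  }

  // Record the object's current state, moving it between sets as needed.
  void track(const std::string& oid, RedirState s) {
    forget(oid);
    switch (s) {
      case RedirState::DEMOTING:  redir_demoting.insert(oid);  break;
      case RedirState::PROMOTING: redir_promotion.insert(oid); break;
      case RedirState::CLEANUP:
      case RedirState::DELETING:  redir_cleanup.insert(oid);   break;
      default: break;  // NONE and REDIRECT have no pending work
    }
  }
};
```

Calling @track@ for every state change seen during log replay would leave the three sets describing exactly the work that still has to be resumed.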
147 | |||
148 | *--- snapshots ---* |
||
149 | We can start with a simple approach, and add more complex behavior from there. |
||
150 | # Force promote-on-write if a non-empty SnapContext is specified. This ensures that all the snap metadata lives in the main pool and makes sense. Similarly, we refuse to demote anything that is snapped. |
||
151 | # Allow snaps to be demoted. For the primary pool, recovery needs to be adjusted so that the clone_range stuff falls back to a full copy when the snap is a redirect. In the target pool, recovery needs to behave when we have a subset of the snapset... i.e. just the snapped object. It may be simplest if it is not a snap at all: foo @12 -> foo_$version @nosnap with key foo. And writes/cow never happen in the cold pool. |
||
152 | |||
153 | *--- clonerange ---* |
||
154 | If a source item for a clonerange is a redirect, block and promote. |
||
155 | |||
156 | h3. Work items |
||
157 | |||
158 | h3. Coding tasks |
||
159 | |||
160 | # osd: add object_info_t fields for redirects |
||
161 | # add redirect metadata to MOSDOp, MOSDOpReply. |
||
162 | # add a feature bit. |
||
163 | # osd, objecter, librados, api tests: SET_REDIRECT, GET_REDIRECT operations |
||
164 | # osd: basic redirect logic: reply with EAGAIN on primary, verify or EAGAIN on target. |
||
165 | # osd: EINVAL or similar if client lacks feature. |
||
166 | # objecter: handle EAGAIN redirects |
||
167 | # osd: pg log entries to indicate state changes (none -> demoting -> redirect -> promoting -> cleanup, deleting, etc.) |
||
168 | # osd: per-PG map of pending redirect states (demoting, promoting, cleanup, tombstone) |
||
169 | # osd: log replay to update pending redirect states |
||
170 | # osd: support deletion. refactoring to support tombstones. |
||
171 | # osd: promote |
||
172 | # osd: demote |
||
173 | # osd: allow snap |
||
174 | |||
175 | h3. Build / release tasks |
||
176 | |||
177 | # add promote/demote to RadosModel |
||
178 | |||
179 | h3. Documentation tasks |
||
180 | |||
181 | # Task 1 |
||
182 |