h1. Rados - metadata-only journal mode

Version 8, Li Wang, 06/30/2015 02:56 AM

*Summary*
The Ceph community is currently looking to eliminate the double-write penalty of write-ahead logging. Newstore is a great design that implements create and append operations in a copy-on-write way while maintaining all the original semantics. This makes newstore a general-purpose optimization, especially suitable for write-once scenarios. Metadata-only journal mode intends to go further: it does not journal object data at all. This applies to two major kinds of situations. The first is where atomicity of object data modification may not be needed, for example, RBD simulating a disk on a cloud platform. The second is double-journaling situations, for example, cache tiering: since the cache pool already provides durability, dirty objects that are written back theoretically need not go through the journaling process of the base pool, because the flusher can always replay the write operation.

Metadata-only journal mode, to some extent, resembles the data=ordered journal mode in ext4. With this journal mode on, object data are written directly to their ultimate location, and once the data write has finished, the metadata are written into the journal. This guarantees consistency of the RADOS namespace and data consistency among object copies. However, the object data may not be correct. Later we will demonstrate that this rarely happens.

*Owners*

Li Wang (liwang@ubuntukylin.com)
Yunchuan Wen (yunchuanwen@ubuntukylin.com)
Name

*Interested Parties*
If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.
Name (Affiliation)
Name (Affiliation)
Name

*Current Status*
Please describe the current status of Ceph as it relates to this blueprint. Is there something that this replaces? Are there current features that are related?

*Detailed Description*
The algorithm is as follows:

1 Submit transaction A into the journal, marking an <offset, length> non-journaling data write in the pglog (for peering) or in omap/xattrs (for scrub).
2 Write the data to the object.
3 Submit transaction B into the journal to update the metadata and the pglog as usual, and to revert the operations of transaction A.
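
The three steps above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: @ToyOSD@, @txn_a@, @write_data@, @txn_b@ and @metadata_only_write@ are hypothetical names invented here, not Ceph APIs, and the journal, the mark, and the object store are plain Python containers.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOSD:
    """Hypothetical in-memory stand-in for one OSD; not Ceph code."""
    journal: list = field(default_factory=list)   # metadata-only journal entries
    pending: dict = field(default_factory=dict)   # object name -> set of (offset, length) marks
    data: dict = field(default_factory=dict)      # object name -> bytearray
    version: dict = field(default_factory=dict)   # object name -> metadata version

    def txn_a(self, name, offset, length):
        # Step 1: journal a mark that this extent is about to be written unjournaled.
        self.journal.append(("A", name, offset, length))
        self.pending.setdefault(name, set()).add((offset, length))

    def write_data(self, name, offset, payload):
        # Step 2: write straight to the object's final location, bypassing the journal.
        buf = self.data.setdefault(name, bytearray())
        end = offset + len(payload)
        if len(buf) < end:
            buf.extend(b"\0" * (end - len(buf)))
        buf[offset:end] = payload

    def txn_b(self, name, offset, length):
        # Step 3: journal the metadata update and revert the mark left by transaction A.
        self.journal.append(("B", name, offset, length))
        self.version[name] = self.version.get(name, 0) + 1
        self.pending[name].discard((offset, length))


def metadata_only_write(osd, name, offset, payload):
    """Run steps 1 to 3 in order; the payload itself never enters the journal."""
    osd.txn_a(name, offset, len(payload))
    osd.write_data(name, offset, payload)
    osd.txn_b(name, offset, len(payload))


osd = ToyOSD()
metadata_only_write(osd, "obj", 4, b"abcd")
```

If the process stops between @txn_a@ and @txn_b@, the mark stays in @pending@, which is the information peering or scrub would later need to find the extent that was written without journaling.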

As long as one OSD in the PG has succeeded, the PG will be recovered to a consistent and correct state by peering. If the PG goes down as a whole, there are the following situations: (1) None of the OSDs has finished step 1: nothing happens. (2) At least one of the OSDs has finished step 3: journaling and peering will recover the PG to a consistent and correct state. (3) None of the OSDs has finished step 3, and at least one of the OSDs has finished step 1: this is the only potentially problematic situation. In this case, we revise peering or scrub so that they understand the semantics of transaction A, and randomly choose one OSD to synchronize its content of the written area to the other copies. We prefer to leave this to scrub. Since scrub runs asynchronously and may be scheduled to run late, the client's resend may already have brought the content back to a consistent state in the meantime.
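
The case analysis above can be written down as a small sketch. It assumes each OSD's progress is summarized as the highest step (0 to 3) it completed before the PG went down; @classify@ and @sync_marked_extent@ are hypothetical helpers for illustration, not part of peering or scrub in Ceph.

```python
import random


def classify(progress):
    """progress: highest step (0-3) each OSD of the PG completed before going down."""
    if any(p == 3 for p in progress):
        return "recover-by-journal-and-peering"   # situation (2)
    if all(p == 0 for p in progress):
        return "nothing-to-do"                    # situation (1)
    return "sync-marked-extent"                   # situation (3)


def sync_marked_extent(replicas, offset, length, rng=random):
    """Copy the marked extent of one randomly chosen replica onto every replica.

    replicas: list of bytearray, one per OSD copy of the object.
    The result is consistent across copies, not necessarily the client's data.
    """
    source = rng.choice(replicas)
    end = offset + length
    # Pad with zeros in case the chosen copy is shorter than the marked extent.
    chunk = bytes(source[offset:end]).ljust(length, b"\0")
    for copy in replicas:
        if len(copy) < end:
            copy.extend(b"\0" * (end - len(copy)))
        copy[offset:end] = chunk
    return chunk


# Three copies that diverge only inside the marked extent <4, 4>.
replicas = [bytearray(b"AAAAAAAA"), bytearray(b"AAAABBAA"), bytearray(b"AAAABBBB")]
if classify([1, 2, 2]) == "sync-marked-extent":
    sync_marked_extent(replicas, 4, 4, random.Random(0))
```

After the call all copies agree on the marked extent, which matches the guarantee stated in the summary: the copies are consistent with each other, while correctness of the content relies on the client resending the write.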

*Work items*
This section should contain a list of work tasks created by this blueprint. Please include engineering tasks as well as related build/release and documentation work. If this blueprint requires cleanup of deprecated features, please list those tasks as well.

*Coding tasks*
Task 1
Task 2
Task 3

*Build / release tasks*
Task 1
Task 2
Task 3

*Documentation tasks*
Task 1
Task 2
Task 3

*Deprecation tasks*
Task 1
Task 2
Task 3