Rados - metadata-only journal mode » History » Version 11
Li Wang, 06/30/2015 09:42 AM
h1. Rados - metadata-only journal mode

*Summary*

The Ceph community is currently looking at ways to eliminate the double-write
penalty of write-ahead logging. Newstore is a great design that implements
create and append operations in a copy-on-write way while maintaining all
the original semantics. This makes newstore a general-purpose optimization,
especially suitable for write-once scenarios. Metadata-only journal mode
intends to go further, that is, to not journal object data at all.
This applies to two major kinds of situations. The first is where atomicity of
object data modification may not be needed, for example, RBD simulating a disk
on a cloud platform. The second is double-journaling situations, for example,
cache tiering: since the cache pool already provides durability, dirty
objects that are written back theoretically need not go through the journaling
process of the base pool, because the flusher can always replay the write operation.
Metadata-only journal mode, to some extent, resembles the data=ordered journal
mode in ext4. With this journal mode on, object data are written directly to
their ultimate location, and once the data write has finished, the metadata are
written into the journal. This guarantees consistency in terms of the RADOS
namespace, and data consistency among object copies. However, the object data
may not be correct. Later we will demonstrate that this rarely happens.

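To make the double-write penalty concrete, the following is a rough back-of-the-envelope sketch, not a measurement. It assumes a simplified model in which every write stores the data and metadata once at their final location, plus whatever goes through the journal. The function name `bytes_written` and the 4 MiB / 4 KiB sizes are hypothetical and chosen only for illustration.

```python
def bytes_written(data_len, metadata_len, journal_data):
    """Bytes reaching stable storage for one write, under a simplified model."""
    # Data and metadata always land once at their final location.
    in_place = data_len + metadata_len
    # Metadata is always journaled; object data only in full journaling mode.
    journaled = metadata_len + (data_len if journal_data else 0)
    return in_place + journaled


DATA = 4 * 1024 * 1024   # hypothetical 4 MiB object write
META = 4096              # hypothetical 4 KiB of metadata

full = bytes_written(DATA, META, journal_data=True)
meta_only = bytes_written(DATA, META, journal_data=False)
print(full, meta_only, round(full / meta_only, 2))
```

Under these assumptions the full journaling mode writes roughly twice as many bytes as the metadata-only mode for a large write, which is the penalty the blueprint aims to remove.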
*Owners*

Li Wang (liwang@ubuntukylin.com)
Yunchuan Wen (yunchuanwen@ubuntukylin.com)
Name

*Interested Parties*

If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.
Name (Affiliation)
Name (Affiliation)
Name

*Current Status*

Please describe the current status of Ceph as it relates to this blueprint. Is there something that this replaces? Are there current features that are related?

*Detailed Description*

The algorithm is as follows:

1. Submit transaction A into the journal, adding a record in omap for the
<offset, length> of the non-journaled data write.
2. Write the data to the object.
3. Submit transaction B into the journal to update the metadata and the
pglog as usual, and to revert the operations of transaction A.

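The three steps above can be sketched with a small in-memory model. This is only an illustration under stated assumptions, not Ceph code: `ToyOSD`, its fields, and the `crash_after` parameter are all hypothetical names introduced here to show which state survives when a write stops part-way through.

```python
class ToyOSD:
    """In-memory stand-in for one OSD (hypothetical, for illustration only)."""

    def __init__(self, size):
        self.data = bytearray(size)  # object data at its final location
        self.omap = {}               # in-flight <offset, length> records
        self.journal = []            # committed transactions, in order
        self.pglog = []              # completed operations

    def write(self, op_id, offset, payload, crash_after=None):
        # Step 1: transaction A journals the intent record, not the data.
        self.journal.append(("A", op_id))
        self.omap[op_id] = (offset, len(payload))
        if crash_after == 1:
            return False
        # Step 2: write the data in place, bypassing the journal.
        self.data[offset:offset + len(payload)] = payload
        if crash_after == 2:
            return False
        # Step 3: transaction B journals metadata and pglog, and reverts A.
        self.journal.append(("B", op_id))
        self.pglog.append(op_id)
        del self.omap[op_id]
        return True


osd = ToyOSD(16)
ok = osd.write("op1", 4, b"abcd")

crashed = ToyOSD(16)
crashed.write("op2", 0, b"xy", crash_after=2)
print(ok, osd.omap, crashed.omap)
```

In this model a completed write leaves no record in omap, while an interrupted one leaves its <offset, length> record behind, which is what lets later recovery know which range may be inconsistent.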
As long as one OSD in the PG has succeeded, peering will recover the PG to
a consistent and correct state. If the PG goes down as a whole,
there are the following situations:
(1) None of the OSDs has finished step 1: nothing has happened.
(2) At least one of the OSDs has finished step 3: journaling and
peering will recover the PG to a consistent and correct state.
(3) None of the OSDs has finished step 3, and at least one of the OSDs has
finished step 1: this is the only potentially problematic situation.
In this case, we revise peering or scrub so that they understand the
semantics of transaction A, and randomly choose one OSD to synchronize the
content of its written area to the other copies. We prefer to leave this to scrub.
Since scrub runs asynchronously and may be scheduled late, the client's resend
may already have restored the content to a consistent state in the meantime.

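The three whole-PG failure situations and the proposed repair can be sketched as follows. This is a minimal model under assumptions, not the peering or scrub implementation: `classify` and `repair` are hypothetical helpers, and each OSD's progress is reduced to the highest step (0 to 3) it completed for the write.

```python
import random


def classify(progress):
    """Map per-OSD progress (highest finished step, 0..3) to a situation."""
    if any(p >= 3 for p in progress):
        return "situation 2: journaling and peering recover the PG"
    if any(p >= 1 for p in progress):
        return "situation 3: written area must be synchronized"
    return "situation 1: nothing happened"


def repair(replicas, offset, length, rng=random):
    """Copy the written area of one randomly chosen replica onto all copies."""
    source = rng.choice(replicas)
    chunk = bytes(source[offset:offset + length])
    for replica in replicas:
        replica[offset:offset + length] = chunk
    return chunk


# Hypothetical replicas that diverge only inside the written area [1, 3).
replicas = [bytearray(b"aaaa"), bytearray(b"abba"), bytearray(b"aaaa")]
if classify([2, 1, 0]).startswith("situation 3"):
    repair(replicas, offset=1, length=2)
print(replicas)
```

Whichever replica is picked, the copies agree afterwards. The result is consistent but not necessarily the newest data, which matches the blueprint's statement that the object data may not be correct until the client resends the write.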
*Work items*

This section should contain a list of work tasks created by this blueprint. Please include engineering tasks as well as related build/release and documentation work. If this blueprint requires cleanup of deprecated features, please list those tasks as well.

*Coding tasks*

Task 1
Task 2
Task 3

*Build / release tasks*

Task 1
Task 2
Task 3

*Documentation tasks*

Task 1
Task 2
Task 3

*Deprecation tasks*

Task 1
Task 2
Task 3