Rados - metadata-only journal mode » History » Version 5
Li Wang, 06/23/2015 08:17 AM
h1. Rados - metadata-only journal mode
*Summary*

This blueprint proposes a metadata-only journal mode. An important use of Ceph is integration with cloud computing platforms to provide storage for VM images and instances. In this scenario, qemu maps RBD images as virtual block devices, i.e., disks, to a VM, and the guest operating system formats the disks and creates file systems on them. Here RBD mostly resembles a 'dumb' disk; in other words, it is enough for RBD to implement exactly the semantics of a disk controller. Typically, the disk controller itself does not provide a transactional mechanism to ensure that a write operation is done atomically. Instead, it is up to the file system that manages the disk to adopt techniques such as journaling to prevent inconsistency, if necessary. Consequently, RBD does not need to guarantee that a data write is done atomically, since the guest file system will keep its writes to RBD consistent by journaling if needed. Another scenario is cache tiering: since the cache pool already provides durability, dirty objects being written back theoretically need not go through the journaling process of the base pool, because the flusher can replay the write operation. These observations motivate a new journal mode, metadata-only journal mode, which resembles the data=ordered journal mode of ext4. With this mode on, object data are written directly to their final location; once the data write finishes, the metadata are written into the journal, and then the write returns to the caller. This avoids the double-write penalty of object data caused by write-ahead logging and can potentially improve RBD and cache tiering performance significantly.
*Owners*

Li Wang (liwang@ubuntukylin.com)
Yunchuan Wen (yunchuanwen@ubuntukylin.com)
Name
*Interested Parties*

If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.

Name (Affiliation)
Name (Affiliation)
Name
*Current Status*

Please describe the current status of Ceph as it relates to this blueprint. Is there something that this replaces? Are there current features that are related?
*Detailed Description*

1. Submit transaction A into the journal, marking an <offset, length> non-journaled data write in the pglog (for peering) or in omap/xattrs (for scrub).
2. Write the data directly to the object.
3. Submit transaction B into the journal to update the metadata as well as the pglog as usual.

As long as one OSD in the PG has succeeded, the PG will be recovered to a consistent and correct state by peering. The only potentially problematic situation is when the PG goes down as a whole, none of the OSDs has finished step (3), and at least one of the OSDs has finished step (1). In that case, we revise peering or scrub to make them aware of the semantics of transaction A, and randomly choose one OSD to synchronize the content of the written area to the other copies. We prefer to leave this to scrub: since scrub runs asynchronously and may be scheduled to run late, the client's resend may already have restored the content to a consistent state during that period.
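The recovery rule can be sketched as follows. This is a hedged model under the blueprint's description only; the `replicas` record structure is hypothetical and does not correspond to Ceph's actual peering or scrub data structures:

```python
import random

def recover(replicas):
    """Pick an authoritative copy of the written <offset, length> region.

    replicas: list of dicts with keys
      'committed' -- bool, this OSD finished step (3)
      'pending'   -- bool, this OSD finished step (1), i.e. it has a
                     transaction-A marker but possibly no commit record
      'data'      -- bytes of the written region on this OSD
    """
    committed = [r for r in replicas if r["committed"]]
    if committed:
        # Some OSD finished step (3): peering recovers the PG from it.
        source = committed[0]
    else:
        # PG went down as a whole before any step (3): scrub sees the
        # transaction-A marker and randomly picks one OSD that finished
        # step (1) to synchronize the written area to the other copies.
        pending = [r for r in replicas if r["pending"]]
        if not pending:
            return replicas  # no write in flight; nothing to do
        source = random.choice(pending)
    for r in replicas:
        r["data"] = source["data"]
    return replicas
```

The random choice is safe precisely because the guest file system treats RBD as a disk: a write that never returned to the caller may legitimately land as either the old or the new content, as long as all replicas agree afterwards.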
*Work items*

This section should contain a list of work tasks created by this blueprint. Please include engineering tasks as well as related build/release and documentation work. If this blueprint requires cleanup of deprecated features, please list those tasks as well.
*Coding tasks*
Task 1
Task 2
Task 3

*Build / release tasks*
Task 1
Task 2
Task 3

*Documentation tasks*
Task 1
Task 2
Task 3

*Deprecation tasks*
Task 1
Task 2
Task 3