Rados - metadata-only journal mode » History » Version 11

Li Wang, 06/30/2015 09:42 AM

h1. Rados - metadata-only journal mode

*Summary*
The Ceph community is currently looking at ways to eliminate the double-write
penalty of write-ahead logging. Newstore is a great design that implements
create and append operations in a copy-on-write way while maintaining all
the original semantics. This makes newstore a general-purpose optimization,
especially suitable for write-once scenarios. Metadata-only journal mode
takes a more aggressive approach: it does not journal object data at all.
This applies to two major kinds of situations. The first is where atomicity of
object data modification may not be needed, for example, RBD simulating a disk
on a cloud platform. The second is double-journaling situations, for example,
cache tiering: since the cache pool already provides durability, dirty
objects being written back theoretically need not go through the journaling
process of the base pool, because the flusher can always replay the write operation.
Metadata-only journal mode resembles, to some extent, the data=ordered journal
mode in ext4. With this journal mode on, object data are written directly to
their ultimate location, and once the data write has finished, the metadata are
written into the journal. This guarantees consistency of the RADOS namespace and
data consistency among object copies. However, the object data may not be correct.
Later we will demonstrate that this rarely happens.

*Owners*

Li Wang (liwang@ubuntukylin.com)
Yunchuan Wen (yunchuanwen@ubuntukylin.com)
Name

*Interested Parties*
If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.
Name (Affiliation)
Name (Affiliation)
Name

*Current Status*
Please describe the current status of Ceph as it relates to this blueprint.  Is there something that this replaces?  Are there current features that are related?

*Detailed Description*
The algorithm is as follows:

1 Submit transaction A into the journal, adding a record of the <offset, length>
of the non-journaled data write to omap
2 Write the data to the object
3 Submit transaction B into the journal to update the metadata and the
pglog as usual, and to revert the operations of transaction A
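
The three steps above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: @ToyOsd@, its fields and @metadata_only_write@ are hypothetical names, not Ceph APIs, and a single integer @version@ stands in for the metadata and pglog update.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOsd:
    """In-memory stand-in for one OSD (hypothetical, not a Ceph API)."""
    data: bytearray = field(default_factory=bytearray)  # object data, written in place
    omap: dict = field(default_factory=dict)            # records of in-flight writes
    journal: list = field(default_factory=list)         # committed transactions, in order
    version: int = 0                                     # stands in for metadata + pglog

    def metadata_only_write(self, offset: int, payload: bytes) -> None:
        extent = (offset, len(payload))

        # Step 1: transaction A journals only the <offset, length> record.
        self.journal.append(("A", extent))
        self.omap[extent] = "in-flight"

        # Step 2: the data goes straight to its final location,
        # never through the journal.
        end = offset + len(payload)
        if len(self.data) < end:
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = payload

        # Step 3: transaction B journals the metadata/pglog update
        # and reverts the record left by transaction A.
        self.journal.append(("B", extent))
        self.version += 1
        del self.omap[extent]


osd = ToyOsd()
osd.metadata_only_write(4, b"abcd")
print(bytes(osd.data), osd.omap, osd.journal)
```

In this model, a record still present in @omap@ after a crash means that step 1 committed but step 3 did not, which is exactly the information the failure analysis relies on.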

As long as one OSD in the PG has succeeded, the PG will be recovered to
a consistent and correct state by peering. If the PG goes down as a whole,
there are the following situations:
(1) None of the OSDs has finished step 1: nothing happens;
(2) At least one of the OSDs has finished step 3: journaling and
peering will recover the PG to a consistent and correct state;
(3) None of the OSDs has finished step 3, and at least one of the OSDs has
finished step 1: this is the only potentially problematic situation.
In this case, we revise peering or scrub so that they understand the
semantics of transaction A, and randomly choose one OSD to synchronize the
content of its written area to the other copies. We prefer to leave this to scrub.
Since scrub runs asynchronously and may be scheduled late, the client's resend
may already have brought the content back to a consistent state in the meantime.
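
The three situations can be expressed as a small decision function, followed by the repair proposed for situation (3). This is a rough sketch: @classify_pg_crash@ and @resync_extent@ are hypothetical helpers, each OSD is reduced to the number of the last step it finished (0 meaning none), and replicas are modelled as plain byte arrays.

```python
import random


def classify_pg_crash(last_step_done):
    """Map each OSD's last finished step (0..3) to the situation numbers in the text."""
    if all(step == 0 for step in last_step_done):
        return 1  # transaction A committed nowhere: nothing happened
    if any(step == 3 for step in last_step_done):
        return 2  # some OSD committed transaction B: journaling and peering recover
    return 3      # only transaction A records exist: the written area may differ


def resync_extent(replicas, offset, length, rng=random):
    """Situation 3 repair: copy the written area from one randomly chosen replica."""
    source = rng.choice(replicas)
    for replica in replicas:
        replica[offset:offset + length] = source[offset:offset + length]
    return replicas


# Three copies whose written area <4, 4> diverged after a PG-wide crash.
replicas = [bytearray(b"aaaaXXXX"), bytearray(b"aaaaYYYY"), bytearray(b"aaaaaaaa")]
if classify_pg_crash([2, 1, 1]) == 3:
    resync_extent(replicas, offset=4, length=4)
print(len({bytes(r) for r in replicas}))
```

After the resync all copies agree on the written area, although, as noted in the summary, the agreed content is not guaranteed to be the data the client intended until the client resends the write.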

*Work items*
This section should contain a list of work tasks created by this blueprint.  Please include engineering tasks as well as related build/release and documentation work.  If this blueprint requires cleanup of deprecated features, please list those tasks as well.

*Coding tasks*
Task 1
Task 2
Task 3

*Build / release tasks*
Task 1
Task 2
Task 3

*Documentation tasks*
Task 1
Task 2
Task 3

*Deprecation tasks*
Task 1
Task 2
Task 3