h1. Rados - metadata-only journal mode

*Summary*

Currently the Ceph community is working to eliminate the double-write penalty of write-ahead logging. Newstore is a great design which implements create and append operations in a copy-on-read way while maintaining all the original semantics; this makes newstore a general-purpose optimization, especially suitable for write-once scenarios. Metadata-only journal mode intends to go further: it does not journal object data at all. This applies to two major kinds of situations. The first is where atomicity of object data modification is not needed, for example RBD simulating a disk in a cloud platform. The second is double-journaling situations, for example cache tiering: since the cache pool has already provided durability, dirty objects being written back theoretically need not go through the journaling process of the base pool, because the flusher could always replay the write operation.

Metadata-only journal mode, to some extent, resembles the data=ordered journal mode in ext4. With this journal mode on, object data are written directly to their final location; once the data writes have finished, the metadata are written into the journal. This guarantees consistency of the RADOS namespace and data consistency among object copies. However, the object data themselves may not be correct; later we will demonstrate that this rarely happens.
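
To make the ordering concrete, here is a minimal sketch of the write path under this mode, in the spirit of data=ordered. This is not Ceph code: the Journal and MetadataRecord types are hypothetical stand-ins, and flush() stands in for the fsync that orders data before metadata.

<pre><code class="cpp">
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for the object store's journal; not Ceph code.
struct MetadataRecord {
  std::string object;
  uint64_t offset;
  uint64_t length;
};

struct Journal {
  std::vector<MetadataRecord> records;
  void commit(const MetadataRecord& r) {
    // In a real store this would be a durable, ordered journal write.
    records.push_back(r);
  }
};

// data=ordered-style write: object data go straight to their final
// location, and only after the data write finishes is metadata journaled.
void metadata_only_write(Journal& j, const std::string& object,
                         uint64_t offset, const std::string& data) {
  std::fstream f(object, std::ios::in | std::ios::out | std::ios::binary);
  if (!f.is_open())
    f.open(object, std::ios::out | std::ios::binary);
  f.seekp(static_cast<std::streamoff>(offset));
  f.write(data.data(), static_cast<std::streamsize>(data.size()));
  f.flush();  // stands in for the fsync that orders data before metadata

  j.commit({object, offset, data.size()});  // journal metadata only
}

int main() {
  Journal j;
  metadata_only_write(j, "rbd_object.bin", 0, "hello");
  std::cout << "journaled " << j.records.size() << " metadata record(s)\n";
}
</code></pre>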

*Owners*

Li Wang (liwang@ubuntukylin.com)
Yunchuan Wen (yunchuanwen@ubuntukylin.com)
Name

*Interested Parties*
If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.
Name (Affiliation)
Name (Affiliation)
Name

*Current Status*
Please describe the current status of Ceph as it relates to this blueprint.  Is there something that this replaces?  Are there current features that are related?

*Detailed Description*
We have two options.

The first option

The only revision is that object data are not journaled; when the transaction commits, the data are written directly into the object. In most cases this introduces no problems, thanks to the powerful peering and client-resend mechanisms. The only problematic situation is when the PG goes down as a whole and the client goes down too; in that case, the guest filesystem in the VM can likely recover to a consistent state by fsck and journal replay. So we simply leave it to scrub to find and fix the inconsistency by randomly synchronizing one of the copies to the others.
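
As a toy illustration of that scrub fix (a model, not the OSD scrub code: replica contents are modeled as plain byte strings), randomly picking one copy and synchronizing it to the others restores replica consistency, even though the surviving content may be the pre-write version:

<pre><code class="cpp">
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

// Toy model of scrub repair: replicas of an object are byte strings.
// On divergence, pick one copy at random as authoritative and
// synchronize it to the others.
void scrub_repair(std::vector<std::string>& replicas) {
  bool divergent = false;
  for (const auto& r : replicas)
    if (r != replicas.front()) divergent = true;
  if (!divergent) return;

  // Copy the winner before overwriting, since it lives in the vector.
  std::string winner = replicas[std::rand() % replicas.size()];
  for (auto& r : replicas) r = winner;
}

int main() {
  std::vector<std::string> pg = {"new data", "old data", "new data"};
  scrub_repair(pg);
  for (const auto& r : pg) std::cout << r << "\n";  // all copies agree now
}
</code></pre>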

The second option

1 Submit transaction A into the journal, adding a record for the <offset, length> non-journaled data write in omap
2 Write the data to the object
3 Submit transaction B into the journal to update the metadata and revert the operations of transaction A (see the sketch after this list)

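A minimal sketch of those three steps on a single OSD follows. The types are hypothetical: the omap record is modeled as a pending <offset, length> extent, and "reverting transaction A" as erasing that record once the data is safely in place.

<pre><code class="cpp">
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <utility>

using Extent = std::pair<uint64_t, uint64_t>;  // <offset, length>

// Hypothetical per-object state: data plus an omap that records
// in-flight, non-journaled data writes.
struct Object {
  std::string data;
  std::map<std::string, Extent> omap;
};

void write_extent(Object& o, uint64_t off, const std::string& buf) {
  // Step 1 (transaction A): journal a record of the pending data write.
  o.omap["pending_write"] = {off, buf.size()};

  // Step 2: write the data directly to the object, bypassing the journal.
  if (o.data.size() < off + buf.size())
    o.data.resize(off + buf.size());
  o.data.replace(off, buf.size(), buf);

  // Step 3 (transaction B): update metadata and revert transaction A,
  // i.e. drop the pending-write record now that the data has landed.
  o.omap.erase("pending_write");
}

int main() {
  Object o;
  write_extent(o, 4, "data");
  std::cout << o.data.size() << " bytes, " << o.omap.size()
            << " pending record(s)\n";  // "8 bytes, 0 pending record(s)"
}
</code></pre>
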
As long as one OSD in the PG has succeeded, the PG will be recovered to a consistent and correct state by peering. If the PG goes down as a whole, there are the following situations:
(1) None of the OSDs finished step 1: nothing has happened.
(2) At least one of the OSDs finished step 3: journal replay and peering will recover the PG to a consistent and correct state.
(3) None of the OSDs finished step 3, and at least one of the OSDs finished step 1: this is the only potentially problematic situation. In this case, peering will synchronize the omap record to the other OSDs in the PG. For object reads and writes, if the record is found, the operation is made to wait and a recovery is started that randomly chooses one OSD and synchronizes the content of the written area to the other copies (see the sketch below). During scrub, the record is also checked and the same recovery performed.
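
The read/write gate and recovery just described might look like the following sketch. It reuses the same toy Object as above; recovery is the random-copy synchronization from the first option, restricted to the recorded extent.

<pre><code class="cpp">
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Same toy per-object state as in the previous sketch.
struct Object {
  std::string data;
  std::map<std::string, std::pair<uint64_t, uint64_t>> omap;
};

// Peering has already copied the pending-write record to every replica,
// so checking any one of them tells us whether recovery is needed.
bool needs_recovery(const std::vector<Object>& replicas) {
  return !replicas.empty() &&
         replicas.front().omap.count("pending_write") > 0;
}

// Before serving a blocked read or write: pick one replica at random,
// synchronize the content of the written area to the other copies,
// then clear the record everywhere.
void recover(std::vector<Object>& replicas) {
  if (!needs_recovery(replicas)) return;  // fast path: no record
  auto [off, len] = replicas.front().omap.at("pending_write");
  std::string area =
      replicas[std::rand() % replicas.size()].data.substr(off, len);
  for (auto& r : replicas) {
    if (r.data.size() < off + len) r.data.resize(off + len);
    r.data.replace(off, len, area);
    r.omap.erase("pending_write");
  }
  // ... the waiting read or write can proceed now.
}

int main() {
  std::vector<Object> pg(3);
  for (auto& r : pg) r.omap["pending_write"] = {0, 5};
  pg[0].data = "AAAAA";  // one OSD finished the data write
  pg[1].data = "AAAA";   // another was cut short mid-write
  recover(pg);
  for (const auto& r : pg) std::cout << "'" << r.data << "'\n";
}
</code></pre>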

*Work items*
This section should contain a list of work tasks created by this blueprint.  Please include engineering tasks as well as related build/release and documentation work.  If this blueprint requires cleanup of deprecated features, please list those tasks as well.

*Coding tasks*
Task 1
Task 2
Task 3

*Build / release tasks*
Task 1
Task 2
Task 3

*Documentation tasks*
Task 1
Task 2
Task 3

*Deprecation tasks*
Task 1
Task 2
Task 3