h1. Rados - metadata-only journal mode

*Summary*

This blueprint proposes a metadata-only journal mode: for a write operation, the OSD journals only the metadata, without journaling the written data.

An important use of Ceph is integration with cloud computing platforms to provide storage for VM images and instances. In such a scenario, qemu maps RBD images as virtual block devices, i.e., disks, to a VM, and the guest operating system formats the disks and creates file systems on them. In this case, RBD mostly resembles a 'dumb' disk; in other words, it is enough for RBD to implement exactly the semantics of a disk controller driver. Typically, the disk controller itself does not provide a transactional mechanism to ensure that a write operation completes atomically. Instead, it is up to the file system that manages the disk to adopt techniques such as journaling to prevent inconsistency, if necessary. Consequently, RBD does not need to provide an atomic mechanism for data writes either, since the guest file system will keep its writes to RBD consistent by journaling them if needed. Another scenario is cache tiering: the cache pool already provides durability, so when dirty objects are written back they theoretically need not go through the journaling process of the base pool, since the flusher could replay the write operation. These observations motivate a new journal mode, the metadata-only journal mode, which resembles the data=ordered journal mode in ext4. With this mode on, object data are written directly to their ultimate location; once the data write finishes, the metadata are written into the journal, and then the write returns to the caller. This avoids the double-write penalty of object data caused by write-ahead logging, potentially greatly improving RBD and cache tiering performance.
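
To make the ordering concrete, here is a minimal sketch contrasting write-ahead logging with the proposed metadata-only ordering. It assumes hypothetical Journal and ObjectStore stand-ins for illustration only; these are not the actual FileStore/journal interfaces.

<pre><code class="cpp">
#include <cstdint>
#include <iostream>
#include <string>

// Hypothetical stand-ins for illustration only; not the real OSD interfaces.
struct Journal {
  void submit(const std::string &entry) { std::cout << "journal: " << entry << "\n"; }
};
struct ObjectStore {
  void write(const std::string &oid, uint64_t off, const std::string &data) {
    std::cout << "store: " << oid << "+" << off << " (" << data.size() << " bytes)\n";
  }
};

// Write-ahead logging: the full data passes through the journal first, so
// object data is effectively written twice (journal + final location).
void write_ahead(Journal &j, ObjectStore &s, const std::string &oid,
                 uint64_t off, const std::string &data) {
  j.submit("metadata + " + std::to_string(data.size()) + " bytes of data for " + oid);
  s.write(oid, off, data);  // applied to the final location afterwards
}

// Metadata-only mode (data=ordered style): data goes straight to its final
// location, then only a small metadata entry is journaled before replying.
void metadata_only(Journal &j, ObjectStore &s, const std::string &oid,
                   uint64_t off, const std::string &data) {
  s.write(oid, off, data);               // single write of the object data
  j.submit("metadata only for " + oid);
  // reply to the client here
}

int main() {
  Journal j;
  ObjectStore s;
  write_ahead(j, s, "rbd_data.1", 0, "AAAA");
  metadata_only(j, s, "rbd_data.1", 0, "AAAA");
  return 0;
}
</code></pre>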

*Owners*

Li Wang (liwang@ubuntukylin.com)
Yunchuan Wen (yunchuanwen@ubuntukylin.com)
Name

*Interested Parties*

If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.
Name (Affiliation)
Name (Affiliation)
Name

*Current Status*

Please describe the current status of Ceph as it relates to this blueprint.  Is there something that this replaces?  Are there current features that are related?

*Detailed Description*

For a write operation, the OSD proceeds in three steps (sketched below):

1. Submit transaction A into the journal, marking an <offset, length> non-journaled data write in the pglog (for peering) or in omap/xattrs (for scrub)
2. Write the data to the object
3. Submit transaction B into the journal, to update the metadata as well as the pglog as usual
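
As a rough sketch of these three steps, assuming hypothetical Extent, Transaction, and OSDShim placeholder types (not the actual ObjectStore::Transaction or PGLog interfaces):

<pre><code class="cpp">
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical placeholder types for illustration; not the real OSD classes.
struct Extent { uint64_t offset; uint64_t length; };
struct Transaction { std::vector<std::string> ops; };

struct OSDShim {
  void journal_submit(const Transaction &t) { (void)t; /* queue to the journal */ }
  void store_write(const std::string &oid, const Extent &e,
                   const std::string &data) { (void)oid; (void)e; (void)data; }

  void metadata_only_write(const std::string &oid, const Extent &e,
                           const std::string &data) {
    // Step 1: transaction A records an <offset, length> marker for the
    // in-flight, non-journaled data write (in the pglog for peering, or in
    // omap/xattrs for scrub).
    Transaction a;
    a.ops.push_back("mark non-journaled write " + oid + " @" +
                    std::to_string(e.offset) + "+" + std::to_string(e.length));
    journal_submit(a);

    // Step 2: write the data directly to the object's final location.
    store_write(oid, e, data);

    // Step 3: transaction B updates the metadata and the pglog as usual,
    // clearing the marker; only then does the write complete to the client.
    Transaction b;
    b.ops.push_back("update metadata + pglog for " + oid);
    journal_submit(b);
  }
};

int main() {
  OSDShim osd;
  osd.metadata_only_write("rbd_data.1", Extent{0, 4096}, std::string(4096, 'A'));
  return 0;
}
</code></pre>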

As long as at least one OSD in the PG has succeeded, peering will recover the PG to a consistent and correct state. The only potentially problematic situation is when the PG goes down as a whole, none of the OSDs has finished step (3), and at least one of the OSDs has finished step (1). In that case, we revise peering or scrub to make them understand the semantics of transaction A, and randomly choose one OSD to synchronize the content of the written area to the other copies. We prefer to leave this to scrub: since scrub runs asynchronously and may be scheduled late, the client's resend may have already restored the content to a consistent state during that period.
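
A minimal sketch of that recovery decision, again with hypothetical types; in reality this logic would live in the revised peering/scrub code:

<pre><code class="cpp">
#include <vector>

// Hypothetical per-replica view of a marked, non-journaled write; illustration only.
struct ReplicaState {
  bool finished_step1;  // transaction A (the marker) is durable
  bool finished_step3;  // transaction B (metadata + pglog) is durable
};

// If any replica completed step 3, ordinary peering brings the PG to a
// consistent state. Only when the whole PG went down with no step-3 survivor
// but at least one step-1 marker present does the marked region need to be
// copied from one (randomly chosen) replica to the others; that job is left
// to scrub, since a client resend may restore consistency before scrub runs.
bool needs_region_sync(const std::vector<ReplicaState> &replicas) {
  bool any_step3 = false, any_step1 = false;
  for (const auto &r : replicas) {
    any_step3 = any_step3 || r.finished_step3;
    any_step1 = any_step1 || r.finished_step1;
  }
  return !any_step3 && any_step1;
}

int main() {
  std::vector<ReplicaState> pg = {{true, false}, {false, false}, {false, false}};
  return needs_region_sync(pg) ? 0 : 1;  // 0: scrub must sync the marked region
}
</code></pre>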

*Work items*

This section should contain a list of work tasks created by this blueprint.  Please include engineering tasks as well as related build/release and documentation work.  If this blueprint requires cleanup of deprecated features, please list those tasks as well.

*Coding tasks*
Task 1
Task 2
Task 3

*Build / release tasks*
Task 1
Task 2
Task 3

*Documentation tasks*
Task 1
Task 2
Task 3

*Deprecation tasks*
Task 1
Task 2
Task 3