Osd - clone from journal on btrfs » History » Version 1

Jessica Mack, 06/07/2015 01:14 AM

h1. Osd - clone from journal on btrfs

h3. Summary

The OSD normally writes each update twice: once to the journal, and then again to the backing file system. If the backing file system is btrfs and the journal is a file on that btrfs file system, we can avoid the second write by cloning the data of large writes from the journal into their final objects.

h3. Owners

* Samuel Just (Inktank)

h3. Interested Parties

* Sage Weil (Inktank)
* Samuel Just (Inktank)
* Mark Nelson (Inktank)
* Haomai Wang (UnitedStack)
* Anip Patel (Arizona State University)

h3. Current Status

Currently the journal events are opaque lumps of data.
Journaling is usually done in 'parallel' mode on btrfs, which means the journal and fs writes are queued at the same time. Clone from journal probably requires that the journal write complete before the object write is applied, since the data must already be in the journal file before it can be cloned.
The journal completion logic currently cannot tell where in the journal file the data portion of an event ended up.

h3. Detailed Description

Rather than writing the data a second time to the object file, we will instead clone it from the journal file into the object.
We need to use writeahead journaling when this feature is enabled, either for the entire store, or just for the events/writes that we wish to do clones on.
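
To illustrate what "clone from the journal file" could look like at the system-call level, here is a minimal sketch that uses the generic Linux @FICLONERANGE@ ioctl (the same request number as @BTRFS_IOC_CLONE_RANGE@). It is not taken from the OSD code: the function names @pack_clone_range@ and @clone_from_journal@ are hypothetical, and the request value assumes the usual ioctl encoding found on x86 and ARM. The sketch also assumes that both files live on the same btrfs file system and that the offsets and length are aligned to the file system block size.

```python
import fcntl
import struct

# FICLONERANGE from <linux/fs.h>: _IOW(0x94, 13, struct file_clone_range).
# Assumption: the common ioctl encoding (x86, ARM); other architectures may differ.
FICLONERANGE = 0x4020940D


def pack_clone_range(src_fd, src_offset, length, dest_offset):
    """Pack struct file_clone_range:
    { __s64 src_fd; __u64 src_offset; __u64 src_length; __u64 dest_offset; }
    """
    return struct.pack("=qQQQ", src_fd, src_offset, length, dest_offset)


def clone_from_journal(journal_fd, journal_offset, length, object_fd, object_offset):
    """Make `length` bytes of the object file share extents with the journal file.

    Hypothetical helper. The ioctl is issued on the destination (object) file
    and names the source (journal) file inside the argument struct.
    Raises OSError if the file system cannot clone the range, in which case
    the caller would fall back to a normal write.
    """
    arg = pack_clone_range(journal_fd, journal_offset, length, object_offset)
    fcntl.ioctl(object_fd, FICLONERANGE, arg)
```

Because the clone only works once the data is already in the journal file, a caller along these lines would invoke @clone_from_journal@ from the journal completion path, which is why writeahead ordering is needed for the writes that are cloned.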

h3. Work items

h3. Coding tasks

# track offset, length of data portion in the journal event metadata
# record final location in journal for the data portion
# pass final location to journal completion handler
# allow the completion handler to do a clone instead of the normal write if certain conditions are met (for example, the write is larger than some minimum size)
# ensure that journal replay still performs the complete write
# consider a hybrid parallel/writeahead approach where large writes go to the journal first and then to the fs, while small writes are still done in parallel
# modify ceph-deploy or other tools to use journal files when the backing fs is btrfs (instead of a separate partition)
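
The bookkeeping and decision logic described in the coding tasks above can be sketched roughly as follows. This is an illustration only, not the OSD's actual data structures: @JournalEvent@, @wants_writeahead@, @choose_apply@, @MIN_CLONE_BYTES@, and @BLOCK_SIZE@ are all hypothetical names, and the threshold and block size are placeholder values. The alignment check reflects the assumption that the file system only clones block-aligned ranges.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

BLOCK_SIZE = 4096             # assumed file system block size
MIN_CLONE_BYTES = 64 * 1024   # hypothetical "minimum size" threshold


@dataclass
class JournalEvent:
    # Offset and length of the data portion inside the encoded event.
    data_offset: Optional[int] = None
    data_length: int = 0
    # Where the encoded event landed in the journal file; filled in when
    # the journal write completes and passed to the completion handler.
    journal_offset: Optional[int] = None

    def data_location(self) -> Optional[Tuple[int, int]]:
        """Absolute (offset, length) of the data in the journal file, if known."""
        if self.data_offset is None or self.journal_offset is None:
            return None
        return (self.journal_offset + self.data_offset, self.data_length)


def wants_writeahead(ev: JournalEvent) -> bool:
    """Hybrid mode: large writes go journal-first, small ones stay parallel."""
    return ev.data_length >= MIN_CLONE_BYTES


def choose_apply(ev: JournalEvent, object_offset: int, replaying: bool = False) -> str:
    """Return "clone" or "write" for applying the event to the object file."""
    if replaying:
        return "write"            # journal replay always performs the complete write
    loc = ev.data_location()
    if loc is None:
        return "write"            # location unknown, nothing to clone from
    src_offset, length = loc
    if length < MIN_CLONE_BYTES:
        return "write"            # too small to be worth a clone
    if src_offset % BLOCK_SIZE or object_offset % BLOCK_SIZE or length % BLOCK_SIZE:
        return "write"            # unaligned range, assumed not clonable
    return "clone"
```

In this sketch the normal write remains the fallback in every uncertain case, so the clone path is purely an optimization and replay behaviour does not depend on it.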

h3. Build / release tasks

# do performance tests to confirm this is a significant improvement
# expand rados test matrix to include all journaling modes

h3. Documentation tasks

# document the option
# document the internals in the internals section