Jessica Mack, 06/22/2015 02:06 AM
1 | 1 | Jessica Mack | h1. Osd - new keyvalue backend |
---|---|---|---|
2 | 1 | Jessica Mack | |
3 | 1 | Jessica Mack | h3. Summary |
4 | 1 | Jessica Mack | |
5 | 1 | Jessica Mack | Create generic, reusable components to back a ceph-osd with a key/value interface |
6 | 1 | Jessica Mack | |
7 | 1 | Jessica Mack | h3. Owners |
8 | 1 | Jessica Mack | |
9 | 1 | Jessica Mack | * Sage Weil (Inktank) |
10 | 1 | Jessica Mack | |
11 | 1 | Jessica Mack | h3. Interested Parties |
12 | 1 | Jessica Mack | |
13 | 1 | Jessica Mack | * Haomai Wang (UnitedStack) |
14 | 1 | Jessica Mack | * Yan, Zheng (Intel) |
15 | 1 | Jessica Mack | * Jiangang, Duan (Intel) |
16 | 1 | Jessica Mack | * Anip Patel (Arizona State University, student) |
17 | 1 | Jessica Mack | * Andrey Korolyov (Flops) |
18 | 1 | Jessica Mack | |
19 | 1 | Jessica Mack | h3. Current Status |
20 | 1 | Jessica Mack | |
21 | 1 | Jessica Mack | Several new key/value interfaces are emerging, including: |
22 | 1 | Jessica Mack | * Seagate Kinetic https://github.com/Seagate/Kinetic-Preview |
23 | 1 | Jessica Mack | * Fusion-io NVMKV https://github.com/opennvm/nvmkv |
24 | 1 | Jessica Mack | * various proprietary interfaces (mostly from flash vendors) |
25 | 1 | Jessica Mack | |
26 | 1 | Jessica Mack | New storage hardware, including shingled drives and flash, will behave *much* better using these emerging interfaces. |
27 | 1 | Jessica Mack | The KeyValueDB interface abstracts individual key/value interfaces. It includes a transaction primitive. Currently the only implementation uses leveldb. Alternatives we should consider include: |
28 | 1 | Jessica Mack | * RocksDB (https://github.com/facebook/rocksdb) |
29 | 1 | Jessica Mack | |
30 | 1 | Jessica Mack | The DBObjectMap interface builds a simple tree structure on top of KeyValueDB that hides some of the namespace complexity (e.g. omap keys vs xattrs keys) and includes a header/intermediate node that allows clone() to happen efficiently. |
31 | 1 | Jessica Mack | Haomai has a prototype ObjectStore implementation that uses leveldb on the backend, but it is not quite functional yet, and does not support operations like clone. |
32 | 1 | Jessica Mack | |
33 | 1 | Jessica Mack | h3. Detailed Description |
34 | 1 | Jessica Mack | |
35 | 1 | Jessica Mack | Build an ObjectStore implementation that builds on DBObjectMap (and KeyValueDB) to store everything. There will be no direct filesystem interaction. |
36 | 1 | Jessica Mack | Journaling: |
37 | 1 | Jessica Mack | * Since KeyValueDB has a transaction primitive, do not include a journal at all |
38 | 1 | Jessica Mack | * We may want to add this capability later so that (say) NVRAM can mask the latency of a slow k/v backend, but for now let's ignore it for simplicity. |
39 | 1 | Jessica Mack | * FileJournal (and/or the abstract Journal) should be reusable. |
40 | 1 | Jessica Mack | * JournalingObjectStore probably is not reusable, but that's ok |
41 | 1 | Jessica Mack | |
42 | 1 | Jessica Mack | Note on transactions: |
43 | 1 | Jessica Mack | * For now, I suggest we assume we can build on the existing KeyValueDB interface, which includes transactions. |
44 | 1 | Jessica Mack | ** leveldb has transactions |
45 | 1 | Jessica Mack | ** NVMKV has batch_put, which is limited to 64 keys |
46 | 1 | Jessica Mack | ** kinetic has no transactions |
47 | 1 | Jessica Mack | * To get the atomicity we need, there is a range of tricks available: |
48 | 1 | Jessica Mack | ** Write data to new keys, update a sentinel/root key at the end |
49 | 1 | Jessica Mack | ** Write data to new keys, batch-rename them into place (if backend allows such a thing) |
50 | 1 | Jessica Mack | ** Intent logs |
51 | 1 | Jessica Mack | ** Write-ahead transaction journaling (when necessary) |
52 | 1 | Jessica Mack | * I am not sure whether we can efficiently hide a 'transaction layer' beneath the KeyValueDB interface, but for now let's assume we will be able to. |
53 | 1 | Jessica Mack | |
54 | 1 | Jessica Mack | h3. Work items |
55 | 1 | Jessica Mack | |
56 | 1 | Jessica Mack | h4. Coding tasks |
57 | 1 | Jessica Mack | |
58 | 1 | Jessica Mack | # refactor OSD awareness of FileStore to make the ObjectStore backend configurable |
59 | 1 | Jessica Mack | ## use a generic method to get an ObjectStore implementation by type |
60 | 1 | Jessica Mack | ## push any FileJournal and FileStore references out of osd/* |
61 | 1 | Jessica Mack | # DBObjectMap: refactor interface |
62 | 1 | Jessica Mack | ## expose underlying KeyValueDB transactions to the caller, so they can bundle several DBObjectMap ops together and capture an entire ObjectStore::Transaction's worth of work |
63 | 1 | Jessica Mack | ## expose the user prefixes in a generic way, instead of hard-coding the omap, xattr, and various internal namespaces |
64 | 1 | Jessica Mack | # stripe file data over keys |
65 | 1 | Jessica Mack | ## Build a class that will implement a file data interface (read extent, write extent, truncate, zero, etc.) on top of DBObjectMap |
66 | 1 | Jessica Mack | ## stripe data over keys of size X (e.g., 1MB, which seems to be the limit people are converging around) |
67 | 1 | Jessica Mack | ## store file size information in a metadata key. maybe this can be DBObjectMap::Header; maybe not |
68 | 1 | Jessica Mack | ## contemplate future optimizations that put small objects "inline" in the Header (or equivalent) key |
69 | 1 | Jessica Mack | # build a KeyValueDB implementation based on the new Kinetic API |
70 | 1 | Jessica Mack | ## initially, we can just ignore transactions |
71 | 1 | Jessica Mack | # build a KeyValueDB implementation based on the NVMKV API |
72 | 1 | Jessica Mack | ## opportunistically use batch_put, but otherwise ignore large transaction atomicity |
73 | 1 | Jessica Mack | # build a KeyValueDB implementation based on RocksDB |
74 | 1 | Jessica Mack | ## allow omap location to be configured independently of osd data path? need to consider commit sequence. :/ |
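
To make the "stripe file data over keys" work item more concrete, here is a minimal sketch of a file data interface (read extent, write extent, truncate) layered on a key/value store. It is an illustration only, not the Ceph implementation: the class name @StripedFile@, the key layout (@<name>/data/<index>@ and @<name>/size@), and the use of a plain Python dict in place of DBObjectMap/KeyValueDB are all assumptions made for this example. The stripe size is set to 4 bytes so the behaviour is easy to follow; the notes above suggest something closer to 1MB. The sketch ignores atomicity: a real version would bundle each operation's key updates into one transaction.

```python
class StripedFile:
    """Hypothetical helper: maps byte extents onto fixed-size stripe keys."""

    def __init__(self, kv, name, stripe_size=4):
        self.kv = kv                    # any dict-like key/value store (assumption)
        self.name = name
        self.stripe_size = stripe_size  # "X" in the work item; tiny here for clarity

    def _stripe_key(self, index):
        # Zero-padded hex index so keys sort in stripe order.
        return f"{self.name}/data/{index:016x}"

    def _size_key(self):
        # File size lives in its own metadata key.
        return f"{self.name}/size"

    def size(self):
        return self.kv.get(self._size_key(), 0)

    def write(self, offset, data):
        pos = 0
        while pos < len(data):
            index, within = divmod(offset + pos, self.stripe_size)
            chunk = data[pos:pos + self.stripe_size - within]
            key = self._stripe_key(index)
            # Pad a short or missing stripe with zeros up to the write position.
            old = self.kv.get(key, b"").ljust(within, b"\0")
            self.kv[key] = old[:within] + chunk + old[within + len(chunk):]
            pos += len(chunk)
        self.kv[self._size_key()] = max(self.size(), offset + len(data))

    def read(self, offset, length):
        end = min(offset + length, self.size())
        out = bytearray()
        pos = offset
        while pos < end:
            index, within = divmod(pos, self.stripe_size)
            take = min(self.stripe_size - within, end - pos)
            stripe = self.kv.get(self._stripe_key(index), b"")
            # Missing keys and short stripes read back as zeros (holes).
            out += stripe[within:within + take].ljust(take, b"\0")
            pos += take
        return bytes(out)

    def truncate(self, new_size):
        old_size = self.size()
        if new_size < old_size:
            # Drop every stripe that lies entirely past the new end.
            first_dead = -(-new_size // self.stripe_size)  # ceiling division
            last = (old_size - 1) // self.stripe_size
            for index in range(first_dead, last + 1):
                self.kv.pop(self._stripe_key(index), None)
            # Trim the stripe that straddles the new end, if any.
            index, within = divmod(new_size, self.stripe_size)
            key = self._stripe_key(index)
            if within and key in self.kv:
                self.kv[key] = self.kv[key][:within]
        self.kv[self._size_key()] = new_size


kv = {}
f = StripedFile(kv, "obj")
f.write(2, b"abcdef")   # spans two stripes, leaves a 2-byte hole at the front
print(f.read(0, 8))
```

Keeping the size in a separate key means a sparse object costs no data keys for its holes. Whether that size belongs in DBObjectMap::Header instead is the open question raised in the work item.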