Jessica Mack, 08/26/2015 02:11 AM

h1. Rgw - Hadoop FileSystem Interface for a RADOS Gateway Caching Tier

h3. Summary

We plan to build a reference solution for running Hadoop over multiple Ceph RGW instances with an SSD cache, similar to the OpenStack Sahara project (Hadoop over Swift). In this solution, all storage servers sit on a network that is isolated from the Hadoop cluster, and the RGW instances act as the connectors between the two networks. We'll leverage Ceph Cache Tier technology to cache data on each RGW server.

h3. Owners

* Yuan Zhou (yuan.zhou@intel.com)
* Jian Zhang (jian.zhang@intel.com)
* Name

h3. Interested Parties

* Name (Affiliation)
* Name (Affiliation)
* Name

h3. Current Status

RGW already supports the Swift API well. However, in the Sahara project Swift is specially configured to work with Hadoop: the swift-proxy server acts like a NameNode, handing out data locations, and Hadoop jobs then read from and write to the swift-object servers directly.

h3. Detailed Description

We'll need some additional work here:
1. RGWFS: a Hadoop-compatible file system that can talk to RGW instances. It will largely follow what SwiftFS (https://github.com/openstack/sahara-extra/tree/master/hadoop-swiftfs) does.
2. RGW-Proxy: a standalone module that reports block locations in RGW. To achieve data locality (each MapReduce (MR) task can read from an RGW instance on the same rack), we'll need to understand the internals of how an RGW object maps to RADOS objects, and how objects map from the Cache Tier (CT) to the Base Tier.
a) RGW-Proxy first gets the manifest from the head object and then looks up the locations of the remaining shadow objects in RADOS.
b) RGW-Proxy can then calculate the re-mapped location in the CT using the appropriate CRUSH rule.
c) With the location in the CT, RGW-Proxy reports which RGW instances to use for each block.
d) RGWFS (the Hadoop job) issues range read requests to fetch the blocks through the closest RGW instances (on the same rack).
3. RGW over Cache Tier: an RGW deployment on top of a Cache Tier that uses SSDs as the cache layer.
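
The RGW-Proxy flow in steps a) to d) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the planned implementation: @RadosObject@, @BlockLocation@, @block_locations@, @closest_rgw@ and @range_header@ are hypothetical names, the manifest is modelled as a plain list of offset/size records, and the CRUSH calculation against the Cache Tier is replaced by a caller-supplied @locate@ function. The host names, rack names and object sizes in the example are made up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RadosObject:
    """One RADOS object backing part of an RGW object (head or shadow)."""
    name: str
    offset: int  # byte offset of this piece within the RGW object
    size: int    # length of this piece in bytes


@dataclass
class BlockLocation:
    offset: int
    length: int
    rgw_hosts: List[str]  # RGW instances close to the cached copy


def block_locations(manifest: List[RadosObject],
                    locate: Callable[[str], List[str]]) -> List[BlockLocation]:
    """Steps a-c: map every piece listed in the manifest to RGW instances.

    `locate` is a hypothetical stand-in for the CRUSH calculation against
    the Cache Tier: it takes a RADOS object name and returns RGW host names.
    """
    return [BlockLocation(o.offset, o.size, locate(o.name)) for o in manifest]


def closest_rgw(block: BlockLocation, client_rack: str,
                rack_of: Dict[str, str]) -> str:
    """Step d: prefer an RGW instance on the client's rack, else the first.

    Assumes every block has at least one RGW host.
    """
    for host in block.rgw_hosts:
        if rack_of.get(host) == client_rack:
            return host
    return block.rgw_hosts[0]


def range_header(block: BlockLocation) -> str:
    """HTTP Range header value for the block (byte ranges are inclusive)."""
    return f"bytes={block.offset}-{block.offset + block.length - 1}"


# Example with made-up data: a 512 KB head object and one 4 MB shadow object.
manifest = [RadosObject("head", 0, 512 * 1024),
            RadosObject("shadow_1", 512 * 1024, 4 * 1024 * 1024)]
placement = {"head": ["rgw-a"], "shadow_1": ["rgw-a", "rgw-b"]}
racks = {"rgw-a": "rack1", "rgw-b": "rack2"}

for block in block_locations(manifest, lambda name: placement[name]):
    print(closest_rgw(block, "rack2", racks), range_header(block))
```

Keeping the placement lookup behind a function argument means the sketch does not depend on how the CRUSH rule is actually evaluated; RGWFS would only need the resulting host list and the byte range for each block.
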
TODO: Can the head object size (512 KB) be made configurable?

h3. Work items

h4. Coding tasks

# Task 1
# Task 2
# Task 3

h4. Build / release tasks

# Task 1
# Task 2
# Task 3

h4. Documentation tasks

# Task 1
# Task 2
# Task 3

h4. Deprecation tasks

# Task 1
# Task 2
# Task 3