RGW Geo-Replication and Disaster Recovery » History » Version 1
Jessica Mack, 06/09/2015 07:32 AM
h1. RGW Geo-Replication and Disaster Recovery

h3. Summary
Currently all Ceph data replication is synchronous, which means that it must be performed over high-speed, low-latency links. This makes WAN-scale replication impractical. There are at least two pressing reasons for wanting WAN-scale replication:

1. Disaster Recovery
Regional disasters have the potential to destroy an entire facility, or take it offline for a prolonged period of time. If data is to remain safe and available in the face of such events, that data must be replicated to another location.

2. Different Geographical Locations
Geographically distributed teams and companies are increasingly common. There are price, performance, convenience, and availability reasons to try to serve each team from local file servers. Work done by a team in one location is often shared with teams in other locations, who would like to be able to access that data from their local file servers.

This blueprint describes these features and their implementation.
h3. Owners

* Yehuda Sadeh (Inktank)

h3. Interested Parties

* Greg Farnum
* Sage Weil
* Loic Dachary
* Christophe Courtaut christophe.courtaut@gmail.com
* Florian Haas
* Daniele Stroppa (ZHAW)
h3. Current Status

This feature is slated for the Dumpling release and implementation is underway, but additional assistance (to improve the schedule, provide more functionality, and reduce schedule risk) is welcome.
h3. Detailed Description

In a geographically distributed object storage system, sites will be organized into _regions_ and _zones_.
* regions are large, distinct geographic areas. A region is made up of multiple zones.
** a particular bucket is created and replicated only within a single region
** user metadata is replicated across all regions
* zones are geographically separated sites, sufficiently independent that they are unlikely to be affected by a single disaster.
** a bucket can be replicated to multiple zones within that region
** each bucket has (at any given time) a designated master zone, from which that bucket can be written
** all other (backup) zones have read-only access to that bucket ... but the master zone for a bucket can be changed at any time.
** the master/backup designation applies to particular buckets. A zone that is a backup for one set of buckets can be master for others.
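The region/zone/bucket relationships above can be sketched as follows. This is a minimal illustration of the access rules only; the names and data layout are invented for this sketch and are not the actual RGW schema or API.

```python
# Hypothetical model of one region with two zones; each bucket names its
# current master zone. Structure is illustrative only.
region = {
    "name": "us",
    "zones": ["us-east", "us-west"],
}

# Per-bucket master designation: a zone can be master for some buckets
# and a backup for others.
bucket_masters = {
    "photos": "us-east",
    "logs": "us-west",
}

def can_write(zone, bucket):
    """A bucket is writable only in its designated master zone."""
    return bucket_masters.get(bucket) == zone

def can_read(zone, bucket):
    """Backup zones retain read-only access to replicated buckets."""
    return zone in region["zones"]
```

Note that "us-west" is a backup for "photos" but the master for "logs", matching the point that master/backup is a per-bucket designation, not a per-zone one.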

The basic replication model is:
* master zones maintain logs of both user-metadata and bucket-data updates
* remote sites can use (new) RESTful APIs to get information about recent updates
* backup-zone replication agents will use these APIs to track changes in master zones, pull the updated information, and replay those same changes locally.
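One pass of such a replication agent might look like the sketch below. The @fetch_log@, @fetch_object@, @apply@, and marker helpers are hypothetical stand-ins for the new RESTful APIs and local replay machinery, not the real RGW interfaces.

```python
# Hypothetical single pass of a backup-zone pull-replication agent:
# read the master zone's update log past our last marker, pull each
# updated object, replay it locally, and persist our progress.
def sync_once(master, local):
    """Replay all master-zone log entries newer than our marker.

    Returns the number of entries replayed. Running this repeatedly
    (e.g. on a timer) converges the backup zone toward the master.
    """
    marker = local.last_marker()               # resume from last replayed entry
    entries = master.fetch_log(since=marker)   # oldest-first update log
    for entry in entries:
        data = master.fetch_object(entry["bucket"], entry["object"])
        local.apply(entry, data)               # replay the same change locally
        marker = entry["marker"]
    local.save_marker(marker)                  # persist progress across restarts
    return len(entries)
```

Because the agent pulls and records its own marker, a link or site failure simply pauses replication; the next successful pass resumes from the saved marker with no coordination from the master zone.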

This mechanism provides _eventual consistency_. Backup zones will eventually see all master-zone updates, but the delay between master-zone operations and backup-zone replay means that clients in the backup zones will sometimes see old data. But there are many benefits to asynchronous, eventual-consistency, pull replication:
* it is highly robust in the face of link and site failures
* it does not force master-zone updates to wait for backup zones to acknowledge (or catch up with) changes
* it can support arbitrary numbers of replicas
* it can support the creation of new mirrors at any time (long after the original data creation)
* it can be done very efficiently (compressing out multiple updates to the same object)
* while there is a replication delay, it can easily be tuned to be anywhere from seconds to years
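The efficiency point about compressing out multiple updates can be illustrated with a small sketch: before replaying a batch, an agent only needs the newest log entry per object. The entry format here is invented for illustration.

```python
# Hypothetical log compaction: if an object was updated several times
# within one batch, only its latest state needs to be fetched and replayed.
def compact_log(entries):
    """Keep only the newest entry per (bucket, object), in log order."""
    latest = {}
    for entry in entries:                      # entries arrive oldest-first,
        latest[(entry["bucket"], entry["object"])] = entry  # so later wins
    return sorted(latest.values(), key=lambda e: e["marker"])
```

The longer the replication delay is tuned, the more updates accumulate per batch and the more this compaction saves, which is one reason the delay can reasonably range from seconds to years.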

(An expanded technical description can be found in the "design proposal":http://www.spinics.net/lists/ceph-devel/msg11905.html originally circulated on ceph-devel.)

h3. Work items