RGW Geo-Replication and Disaster Recovery


Currently all Ceph data replication is synchronous, which means it must be performed over high-speed, low-latency links. This makes WAN-scale replication impractical. There are at least two pressing reasons for wanting WAN-scale replication:

1. Disaster Recovery

Regional disasters have the potential to destroy an entire facility, or take it offline for a prolonged period of time. If data is to remain safe and available in the face of such events, that data must be replicated to another location.

2. Different Geographical Locations

Geographically distributed teams and companies are increasingly common. There are price, performance, convenience, and availability reasons to try to serve each team from local file servers. Work done by a team in one location is often shared with teams in other locations, who would like to be able to access that data from their local file servers.

This blueprint describes these features and their implementation.


Owners

  • Yehuda Sadeh (Inktank)

Interested Parties

Current Status

This feature is slated for the Dumpling release and implementation is currently underway, but additional assistance (to improve the schedule, provide more functionality, and reduce schedule risk) is welcome.

Detailed Description

In a geographically distributed object storage system, sites will be organized into regions and zones.
  • regions are large, distinct geographic areas. A region is made up of multiple zones.
    • a particular bucket is created and replicated only within a single region
    • user metadata is replicated across all regions
  • zones are geographically separated sites, sufficiently independent that they are unlikely to be affected by a single disaster.
    • a bucket can be replicated to multiple zones within that region
    • each bucket has (at any given time) a designated master-zone, from which that bucket can be written
    • all other zones (backup zones) have read-only access to that bucket, though the master zone for a bucket can be changed at any time.
    • the master/backup designation applies to particular buckets. A zone that is a backup for one set of buckets can be master for others.
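The region/zone/bucket relationships above can be sketched in a few lines of Python. This is purely illustrative; the class and field names are hypothetical and do not correspond to actual RGW data structures.

```python
# Illustrative model of the region/zone layout described above.
# All names here are hypothetical, not actual RGW structures.
from dataclasses import dataclass, field

@dataclass
class Zone:
    name: str  # a geographically separated, independent site

@dataclass
class Region:
    name: str
    zones: list  # zones belonging to this region

@dataclass
class Bucket:
    name: str
    region: str       # a bucket is created and replicated within one region
    master_zone: str  # the only zone that accepts writes for this bucket
    replica_zones: list = field(default_factory=list)  # read-only copies

    def writable_in(self, zone_name: str) -> bool:
        # Only the bucket's current master zone accepts writes; the
        # master designation is per bucket and can be reassigned, so a
        # zone that is a backup for one bucket can be master for another.
        return zone_name == self.master_zone
```

For example, a bucket with master zone "us-east" and replica "us-west" would accept writes only through "us-east" until its master designation is moved.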
The basic replication model is:
  • master zones maintain logs of both user-metadata and bucket data updates
  • remote sites can use (new) RESTful APIs to get information about recent updates
  • backup-zone replication agents will use these APIs to track changes in master-zones, pull the updated information, and replay those same changes locally.
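The log/pull/replay cycle above might look roughly like the following sketch of a backup-zone replication agent. The endpoint path, query parameter, and response shape are assumptions for illustration only, not the actual RESTful API.

```python
# Hypothetical sketch of a backup-zone replication agent: poll the
# master zone's update log, then replay each change locally.
# The URL layout and JSON shapes are illustrative assumptions.
import json
import urllib.request

def fetch_log(master_url, marker):
    """Ask the master zone for log entries newer than `marker` (assumed API)."""
    url = f"{master_url}/admin/log?marker={marker}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)  # e.g. {"entries": [...], "marker": "..."}

def replay(entry, local_store):
    """Apply one logged change to the local (backup) zone's store."""
    if entry["op"] == "write":
        local_store[entry["object"]] = entry["data"]
    elif entry["op"] == "delete":
        local_store.pop(entry["object"], None)

def sync_once(master_url, marker, local_store):
    """One poll cycle: fetch recent updates and replay them in order."""
    log = fetch_log(master_url, marker)
    for entry in log["entries"]:
        replay(entry, local_store)
    return log["marker"]  # persist so the next poll resumes from here
```

Persisting the returned marker between cycles is what lets the agent resume after a crash or a prolonged link outage without losing or re-fetching its place in the log.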
This mechanism provides eventual consistency. Backup zones will eventually see all master zone updates, but the delay between master-zone operations and backup-zone replay means that clients in the backup zones will sometimes see old data. There are, however, many benefits to asynchronous, eventual-consistency, pull replication:
  • it is highly robust in the face of link and site failures
  • it does not force master-zone updates to wait for backup-zones to acknowledge (or catch up with) changes
  • it can support arbitrary numbers of replicas
  • it can support the creation of new mirrors at any time (long after the original data creation)
  • it can be done very efficiently (compressing out multiple updates to the same object)
  • while there is a replication delay, it can easily be tuned to be anywhere from seconds to years
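The efficiency point about "compressing out multiple updates to the same object" can be illustrated with a small sketch: before fetching object data, an agent can collapse a batch of log entries so that only the most recent change per object is transferred. The entry format is an assumption for illustration.

```python
# Sketch of batch compaction: when an agent replays a window of log
# entries, only the latest update per object needs to be fetched.
# The entry dict format is an illustrative assumption.
def compact(entries):
    """Keep only the most recent log entry for each object.

    Entries are assumed to arrive oldest-first, so later entries
    overwrite earlier ones for the same object.
    """
    latest = {}
    for entry in entries:
        latest[entry["object"]] = entry
    return list(latest.values())

log = [
    {"object": "a", "op": "write", "version": 1},
    {"object": "b", "op": "write", "version": 1},
    {"object": "a", "op": "write", "version": 2},
]
# compact(log) keeps only a@2 and b@1; a@1 never needs to be transferred.
```

The longer the replication delay is tuned to be, the larger these batches grow and the more intermediate updates can be compressed away.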

(Expanded technical description can be found in the design proposal originally circulated on ceph-devel)

Work items