RGW NEW MULTISITE SYNC

Summary
We're reworking the way we do multisite synchronization. This includes an active-active model, changes to the metadata synchronization, and a sync process that is internal to the radosgw processes.

Owners
Yehuda Sadeh (Red Hat)
Orit Wasserman (Red Hat)

Interested Parties
If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.

Current Status
The design has been worked out; initial implementation work has started.

Detailed Description
Here's the new sync scheme that we discussed. Note that it is very similar to the old scheme, but it adds push notifications. It does not specify how concurrency between multiple workers will be achieved, but there are a few ways to implement that: lock shards as with the old sync agent, have a single elected worker per zone (using watch/notify for the election), use watch/notify to coordinate the work, specify workers manually, and potentially other solutions.
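As a rough illustration of the shard-lock option, here is a minimal Python sketch. All names here (ShardLocks, claim_shards, the shard count and lease duration) are hypothetical, not existing RGW interfaces; the real implementation would presumably take renewable leases on sharded log objects in RADOS.

    import random
    import time

    NUM_LOG_SHARDS = 128    # assumption: the logs are sharded, as with the old sync agent
    LOCK_DURATION = 60      # seconds; a lease must be renewed before it expires

    class ShardLocks:
        """Toy in-memory stand-in for per-shard lock objects in RADOS."""
        def __init__(self):
            self.owners = {}    # shard id -> (worker id, lease expiry)

        def try_lock(self, shard, worker_id):
            now = time.time()
            owner = self.owners.get(shard)
            if owner is None or owner[1] < now:
                self.owners[shard] = (worker_id, now + LOCK_DURATION)
                return True
            return owner[0] == worker_id    # renewing our own lease succeeds

    def claim_shards(locks, worker_id):
        """Whoever holds the lock for a shard syncs that shard's log entries."""
        shards = list(range(NUM_LOG_SHARDS))
        random.shuffle(shards)              # spread contention between workers
        return [s for s in shards if locks.try_lock(s, worker_id)]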

Note that this is going to be implemented as part of the gateway, which gives us more flexibility in how we leverage RADOS to store the sync state. Cross-zone communication will still be done using a RESTful API.

The idea is to work on roughly the same premise as before. We'll have 3 logs: a metadata log, a data log, and a bucket index log. We'll add push notifications to make changes appear more quickly on the destination. The design supports active-active zones and a federated architecture.

  • Multi-zonegroup, multi-zone architecture

There is still only a single zone that is responsible for metadata updates. This zone is called the 'master' zone, and every other zone needs to make metadata changes against it.

Each zonegroup can have multiple zones. Each zone can have multiple peer zones, not necessarily all the zones within that zonegroup, but it is required that there is a path between all the zones in the zonegroup (a connected graph).

zonegroup:
  name
  is_master?
  master zone
  list of zones

zone:
  containing zonegroup
  list of peers
  zone endpoints
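For concreteness, the two records above could be modeled like this (a Python sketch; the field names follow the outline above and are not a final wire format):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ZoneGroup:
        name: str
        is_master: bool                 # is this the master zonegroup?
        master_zone: str                # the only zone that accepts metadata updates
        zones: List[str] = field(default_factory=list)

    @dataclass
    class Zone:
        name: str
        zonegroup: str                  # containing zonegroup
        peers: List[str] = field(default_factory=list)      # not necessarily all zones in the zonegroup
        endpoints: List[str] = field(default_factory=list)  # RESTful endpoints for cross-zone calls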

Each bucket instance within each zone has a unique incrementing version id that is used to keep track of changes on that specific bucket.

A zone keeps a sync state recording how far it is synced with regard to each of its peers. A zone also keeps a metadata sync state against the master zone.

zone_data_sync_status:
  state: init, full_sync, incremental
  list of bucket instance states

bucket_instance_state:
  state: full_sync (keep start_marker + position) | incremental (keep position)
  list of object retries
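A possible shape for this state, again as a Python sketch, assuming markers are opaque strings taken from the source zone's logs:

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import Dict, List, Optional

    class SyncState(Enum):
        INIT = 'init'
        FULL_SYNC = 'full_sync'
        INCREMENTAL = 'incremental'

    @dataclass
    class BucketInstanceState:
        state: SyncState
        start_marker: Optional[str] = None   # bucket index log position when full sync began
        position: Optional[str] = None       # how far we have gotten (object key or log marker)
        object_retries: List[str] = field(default_factory=list)  # objects whose sync failed

    @dataclass
    class ZoneDataSyncStatus:
        state: SyncState
        buckets: Dict[str, BucketInstanceState] = field(default_factory=dict)  # keyed by bucket instance id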

The idea is that if we're doing a full sync of a bucket, we need to keep the source zone's bucket index position, so that later on we can catch all the changes that went in after we started the full sync. We also keep our position within the full sync (the last object we synced). In addition, before starting the full sync, we need to record our position in the data (changed buckets) log.

In the incremental stage, we need to keep the bucket index position. We follow the data log and sync each bucket instance that changed there.
Also, for every failed object sync we need to keep a retry entry.
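The handoff between full sync and incremental sync might then look roughly like this. This is a sketch built on the SyncState/BucketInstanceState structures above; fetch_bucket_index_marker, list_objects_from and the source client are hypothetical helpers, not existing RGW calls:

    class SyncError(Exception):
        pass

    def sync_object(bucket, obj, source):
        """Placeholder: fetch one object from the source zone and store it locally."""

    def full_sync_bucket(bucket, state, source):
        # Snapshot the source bucket index position *before* copying anything,
        # so every change made while the full sync runs is replayed afterwards.
        if state.start_marker is None:
            state.start_marker = source.fetch_bucket_index_marker(bucket)
        for obj in source.list_objects_from(bucket, state.position):
            try:
                sync_object(bucket, obj, source)
            except SyncError:
                state.object_retries.append(obj)    # retried on a later pass
            state.position = obj                    # resume point after a restart
        # Full sync is done: switch to incremental, replaying the bucket index
        # log from the position snapshotted above.
        state.state = SyncState.INCREMENTAL
        state.position = state.start_marker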

zone data sync stages:

init:
  fetch the list of all the bucket instances and keep them in a sharded, sorted list

sync:
  for each bucket:
    if the bucket does not exist, fetch the bucket and bucket.instance metadata from the master zone
    sync the bucket

Also, we need to keep a list of all the buckets that have objects that need to be resent.
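Putting the stages together, a single-worker version of the data sync loop could be sketched as follows. It builds on the sketches above; list_bucket_instances, fetch_bucket_metadata and read_data_log are stand-ins for the RESTful calls against the peer zone:

    def incremental_sync_bucket(bucket, state, source):
        """Placeholder: replay the source bucket index log from state.position."""

    def zone_data_sync(status, peer, local):
        if status.state == SyncState.INIT:
            # init: snapshot the full list of bucket instances (sharded and
            # sorted, so multiple workers could split the work later)
            for b in peer.list_bucket_instances():
                status.buckets[b] = BucketInstanceState(state=SyncState.FULL_SYNC)
            status.state = SyncState.FULL_SYNC

        if status.state == SyncState.FULL_SYNC:
            for b, bstate in status.buckets.items():
                if not local.bucket_exists(b):
                    # bucket and bucket.instance metadata come from the master zone
                    local.store_bucket_metadata(peer.fetch_bucket_metadata(b))
                full_sync_bucket(b, bstate, peer)
            status.state = SyncState.INCREMENTAL

        # incremental: follow the peer's data log and re-sync every bucket
        # instance that shows up there
        for entry in peer.read_data_log():
            incremental_sync_bucket(entry.bucket, status.buckets[entry.bucket], peer)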

Metadata sync:

Similar to the data sync:

metadata_sync_status:
  state: init, full_sync, incremental

In the init state: record the current position of the metadata log, then list all the metadata entries that exist and keep them in a sharded, sorted list.
Full sync: for each entry in the list, sync it (fetch and store).
Incremental: follow the changes in the metadata log and store them.
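A matching sketch for the metadata sync, with the same hedges (read_mdlog_position, list_metadata_entries, fetch_metadata and read_mdlog are stand-ins for the master zone's REST endpoints, and SyncState comes from the earlier sketch):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class MetadataSyncStatus:
        state: SyncState
        marker: Optional[str] = None                      # position in the master's metadata log
        entries: List[str] = field(default_factory=list)  # snapshot of all metadata entries

    def metadata_sync(status, master, local):
        if status.state == SyncState.INIT:
            # Record the mdlog position *first*, so changes made while we are
            # listing are replayed during the incremental stage.
            status.marker = master.read_mdlog_position()
            status.entries = master.list_metadata_entries()   # sharded, sorted
            status.state = SyncState.FULL_SYNC

        if status.state == SyncState.FULL_SYNC:
            for entry in status.entries:
                local.store_metadata(master.fetch_metadata(entry))
            status.state = SyncState.INCREMENTAL

        for change in master.read_mdlog(from_marker=status.marker):
            local.store_metadata(change.entry)
            status.marker = change.marker       # persist progress as we go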

Status inspection:

Provide the status of each zone as a difference with regard to its peers (e.g., the mtime of the oldest non-synced change).
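One way such a report could be computed, assuming each data log entry carries an mtime (a sketch; marker_for and read_data_log are hypothetical):

    def sync_status_report(zone, peers):
        """For each peer, report the mtime of the oldest change we have not
        applied yet; None means we are fully caught up with that peer."""
        report = {}
        for peer in peers:
            pending = peer.read_data_log(from_marker=zone.marker_for(peer))
            mtimes = [entry.mtime for entry in pending]
            report[peer.name] = min(mtimes) if mtimes else None
        return report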

Push notifications:

A zone will send changes to all its connected peers as they happen. It will either send changes one by one, or accumulate changes for a period of time and then send them in a batch. These are just hints, so that the peers can pick up the changes more quickly; if a hint is missed, the change will still be picked up later through the regular sync process. The notifications will be done using a POST request from the source zone to the destination zone.
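A minimal batching notifier might look like the sketch below. It uses the Python requests library, and the /admin/log-notify endpoint name is an assumption for illustration, not a defined API:

    import json
    import threading
    import requests

    class ChangeNotifier:
        """Accumulate changes for a short window, then POST the batch to every
        connected peer. Notifications are best-effort hints: failures are
        ignored, since the regular sync process picks the changes up anyway."""
        def __init__(self, peer_endpoints, flush_interval=2.0):
            self.peers = peer_endpoints
            self.flush_interval = flush_interval    # a timer would call flush() this often
            self.pending = []
            self.lock = threading.Lock()

        def add_change(self, change):
            with self.lock:
                self.pending.append(change)

        def flush(self):
            with self.lock:
                batch, self.pending = self.pending, []
            if not batch:
                return
            for endpoint in self.peers:
                try:
                    requests.post(endpoint + '/admin/log-notify',
                                  data=json.dumps(batch), timeout=5)
                except requests.RequestException:
                    pass    # a missed hint is harmless; periodic sync covers it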

  • Active-active considerations

Each change has a 'source zone' assigned to it.
A change will not be applied if the destination zone's version mtime is greater or equal.
- We should keep a higher-precision mtime as an object attribute; the stat() mtime only has second granularity, which is problematic when ordering concurrent changes.
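The apply rule could be expressed as follows (a sketch; it assumes the high-precision mtime is stored as an object attribute, as suggested above):

    def should_apply_change(change, dest_obj):
        """Decide whether an incoming change wins over the local version."""
        if dest_obj is None:
            return True     # no local version yet, always apply
        # Do not apply if the destination's version mtime is greater or equal;
        # this relies on a high-precision mtime stored as an object attribute,
        # since the second-granularity stat() mtime cannot order concurrent
        # writes arriving from different zones.
        return change.mtime > dest_obj.mtime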

Work items

Coding tasks

Build / release tasks

Documentation tasks

Deprecation tasks