Yehuda Sadeh, 06/16/2015 08:24 PM


RGW NEW MULTISITE CONFIG

Summary
As part of the new multisite scheme, we are changing the way the system is configured.

Owners
Orit Wasserman (Red Hat)
Yehuda Sadeh (Red Hat)
Name

Interested Parties
If you are interested in contributing to this blueprint, or want to be a "speaker" during the Summit session, list your name here.
Name (Affiliation)
Name (Affiliation)
Name

Current Status
Design work and initial implementation work are underway.

Detailed Description

Definitions

A zone is a collection of pools and radosgw instances in the same cluster that serve the same data.

A zonegroup is a collection of zones that replicate to or from each other. The zones may or may not span different clusters.

A zone realm is a collection of zonegroups that share the same user and bucket namespace.

A period is a period of time during which a given zonegroup configuration is in effect. Each period references the previous period that preceded it and will record basic metadata like the start time. During each period there may be changes to the zone and zonegroup maps; each of these changes will increment the period epoch.
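
The relationship between periods and epochs can be sketched as follows. This is only an illustration in Python: the class name, the field names, and the choice of 1 as the starting epoch are assumptions, not part of the design.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Period:
    """Hypothetical in-memory model of a period; field names are illustrative."""
    realm: str
    predecessor: Optional[str] = None  # id of the period that preceded this one
    start_time: float = 0.0            # basic metadata recorded for the period
    epoch: int = 1                     # assumed starting value
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def record_map_change(self) -> int:
        """A change to the zone or zonegroup map bumps the period epoch."""
        self.epoch += 1
        return self.epoch

    def successor(self, start_time: float) -> "Period":
        """Start a new period that references this one as its predecessor."""
        return Period(realm=self.realm, predecessor=self.id, start_time=start_time)


first = Period(realm="dho", start_time=0.0)
first.record_map_change()            # e.g. a zone was added to a zonegroup
second = first.successor(start_time=100.0)
```

Here `second` points back at `first` through `predecessor`, while the epoch only counts map changes made within a single period.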

1. Configuration Changes

1.1 Zonegroup map

The zonegroup map holds a map of the entire system, and certain configurables for the different realms, zonegroups and zones. It holds the relationships between the different zonegroups and other configuration:

For the period
- which period preceded it
- the version vector for the previous period’s metadata log
- list of zonegroups
- which zonegroup is the new metadata master
- which zones belong to each zonegroup
- which zone is master for each zonegroup

For the realm
- which zonegroup is the master for user/bucket metadata
- list of zonegroups in the realm
- current period

For each zonegroup
- access url[s] (for control/replication API)
- existing storage policies
- which zone is master for metadata
- which zone(s) are master or slave for data
- list of zones

For each zone
- id, name
- access url[s]
- peers
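
As a rough illustration of the information listed above, the map could be pictured as a nested structure like the one below. All key names are hypothetical, the period id is a placeholder, and the choice of us-east-1 as the master zone of us-east is an assumption; the zone and zonegroup names are borrowed from the usage example later in this document.

```python
# Hypothetical layout of a zonegroup map; key names are illustrative only.
zonegroup_map = {
    "realm": {
        "name": "dho",
        "master_zonegroup": "us-west",   # master for user/bucket metadata
        "zonegroups": ["us-west", "us-east"],
        "current_period": "<period-uuid>",
    },
    "zonegroups": {
        "us-west": {
            "endpoints": ["http://us-west-1.example.com"],
            "master_zone": "us-west-1",
            "zones": ["us-west-1", "us-west-2"],
        },
        "us-east": {
            "endpoints": ["http://us-east-1.example.com"],
            "master_zone": "us-east-1",  # assumed
            "zones": ["us-east-1", "us-east-2"],
        },
    },
    "zones": {
        "us-west-1": {"endpoints": ["http://us-west-1.example.com"], "peers": ["us-west-2"]},
        "us-west-2": {"endpoints": ["http://us-west-2.example.com"], "peers": ["us-west-1"]},
        "us-east-1": {"endpoints": ["http://us-east-1.example.com"], "peers": ["us-east-2"]},
        "us-east-2": {"endpoints": ["http://us-east-2.example.com"], "peers": ["us-east-1"]},
    },
}


def metadata_master_zone(zgmap: dict) -> str:
    """Return the master zone of the master zonegroup."""
    master_zonegroup = zgmap["realm"]["master_zonegroup"]
    return zgmap["zonegroups"][master_zonegroup]["master_zone"]
```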

There will be one zone that will be designated as the master in the master zonegroup, and will manage all user and bucket creation (metadata) and control of the zonegroup map. In order to make a change to the system configuration, a command will be sent to the url of this master and the new configuration will propagate to the rest of the system.
rgw will be able to handle dynamic changes to the zonegroup and zone configuration.

The zonegroup map will have a version epoch that is incremented after every change.

.rgw.root
default_realm -> $realm
realm.$realm  -> current period
period.$realm.$uuid.$epoch -> period object
period.$realm.$uuid -> latest $epoch
zone.$zone -> $realm
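
The object naming above can be exercised with a small sketch. A plain Python dict stands in for the .rgw.root pool, and the helper names and sample values are assumptions made for illustration.

```python
from typing import Dict, Optional


def realm_key(realm: str) -> str:
    return f"realm.{realm}"


def period_key(realm: str, period_uuid: str, epoch: Optional[int] = None) -> str:
    """Without an epoch this names the 'latest epoch' pointer object."""
    base = f"period.{realm}.{period_uuid}"
    return base if epoch is None else f"{base}.{epoch}"


def load_current_period(store: Dict[str, object], realm: Optional[str] = None) -> object:
    """Follow default_realm -> realm -> latest epoch -> period object."""
    if realm is None:
        realm = store["default_realm"]
    period_uuid = store[realm_key(realm)]
    latest_epoch = store[period_key(realm, period_uuid)]
    return store[period_key(realm, period_uuid, latest_epoch)]


# A dict standing in for the .rgw.root pool; the values are made up.
store = {
    "default_realm": "dho",
    "realm.dho": "p1",
    "period.dho.p1": 2,
    "period.dho.p1.1": {"epoch": 1},
    "period.dho.p1.2": {"epoch": 2},
    "zone.us-west-1": "dho",
}
```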

multi site todo

period and realm data structures
APIs for pushing and pulling zone metadata
rgw needs to do watch/notify or poll on the realm.$realm object, restart as needed
gracefully drain requests on old backend instance; startup new one on epoch or period change

metadata sync todo

user instance
save version for every metadata object
log versions for every metadata object
log metadata about which period we are on, which objects are dirty/stale, rollback/rollforward state

1.2 Defining a new zonegroup

Currently, in order to define a new zonegroup, we need to inject a JSON document that holds the zonegroup configuration, update the zonegroupmap, distribute that zonegroupmap to all existing zonegroups, and restart all rgws for the change to take effect. I don't think this is a good scheme.

A zonegroup will have a zonegroup id, and a zonegroup name. For backward compatibility, older zonegroups will have their zonegroup_id equal to their name.

When setting up a new zonegroup, we'll need to specify an entry point for the 'master' zonegroup. That zonegroup will be in control of the zonegroupmap, and it will distribute the zonegroupmap updates to all zones.

If the zonegroup that we set up is the first zonegroup, we'll need to specify it on the command line. We won't be able to set up secondary zonegroups if the master has not been specified.

1.3. Defining a new zone

Currently, when running an rgw it does the following:

Read the rgw_zone configurable, check the root pool for the configuration of this zone. If rgw_zone is not defined it will read the default zone name out of the
it will create the 'default' zone, and assign it as the default.

Once a zone name has been set, it cannot really be changed. The zone names are embedded in the rados object names that are created to hold the actual rgw objects.

In order to support zone renaming, and more dynamic configuration we should create a logical 'zone id' that the zone name will point at. The zone id will be a string. When creating a new zone it will be auto generated, and will not be modified. For backward compatibility, older zones will have a zone_id that will match their zone name.
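
A minimal sketch of the name-to-id indirection, assuming a simple in-memory directory (the class and method names are hypothetical): new zones get a generated id, legacy zones keep an id equal to their name, and a rename only changes which name points at the id.

```python
import uuid
from typing import Dict


class ZoneDirectory:
    """Hypothetical mapping from zone name to zone id."""

    def __init__(self) -> None:
        self._ids: Dict[str, str] = {}

    def create(self, name: str) -> str:
        """New zones get an auto-generated id that is never modified."""
        zone_id = str(uuid.uuid4())
        self._ids[name] = zone_id
        return zone_id

    def adopt_legacy(self, name: str) -> str:
        """Older zones keep a zone id that matches their name."""
        self._ids[name] = name
        return name

    def rename(self, old_name: str, new_name: str) -> None:
        """Renaming only re-points the name; the id stays the same."""
        self._ids[new_name] = self._ids.pop(old_name)

    def zone_id(self, name: str) -> str:
        return self._ids[name]


zones = ZoneDirectory()
legacy_id = zones.adopt_legacy("default")
new_id = zones.create("us-west-1")
zones.rename("us-west-1", "us-west-primary")
```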

To set up a new zone, the rgw command will include the url to the master zonegroup, and keys to access it. It will also include the name of the zonegroup this zone should reside in. If this zonegroup does not exist, it will be created (if the appropriate parameter was passed in). The master zonegroup will create a new system user for this specific zone, and will send it back.

When a new zone starts up, we'll auto-create all the rados pools that it will use. It will first need to determine whether pools already exist, and are already assigned to a different zone. The naming scheme for the pools would be something like:

.{zone_id}-x-{pool-name}
rgw.$zoneid.$pool
.rgw - bucket -> bucketid metadata
.users - user index
.users.swift
.users.uid
.control - contains notify object
.log - metadata log, which-buckets-have-changed log
.gc - garbage collection
.usage - sharded usage stats
.bucket-index
.bucket-data - bucket data
.bucket-data-nonec - non-ec bucket data
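
The pool check could look roughly like this. The sketch assumes the `rgw.$zoneid.$pool` form of the naming scheme and a hypothetical `existing` mapping from pool name to owning zone id; neither is settled by the text above.

```python
from typing import Dict, List

# Pool suffixes taken from the list above, with the leading dot dropped.
POOL_SUFFIXES = [
    "rgw", "users", "users.swift", "users.uid", "control", "log",
    "gc", "usage", "bucket-index", "bucket-data", "bucket-data-nonec",
]


def zone_pools(zone_id: str) -> List[str]:
    """Pool names for a zone, assuming the rgw.$zoneid.$pool scheme."""
    return [f"rgw.{zone_id}.{suffix}" for suffix in POOL_SUFFIXES]


def pools_to_create(zone_id: str, existing: Dict[str, str]) -> List[str]:
    """Return the pools still missing; refuse pools owned by another zone."""
    missing = []
    for pool in zone_pools(zone_id):
        owner = existing.get(pool)
        if owner is None:
            missing.append(pool)
        elif owner != zone_id:
            raise ValueError(f"pool {pool} is already assigned to zone {owner}")
    return missing
```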

We want to allow the same gateway to be part of multiple zones; this will give us much more flexibility. Different zones will have different ports.

1.4. Dynamic zonegroup and zone changes

rgw will be able to identify changes to the zonegroupmap, and to the zone configuration. This will be done by the following:

- rgw will be able to restart itself with a new rados backend handler (RGWRados) after detecting that a configuration change has been made. It will finish handling existing requests, but restart all the frontend handlers with the new RGWRados config.
- rgw will set a specific watch/notify handler that will be used to get updates about the zonegroupmap configuration.
- Upon receiving a change, the master zone in the master zonegroup will send a message to all the different zonegroups about the new configuration change.

Any synchronization activity will be dynamically re-set according to the new configuration.
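
The drain-and-swap behaviour described above can be sketched as follows. `Backend` and `Gateway` are hypothetical Python stand-ins (the real handler is RGWRados), and the reference counting shown is just one possible way to let existing requests finish on the old backend.

```python
import threading


class Backend:
    """Stand-in for a rados backend handler bound to one config epoch."""

    def __init__(self, epoch: int) -> None:
        self.epoch = epoch
        self.in_flight = 0
        self.closed = False


class Gateway:
    def __init__(self, epoch: int) -> None:
        self._lock = threading.Lock()
        self._backend = Backend(epoch)

    def begin_request(self) -> Backend:
        """New requests always use the current backend."""
        with self._lock:
            backend = self._backend
            backend.in_flight += 1
            return backend

    def end_request(self, backend: Backend) -> None:
        with self._lock:
            backend.in_flight -= 1
            # A replaced backend is closed once its last request finishes.
            if backend.in_flight == 0 and backend is not self._backend:
                backend.closed = True

    def on_config_change(self, new_epoch: int) -> bool:
        """Called from a watch/notify handler; stale epochs are ignored."""
        with self._lock:
            old = self._backend
            if new_epoch <= old.epoch:
                return False
            self._backend = Backend(new_epoch)
            if old.in_flight == 0:
                old.closed = True
            return True
```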

1.5. New RESTful apis

1.5.1 Get period information

GET /admin/realm/period?[period-id=<period-id>][&epoch=<epoch>]

period-id: optional
epoch: optional

Output:

A JSON representation of the current period, or the specified period
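
For illustration, a client could build this request URL as below. The helper name is hypothetical and the endpoint is borrowed from the usage example later in this document; only the path and the two optional parameters come from the description above.

```python
from typing import Optional
from urllib.parse import urlencode


def period_url(endpoint: str, period_id: Optional[str] = None,
               epoch: Optional[int] = None) -> str:
    """Build the GET /admin/realm/period URL with its optional parameters."""
    params = {}
    if period_id is not None:
        params["period-id"] = period_id
    if epoch is not None:
        params["epoch"] = epoch
    url = endpoint.rstrip("/") + "/admin/realm/period"
    return url + "?" + urlencode(params) if params else url
```

Calling `period_url("http://us-west-1.example.com")` asks for the current period, while passing `period_id` and `epoch` selects a specific one.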

1.5.2 Request children to fetch period:

POST /admin/realm/period?[period-id=<period-id>][&epoch=<epoch>]

Input:

period-id: optional
epoch: optional

A JSON representation of the current period, or the specified period

1.5.3. Initialize new zone

Will be sent by the config utility (probably radosgw-admin) to the master zonegroup.

POST /admin/zonegroup?init-zone

Input:

a JSON representation of the following:

  • zonegroup name
  • zone name
  • list of peers (zone ids)

Output:

a JSON representation of the following:
  • metadata of user to be used by zone
  • new zonegroup map
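
The request and response bodies might be shaped roughly as follows. The JSON key names (`zonegroup`, `zone`, `peers`, `user`, `zonegroup_map`) are assumptions made for this sketch; the design above only lists what information is carried.

```python
import json
from typing import Iterable, Tuple


def init_zone_request(zonegroup: str, zone: str, peers: Iterable[str]) -> str:
    """Serialize the init-zone input; key names are hypothetical."""
    body = {"zonegroup": zonegroup, "zone": zone, "peers": list(peers)}
    return json.dumps(body, sort_keys=True)


def parse_init_zone_response(body: str) -> Tuple[dict, dict]:
    """Split the reply into the zone's system user and the new zonegroup map."""
    data = json.loads(body)
    return data["user"], data["zonegroup_map"]


request_body = init_zone_request("us-west", "us-west-2", ["us-west-1"])
```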

1.5.4. Notify of zonegroup map change

POST /admin/zonegroup?reconfigure

Input:

- new zonegroup map

1.6. New radosgw-admin, radosgw interfaces:

1.6.1 period

$ radosgw-admin period prepare --parent=<parent> <uuid>

Creates a new period object in the .rgw.root pool.

$ radosgw-admin period activate <uuid>

Switches to a new period, which must be a child of the current period.
The admin needs to reconfigure all the gateways. At first, the gateways will need to be restarted to use the new period; in the future they will support dynamic configuration.

$ radosgw-admin period pull

Pulls the latest period map from the current period master.
Requires that radosgw-admin uses the RESTful API.

$ radosgw-admin period pull <remote> <uuid> [--url=<url>]

Fetches info about a specific remote period.
url: optionally provide the remote entry point

$ radosgw-admin period push

Asks all children to pull the latest epoch.
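
The rule that an activated period must be a child of the current period can be expressed as a small check. This is a sketch only: the `parents` mapping (period id to parent period id) and the function name are hypothetical.

```python
from typing import Dict, Optional


def activate_period(parents: Dict[str, Optional[str]], current: str,
                    candidate: str) -> str:
    """Return the new current period id, or raise if activation is invalid."""
    if candidate not in parents:
        raise KeyError(f"unknown period {candidate}")
    if parents[candidate] != current:
        raise ValueError(f"period {candidate} is not a child of {current}")
    return candidate


# p2 was prepared with --parent=p1; p3 was prepared with --parent=p2.
parents = {"p1": None, "p2": "p1", "p3": "p2"}
current = activate_period(parents, "p1", "p2")
```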

We need to create a mechanism to allow the admin to communicate with other gateways.

1.6.2 zone realm

$ radosgw-admin realm create --realm=<name>

Creates a new zone realm; implicitly creates the first period.

$ radosgw-admin realm remove --realm=<name> [--realm-id=<id>] --zonegroup=<name>

Removes a zonegroup from a realm.

$ radosgw-admin realm delete --realm=<name>

Deletes a realm; the realm needs to be empty.

$ radosgw-admin realm rename --realm=<old name> --new-realm-name=<new name> [--realm-id=<id>]

Renames a realm.

$ radosgw-admin realm set-default --realm=dho

Sets the realm as the default realm.

$ radosgw-admin realm get --realm=<name> | --realm-id=<id>

Gets realm information.

1.6.3 zonegroup

$ radosgw-admin zonegroup create --zonegroup=<name> [--zonegroup-id=<id>] [--master | --master-url=<url> | --realm=<name>]

When running a remote command that contacts the master zonegroup, we'll also need to provide a uid and an access key. This can be done by specifying --uid and --access-key on the command line (which is a bit of a security problem), or by setting them in ceph.conf (which is a bit of a pain).

$ radosgw-admin zonegroup delete --zonegroup=<name> [--zonegroup-id=<id>] [--master-url=<url>]

Removes a zonegroup; the zonegroup needs to be empty.

$ radosgw-admin zonegroup rename --zonegroup=<old name> [--zonegroup-id=<id>] [--master-url=<url>] --zonegroup-new-name=<new name>

Renames a zonegroup.

1.6.4 creating a new zone

$ radosgw-admin zone create --rgw-zone=<zone_name> --zonegroup=<zonegroup_name> --url=<zone url> [--master | --master-url=<url>]

This command will either set the initial master zone for the system, or will create a new zone. It will generate a new random zoneid (uuid).

radosgw will no longer create pools automagically when it starts up. Zone creation will always be an explicit step by the admin.

1.6.5 Modifying zone configuration:

- Connect zone to another peer (meaning these two zones will sync to/from each other)

$ radosgw-admin zone connect [--rgw-zone=<zone name>] [--zone-id=<zone id>] --peer-zone-id=<peer id> | --peer-zone=<peer name>

- Disconnect zone from another peer

$ radosgw-admin zone disconnect [--rgw-zone=<zone name>] [--zone-id=<zone id>] --peer-zone-id=<peer id> | --peer-zone=<peer name>

- Configure a zone placement target (storage policy)

$ radosgw-admin placement modify --placement-target=<name> --zone-id=<id> ... (TBD what exactly)

- Check zone sync status:

$ radosgw-admin zone sync status [--rgw-zone=<zone name>]

Will provide current markers and timestamps for specified zone.
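
Since connecting two zones means they sync to and from each other, the peer relationship can be modelled as symmetric. The sketch below uses a hypothetical in-memory peer table keyed by zone id; it is not how the configuration is actually stored.

```python
from collections import defaultdict
from typing import DefaultDict, Set

PeerTable = DefaultDict[str, Set[str]]


def connect(peers: PeerTable, zone_id: str, peer_id: str) -> None:
    """Record the peering in both directions."""
    if zone_id == peer_id:
        raise ValueError("a zone cannot be its own peer")
    peers[zone_id].add(peer_id)
    peers[peer_id].add(zone_id)


def disconnect(peers: PeerTable, zone_id: str, peer_id: str) -> None:
    """Remove the peering in both directions; a no-op if not connected."""
    peers[zone_id].discard(peer_id)
    peers[peer_id].discard(zone_id)


peers: PeerTable = defaultdict(set)
connect(peers, "us-west-1", "us-west-2")
connect(peers, "us-west-1", "us-east-1")
disconnect(peers, "us-west-2", "us-west-1")
```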

1.6.6 removing a zone from a zonegroup

$ radosgw-admin zone remove --rgw-zone=<zone_name> [--zone-id=<zone id>] --zonegroup=<zonegroup_name>

1.6.7 delete a zone

$ radosgw-admin zone delete --rgw-zone=<zone_name> [--zone-id=<zone id>]

Removes the zone from the system; the zone will be removed from all the zonegroups.

1.6.8 rename a zone

$ radosgw-admin zone rename --rgw-zone=<zone_name> [--zone-id=<zone id>] --zone-new-name=<new name>

1.7 single standalone zone

$ radosgw-admin zone create --zone=foo
  rgw.foo.{users,buckets,...}
$ radosgw --zone=foo

We allow a zone to run without being added to a realm and zonegroup.
A zone that already holds data can be added to a realm only if it is the first zone added.

1.7.1 create a replica

option 1:

B$ radosgw-admin zone create --zone=foo-backup
B$ radosgw --zone=foo-backup

A$ radosgw-admin realm create --realm=dho   # implicitly creates an initial period
A$ radosgw-admin zonegroup create --realm=dho --zonegroup=us-west
A$ radosgw-admin zonegroup add --zonegroup us-west --zone=foo --cluster-uuid=blah 
-> these all change the period metadata .. no effect on radosgw yet!
A$ radosgw-admin zone join --zone=foo --realm=dho
A$ killall -1 radosgw
-> now radosgw knows it belongs to a realm and is watching the period
A$ radosgw-admin zonegroup add --zonegroup us-west --zone=foo-backup --cluster-uuid=blah
A$ radosgw-admin period show
-> period references foo-backup, but foo-backup is still ignorant of all this

B$ radosgw-admin period pull http://cluster-a [perioduuid]
B$ radosgw-admin zone join --zone=foo-backup
B$ killall -1 radosgw

option 2:

A$ radosgw-admin realm create --realm=dho --default
A$ radosgw-admin zonegroup create --zonegroup=us-west
A$ radosgw-admin zone create --zone=us-west-1 --realm=dho
A$ radosgw-admin zonegroup add --zonegroup=us-west --realm=dho --zone=us-west-1
A$ radosgw --zone=us-west-1
B$ radosgw-admin pull http://…. --realm=dho   # now B knows the realm exists
B$ radosgw-admin realm set-default --realm=dho
B$ radosgw-admin zone create --zone=us-west-2 --realm=dho
B$ radosgw --zone=us-west-2
A$ radosgw-admin zonegroup add --zone=us-west-2
B$ radosgw-admin pull               # now B knows the new zone is part of the ZG

Bootstrap:


A$ radosgw-admin realm bootstrap-master --realm=dho --zonegroup=us-west --zone=us-west-1
A$ radosgw --zone=us-west-1
B$ radosgw-admin realm bootstrap --realm=dho --realm-endpoint=http://
B$ radosgw-admin zone bootstrap --realm=dho --zonegroup=us-west --zone=us-west-2 
B$ radosgw --zone=us-west-2
B$ radosgw-admin zone bootstrap --realm=dho --zonegroup=us-west --zone=us-west-3
B$ radosgw --zone=us-west-3

1.7. A usage example. Setting up two zonegroups, with two zones in each:

Zonegroup: us-west

Zone: us-west-1 (ceph cluster 1)
- url: http://us-west-1.example.com
Zone: us-west-2 (ceph cluster 2)
- url: http://us-west-2.example.com

Zonegroup: us-east

Zone: us-east-1 (ceph cluster 2)
- url: http://us-east-1.example.com
Zone: us-east-2 (ceph cluster 3)
- url: http://us-east-2.example.com

- In ceph cluster 1:
$ radosgw-admin zonegroup create --zonegroup=us-west --master --url=http://us-west-1.example.com
$ radosgw-admin zone create --rgw-zone=us-west-1 --zonegroup=us-west --url=http://us-west-1.example.com
$ radosgw --rgw-zone=us-west-1
- In ceph cluster 2:
$ radosgw-admin zone init --rgw-zone=us-west-2 --zonegroup=us-west --url=http://us-west-2.example.com --master-url=http://us-west-1.example.com
$ radosgw --rgw-zone=us-west-2
$ radosgw-admin zonegroup init --zonegroup=us-east --url=http://us-east-1.example.com --master-url=http://us-west-1.example.com
$ radosgw-admin zone init --rgw-zone=us-east-1 --zonegroup=us-east --url=http://us-east-1.example.com --master-url=http://us-west-1.example.com
$ radosgw --rgw-zone=us-east-1
- In ceph cluster 3:
$ radosgw-admin zone init --rgw-zone=us-east-2 --zonegroup=us-east --url=http://us-east-2.example.com --master-url=http://us-west-1.example.com
$ radosgw --rgw-zone=us-east-2

Note that these commands don't include the access keys to access the master zone. This will also need to be set, either through the command line, or via ceph.conf.

1.8. Optional simplification:
Instead of creating a zone and running radosgw, we can do it in one step via radosgw itself, e.g.:

 $ radosgw --rgw-zone=us-west-1 --zonegroup=us-west --init-zone --url=http://us-west-1.example.com

We can do the same for the zonegroup creation, so that every zone + zonegroup creation can be squashed to a single radosgw command.

Work items

Coding tasks
Task 1
Task 2
Task 3

Build / release tasks
Task 1
Task 2
Task 3

Documentation tasks
Task 1
Task 2
Task 3

Deprecation tasks
Task 1
Task 2
Task 3

Updated by Yehuda Sadeh almost 9 years ago · 1 revisions