Rgw - bucket index scalability

Summary

Currently, the index for each bucket is kept in a single RADOS object. This can become a scalability bottleneck, because update operations on a single RADOS object do not scale.
A second problem with one large bucket index object is that while backfill or recovery is in progress on it, all updates to that object (which sit on the critical path of a PUT request) are stalled until the object is fully recovered. In our testing, with 3 million objects in a bucket, recovery takes around 2 minutes, so most requests during that period time out.

Owners

  • Guang Yang (Yahoo!)

Interested Parties

  • Yehuda Sadeh (Inktank)
  • Anip Patel (Arizona State University, student)
  • Guang Yang (Yahoo!)

Current Status

Currently, there's a single bucket index per bucket. This is mostly an issue when writing objects, as most read operations don't access the bucket index.
The bucket index is needed for listing objects. With the new DR (disaster recovery) development, the bucket index also holds the bucket index log, which tracks objects that were modified.
One question we want to examine is whether there are cases that can benefit from eliminating the bucket index.

Detailed Description

There are two main approaches, which are not mutually exclusive: the first is to shard the bucket index data, and the second is to eliminate the bucket index completely ("blind buckets").
- Bucket sharding
Each bucket's index will be split into multiple shards. There will be a main bucket index object that holds the bucket index map. The main question here is how to rebuild or split the bucket index dynamically.
We can start with static sharding, with a configured number of shards for each bucket; the number of shards is persisted along with the bucket metadata. Some modifications are needed for each type of operation:
  1. Bucket creation - create all bucket index shard objects in one shot.
  2. PUT/COPY/DELETE object - use static hashing to find the bucket index shard to operate on (static sharding means we do not support changing the number of shards dynamically).
  3. Bucket listing / stats / fix indexing - issue AIO requests to all the shards to do a prefix listing, then aggregate the data (sorted by object id).
  4. Bucket index log - this is the most complicated case. The log uses an internal, monotonically increasing version for each shard. With multiple shards there is no global version number that increases monotonically across all shards, so each shard has to maintain its own version number. We solve this by returning the bucket index shard id along with the bucket index entries, so that the client can aggregate a marker for listing purposes.
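
The per-operation changes above can be illustrated with a small sketch. It is not the RGW implementation: the hash function (CRC32), the in-memory lists standing in for per-shard AIO listing results, and the "shard_id#position" marker encoding are all assumptions made for illustration, and every function name here is hypothetical.

```python
import heapq
import itertools
import zlib
from typing import Dict, List


def shard_for_key(object_key: str, num_shards: int) -> int:
    """Pick the index shard for an object with a stable hash.

    CRC32 is an assumption; any hash that never changes for a given
    key works, since the shard count is fixed at bucket creation.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return zlib.crc32(object_key.encode("utf-8")) % num_shards


def merged_listing(shard_keys: List[List[str]], prefix: str = "",
                   max_keys: int = 1000) -> List[str]:
    """Aggregate per-shard listings into one sorted listing.

    shard_keys stands in for the results of the per-shard prefix
    listings; each inner list must already be sorted.
    """
    per_shard = [[k for k in keys if k.startswith(prefix)]
                 for keys in shard_keys]
    return list(itertools.islice(heapq.merge(*per_shard), max_keys))


def encode_marker(per_shard: Dict[int, str]) -> str:
    """Combine per-shard log positions into one client-side marker.

    Hypothetical encoding: "shard_id#position" pairs joined by commas.
    Positions are assumed not to contain a comma.
    """
    return ",".join(f"{sid}#{per_shard[sid]}" for sid in sorted(per_shard))


def decode_marker(marker: str) -> Dict[int, str]:
    """Split a combined marker back into per-shard positions."""
    positions: Dict[int, str] = {}
    if not marker:
        return positions
    for part in marker.split(","):
        sid, _, pos = part.partition("#")
        positions[int(sid)] = pos
    return positions
```

Because each shard keeps its own version counter, the combined marker carries one position per shard rather than a single global version; a client resuming a log listing would pass each shard its own position.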

- Blind bucket
In certain environments it will be possible to eliminate the bucket index completely. An open question is whether it will still be possible to list objects in these environments (listing will certainly not be as efficient). Another question is whether blind buckets can be incorporated into a DR solution.
Currently we store the number of shards for a bucket along with the bucket's metadata, so it makes sense to extend the S3 API to let users specify that they do not need bucket listing/stats. For such buckets we can disable bucket indexing completely.
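
A minimal sketch of how a per-bucket setting could gate index updates follows. The `blind` flag, the `Bucket` class, and the in-memory index are hypothetical stand-ins chosen for illustration; the proposal does not fix how the option would be stored or exposed through the S3 API, and whether listing can be supported at all for blind buckets remains an open question.

```python
import zlib
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Bucket:
    name: str
    num_shards: int = 1
    blind: bool = False  # hypothetical flag stored with the bucket metadata
    index: Dict[int, Set[str]] = field(default_factory=dict)

    def put_object(self, key: str) -> None:
        # The object data write itself would happen here in either mode.
        if self.blind:
            return  # blind bucket: no index update on the write path
        shard = zlib.crc32(key.encode("utf-8")) % self.num_shards
        self.index.setdefault(shard, set()).add(key)

    def list_objects(self) -> List[str]:
        if self.blind:
            # In this sketch, listing is simply rejected for blind buckets.
            raise NotImplementedError("listing is unavailable for blind buckets")
        return sorted(k for keys in self.index.values() for k in keys)
```

In this sketch a blind bucket skips the index entirely on PUT, which removes the index object from the write path at the cost of listing and stats.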

Pad

http://pad.ceph.com/p/GH-bucket-index-scalability
