Librados/Objecter - smarter localized reads


Allow reads from object replicas on the same host, in the same rack, in the same data center, or closest by some other distance metric.


  • Name (Affiliation)

Interested Parties

Current Status

There is currently a librados/libcephfs/Objecter flag that tells the client to read from a non-primary replica if it happens to be on the same host (as indicated by a matching IP address). This works in some situations but is otherwise pretty limited.

Detailed Description

Clusters typically have some sort of hierarchical structure, and we conveniently have that handy in the CRUSH map. We want to use that information to choose a replica that is closest to us.
The key missing piece of information is where the client is, in a form that can be reconciled with the CRUSH map. If the client has a set of key/value pairs indicating its location in the same terms that an OSD's location is reflected by CRUSH, we can use that to choose the replica that is closest to us.
The 'crush location' config option would be a simple list of key/value pairs, e.g.

crush location = host=foo rack=bar room=baz

We might allow multiple locations to be listed:

crush location = host=foo host=foo2 rack=bar

which would be useful when there are parallel hierarchies and we want to indicate locality for both.
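
As a rough illustration of the parsing step, here is a minimal Python sketch (the real Objecter is C++; this is not its implementation). The function name parse_crush_location and the choice to reject malformed entries are assumptions made for the example. Each type maps to a set of names so that repeated keys such as host=foo host=foo2 are all kept.

```python
from collections import defaultdict


def parse_crush_location(value):
    """Parse a 'crush location' string into {type: {name, ...}}.

    Hypothetical helper: repeated keys are collected into a set so
    that parallel hierarchies (host=foo host=foo2) are preserved.
    """
    location = defaultdict(set)
    for token in value.split():
        key, sep, name = token.partition("=")
        # Assumption: an entry without '=' or with an empty side is an error.
        if not sep or not key or not name:
            raise ValueError(f"malformed crush location entry: {token!r}")
        location[key].add(name)
    return dict(location)


if __name__ == "__main__":
    print(parse_crush_location("host=foo host=foo2 rack=bar"))
```

Whether a malformed entry should be a hard error or silently skipped is a design choice the proposal leaves open.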
Objecter would look for the match against the CRUSH hierarchy with the lowest-valued CRUSH type (i.e., a matching host is closer than a matching rack).
This would be triggered by the existing LOCALIZED_READS flag implemented in the Objecter and exposed to varying degrees by librados and libcephfs.
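
The selection rule described above could look roughly like the following Python sketch. Everything named here is hypothetical: the CRUSH_TYPES table uses made-up type ids (a real cluster takes them from its CRUSH map), and the functions match_distance and choose_replica are illustrative names. Falling back to the primary when nothing matches, and preferring the earlier OSD on ties, are assumptions rather than part of the proposal.

```python
# Hypothetical type ids for illustration; lower means closer.
# A real client would read these from the CRUSH map.
CRUSH_TYPES = {"host": 1, "rack": 2, "room": 3}


def match_distance(client_loc, osd_loc, types=CRUSH_TYPES):
    """Return the lowest type id at which client and OSD match, or None.

    client_loc maps type -> set of names (multiple locations allowed);
    osd_loc maps type -> single name, as derived from the CRUSH hierarchy.
    """
    best = None
    for type_name, names in client_loc.items():
        type_id = types.get(type_name)
        if type_id is None:
            continue  # type unknown to this CRUSH map; ignore it
        if osd_loc.get(type_name) in names:
            if best is None or type_id < best:
                best = type_id
    return best


def choose_replica(client_loc, acting, osd_locs, types=CRUSH_TYPES):
    """Pick the OSD to read from; acting lists OSD ids, primary first.

    Assumption: with no match at any level we fall back to the primary,
    and ties go to the OSD that appears earlier in the acting set.
    """
    best_osd, best_dist = acting[0], None
    for osd in acting:
        dist = match_distance(client_loc, osd_locs.get(osd, {}), types)
        if dist is not None and (best_dist is None or dist < best_dist):
            best_osd, best_dist = osd, dist
    return best_osd


if __name__ == "__main__":
    client = {"host": {"foo", "foo2"}, "rack": {"bar"}}
    osd_locs = {
        0: {"host": "a", "rack": "other"},
        1: {"host": "b", "rack": "bar"},
        2: {"host": "foo2", "rack": "bar"},
    }
    print(choose_replica(client, [0, 1, 2], osd_locs))
```

In this example OSD 2 shares a host with the client, so it wins over OSD 1, which only shares a rack.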

Work items

Coding tasks

  1. Objecter: add 'crush location' config option and parse on init
  2. Objecter: add additional API calls to adjust this location setting at runtime
  3. Objecter: choose the closest replica based on location information, when it is specified. This can either supplement or (more likely) replace the current check for a matching IP address.
  4. librados, libcephfs: expose explicit API to set the location. This would supplement simply setting the 'crush location = ...' config option.
  5. [maybe] Update hadoop bindings to use the new API
  6. [maybe] librbd: set localized reads flag on clone parents?

Build / release tasks

  1. Build/expand localized-reads test to verify the correct replica is chosen

Documentation tasks

  1. Document API changes

Tracker Links