Feature #7114
Hinted recovery
Status: open, 0% done
Description
When constructing multi-site RADOS object stores, where inter-site connectivity is at a premium, it would be advantageous to have remapped placement groups backfill from an OSD located in a defined bucket in the CRUSH hierarchy. Currently, when a placement group is remapped, the new member of the placement group communicates with the primary to determine which objects are missing, and the primary then sends them to the target OSD. Instead, the primary could provide a hint to the OSD with affinity to the new OSD that it should send the objects.
Example:
dc-1
  pod-1
    access-1
      host-1
        osd.1
  pod-2
    access-2
      host-2
        osd.2
dc-2
  pod-3
    access-3
      host-3
        osd.3
  pod-4
    access-4
      host-4
        osd.4
dc-3
  pod-5
    access-5
      host-5
        osd.5
  pod-6
    access-6
      host-6
        osd.6
- pg 123 is mapped to osd.1, osd.3 and osd.5, osd.1 is primary
- pod-3 grows by N osds, pg remapped from osd.3 to osd.4
Normal behavior: osd.4 backfills from osd.1 (stresses inter-dc links)
Desired behavior: osd.4 backfills from osd.3 (avoids inter-dc links)
This allows a cluster to be built so that the inter-dc bandwidth requirements are the sum of ingest and recovery.
Updated by Kyle Bader over 10 years ago
Instead of "a defined bucket in the CRUSH hierarchy", it probably makes more sense to say that it uses the nearest common ancestor in the CRUSH hierarchy.
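As a rough illustration of the nearest-common-ancestor idea, here is a minimal Python sketch. It is not Ceph code: the CRUSH_PATHS table is a hand-written model of the example hierarchy from the description, and common_depth and pick_backfill_source are hypothetical helper names. It assumes the best backfill source is the existing replica that shares the deepest bucket with the new OSD.

```python
# Hypothetical model of the example hierarchy: each OSD maps to its chain of
# CRUSH buckets, outermost first. This is an illustration, not a Ceph API.
CRUSH_PATHS = {
    "osd.1": ["dc-1", "pod-1", "access-1", "host-1"],
    "osd.2": ["dc-1", "pod-2", "access-2", "host-2"],
    "osd.3": ["dc-2", "pod-3", "access-3", "host-3"],
    "osd.4": ["dc-2", "pod-4", "access-4", "host-4"],
    "osd.5": ["dc-3", "pod-5", "access-5", "host-5"],
    "osd.6": ["dc-3", "pod-6", "access-6", "host-6"],
}


def common_depth(a, b):
    """Count the leading buckets two OSDs share (0 means different DCs)."""
    depth = 0
    for x, y in zip(CRUSH_PATHS[a], CRUSH_PATHS[b]):
        if x != y:
            break
        depth += 1
    return depth


def pick_backfill_source(target, candidates):
    """Return the candidate whose nearest common ancestor with target is deepest.

    max() keeps the first of several equal candidates, so listing the
    primary first makes it the fallback when nothing is closer.
    """
    return max(candidates, key=lambda osd: common_depth(osd, target))


# pg 123 lives on osd.1 (primary), osd.3 and osd.5; osd.4 is the new member.
source = pick_backfill_source("osd.4", ["osd.1", "osd.3", "osd.5"])
print(source)
```

With the example mapping, osd.3 is selected because it shares dc-2 with osd.4, while osd.1 and osd.5 share no bucket with it. A real implementation would have to read the hierarchy from the CRUSH map and confirm that the chosen OSD actually holds the missing objects.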
Updated by Loïc Dachary over 10 years ago
David Zafman is working on multiple backfills to address the case where the primary must send chunks to multiple OSDs when using erasure coding. That does not help with the problem you're describing; I'm just mentioning this for the record.