Feature #7114

Hinted recovery

Added by Kyle Bader over 10 years ago. Updated over 10 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

When constructing multi-site RADOS object stores where inter-site connectivity is at a premium, it would be advantageous for remapped placement groups to be able to backfill from an OSD located in a defined bucket in the CRUSH hierarchy. Currently, when a placement group is remapped, the new member of the placement group communicates with the primary to determine which objects are missing, and the primary then sends those objects to the target OSD. Instead, the primary could provide a hint to an OSD with affinity to the new OSD, indicating that it should send the objects.

Example:

dc-1
  pod-1
    access-1
      host-1
        osd.1
  pod-2
    access-2
      host-2
        osd.2

dc-2
  pod-3
    access-3
      host-3
        osd.3
  pod-4
    access-4
      host-4
        osd.4

dc-3
  pod-5
    access-5
      host-5
        osd.5
  pod-6
    access-6
      host-6
        osd.6

  • pg 123 is mapped to osd.1, osd.3 and osd.5; osd.1 is primary
  • pod-4 grows by N OSDs, and pg 123 is remapped from osd.3 to osd.4

Normal behavior: osd.4 backfills from osd.1 (stresses inter-dc links)

Desired behavior: osd.4 backfills from osd.3 (avoids inter-dc links)

This allows a cluster to be built so that the inter-dc bandwidth requirements are the sum of ingest and recovery.
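The selection described above can be sketched outside of Ceph. The snippet below (hypothetical helper names, not Ceph code) models the example hierarchy as root-to-host paths and picks, among the surviving members of the acting set, the OSD that shares the deepest path prefix with the backfill target:

```python
# Map each OSD to its path from the CRUSH root down to its host,
# mirroring the example hierarchy above.
CRUSH_PATHS = {
    "osd.1": ["dc-1", "pod-1", "access-1", "host-1"],
    "osd.2": ["dc-1", "pod-2", "access-2", "host-2"],
    "osd.3": ["dc-2", "pod-3", "access-3", "host-3"],
    "osd.4": ["dc-2", "pod-4", "access-4", "host-4"],
    "osd.5": ["dc-3", "pod-5", "access-5", "host-5"],
    "osd.6": ["dc-3", "pod-6", "access-6", "host-6"],
}

def proximity(a, b):
    """Length of the shared path prefix: a deeper common ancestor means closer."""
    shared = 0
    for x, y in zip(CRUSH_PATHS[a], CRUSH_PATHS[b]):
        if x != y:
            break
        shared += 1
    return shared

def pick_backfill_source(target, acting):
    """Prefer the acting-set member nearest to the backfill target."""
    return max(acting, key=lambda osd: proximity(target, osd))

# pg 123 after the remap: osd.4 is the backfill target.
print(pick_backfill_source("osd.4", ["osd.1", "osd.3", "osd.5"]))
# → osd.3 (same dc-2), avoiding the inter-dc links to dc-1 and dc-3
```

With this hint, osd.4 pulls its missing objects from osd.3 over intra-dc links, while osd.1 remains authoritative for deciding which objects are missing.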

Actions #1

Updated by Kyle Bader over 10 years ago

Instead of "a defined bucket in the CRUSH hierarchy", it probably makes more sense to use the nearest common ancestor in the CRUSH hierarchy.
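"Nearest common ancestor" can be made concrete with a small sketch (again hypothetical names, not Ceph code): walk two OSDs' CRUSH paths from the root and return the deepest bucket they share.

```python
def nearest_common_ancestor(path_a, path_b):
    """Deepest CRUSH bucket shared by two root-to-host paths."""
    nca = "root"  # implicit root bucket above the datacenters
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        nca = x
    return nca

# Using the hierarchy from the example:
print(nearest_common_ancestor(
    ["dc-2", "pod-3", "access-3", "host-3"],    # osd.3
    ["dc-2", "pod-4", "access-4", "host-4"]))   # osd.4
# → dc-2
print(nearest_common_ancestor(
    ["dc-1", "pod-1", "access-1", "host-1"],    # osd.1
    ["dc-2", "pod-4", "access-4", "host-4"]))   # osd.4
# → root
```

Among the acting-set members, the one whose common ancestor with the target is deepest (here osd.3, sharing dc-2) would be the preferred backfill source, so no bucket needs to be configured by hand.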

Actions #2

Updated by Loïc Dachary over 10 years ago

David Zafman is working on multiple backfills to address the case where the primary must send chunks to multiple OSDs when using erasure coding. That does not help with the problem you're describing; I'm just mentioning it for the record.
