Fix #9566 (open)

osd: prioritize recovery of OSDs with most work to do

Added by Sheldon Mustard over 9 years ago. Updated over 8 years ago.

Status: Need More Info
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Support
Tags:
Backport:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Assume a 72-hour SLA for host replacement/reprovisioning. When a host goes down (hardware failure), we expect complete cluster recovery to take ~48+ hours. If we lose one more disk anywhere else during this interval, we lose write access (min_size=2) to a subset of the 36 million objects; hopefully a much smaller subset. If another disk fails after that, we lose data permanently. Losing a host and another two disks (out of 576 disks in total) within 48+ hours has a non-zero probability.

While we understand that this is an inherent risk with any distributed system, we are not happy that most of the recovery time is spent when less than 10% of objects are degraded (a very long tail). If we maintained a more or less constant repair rate (for simplicity, not accounting for client/recovery throttling), we could reduce the exposure window from 48 hours to 12 or fewer.
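For a back-of-the-envelope check of that claim (a sketch with assumed figures only; `bulk_fraction` and `bulk_hours` below are illustrative, not measurements from this cluster):

```python
# Exposure-window arithmetic, using assumed illustrative figures.
total_objects = 36_000_000   # objects degraded by the host failure
observed_hours = 48          # observed end-to-end recovery time
bulk_fraction = 0.90         # share of objects recovered before the tail
bulk_hours = 12              # assumed time to recover that first 90%

# Repair rate sustained during the bulk phase, in objects per hour.
bulk_rate = bulk_fraction * total_objects / bulk_hours

# If that rate held for the whole recovery instead of tailing off:
constant_rate_hours = total_objects / bulk_rate
print(f"bulk-phase rate: {bulk_rate:,.0f} objects/h")
print(f"constant-rate recovery: {constant_rate_hours:.1f} h "
      f"(vs {observed_hours} h observed)")
```

Under these assumptions the bulk-phase rate is 2.7M objects/h, so a recovery that sustained it end to end would finish in about 13 hours rather than 48.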

Note: osd_max_backfills is at its default (i.e. 10).
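Since osd_max_backfills only caps per-OSD concurrency and says nothing about ordering, here is a rough sketch of the prioritization this ticket asks for (Python used as illustration only; the real scheduler lives in the C++ OSD code, and the `backlog` map of per-OSD degraded-object counts is a hypothetical input):

```python
import heapq

def recovery_order(backlog):
    """Yield (osd_id, degraded_objects) in descending order of work left.

    backlog: hypothetical dict of osd id -> degraded object count,
    which a real scheduler would derive from PG stats.
    """
    # Max-heap via negated counts: the OSD with the most work comes first.
    heap = [(-count, osd) for osd, count in backlog.items()]
    heapq.heapify(heap)
    while heap:
        neg_count, osd = heapq.heappop(heap)
        yield osd, -neg_count

# Example: three OSDs with very different amounts of recovery work.
backlog = {3: 1_200_000, 17: 40_000, 42: 9_000_000}
for osd, remaining in recovery_order(backlog):
    print(f"osd.{osd}: {remaining:,} degraded objects")
```

Granting recovery/backfill reservations in this order would keep the busiest OSDs saturated early, which is what flattens the long tail described above.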
