Assume a 72-hour SLA for host replacement/reprovisioning. When a host goes down (hardware failure), we expect complete cluster recovery to take ~48+ hours. If we lose one more disk anywhere else during this interval, we lose write access (min_size=2) to a subset of the 36 million objects; hopefully, a much smaller subset. If yet another disk fails, we lose data permanently. Losing a host plus another two disks (out of 576 disks in total) within 48+ hours is a non-zero probability. While we understand that this is an inherent risk with any distributed system, we are not happy that most of the recovery time is spent while less than 10% of objects are degraded (a very long tail). If we maintained a roughly constant repair rate (for simplicity, ignoring client/recovery throttling), we could have reduced the exposure window from 48 hours to 12 or fewer.
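To make "non-zero probability" concrete, here is a back-of-envelope sketch of the chance of two or more additional disk failures during the recovery window. The 2% annualized failure rate is an illustrative assumption (not a measured value), the exponential/binomial model ignores correlated failures, and using all 576 disks slightly overstates the at-risk population since some of them are on the dead host:

```python
import math

def p_disk_fails(window_hours: float, afr: float = 0.02) -> float:
    """Probability that a single disk fails within the window, assuming a
    constant failure rate (exponential model) with annualized rate `afr`."""
    return 1.0 - math.exp(-afr * window_hours / 8760.0)

def p_at_least_two(n_disks: int, window_hours: float, afr: float = 0.02) -> float:
    """Binomial probability that two or more of n_disks fail in the window,
    i.e. 1 - P(zero failures) - P(exactly one failure)."""
    p = p_disk_fails(window_hours, afr)
    p0 = (1.0 - p) ** n_disks
    p1 = n_disks * p * (1.0 - p) ** (n_disks - 1)
    return 1.0 - p0 - p1

# 576 disks, assumed 2% AFR: compare the 48h and 12h exposure windows.
print(f"48h window: {p_at_least_two(576, 48):.4%}")
print(f"12h window: {p_at_least_two(576, 12):.4%}")
```

Under these assumptions the risk scales roughly with the square of the window length, so shrinking the window from 48 to 12 hours cuts the double-failure probability by more than an order of magnitude.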
Note: osd_max_backfills is at its default value (i.e. 10)
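If raising backfill concurrency turns out to be the right mitigation, the limit can be inspected and changed at runtime via the centralized config store. The value 16 below is an arbitrary example, not a recommendation; higher values trade client I/O latency for faster recovery:

```shell
# Inspect the current cluster-wide limit.
ceph config get osd osd_max_backfills

# Raise it cluster-wide; persists in the monitor config store.
ceph config set osd osd_max_backfills 16

# Or push it to running OSDs immediately without persisting.
ceph tell 'osd.*' injectargs '--osd-max-backfills=16'
```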