Feature #15910
Increase the default value of mon_osd_min_in_ratio
Description
Why don't we default the mon_osd_min_in_ratio value to something larger than 30%? I would suggest somewhere around 70% - 85%.
If there are networking problems, a low threshold lets too many OSDs be marked out, and that wreaks havoc. Heavy data movement and the remaining OSDs running out of space are just two examples.
Upon installation of a Ceph cluster, the value of this option should be carefully considered. For example, a 3-rack installation may be able to tolerate 1 rack getting disconnected, but at that point further OSD failures should be ignored. This suggests that a setting of about 66% would be appropriate.
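To make the rack arithmetic concrete, here is a minimal sketch of how the threshold bounds the number of OSDs that can be marked out. It is a simplified model, not the monitor's code: the function names are hypothetical, and it assumes the check compares the current in ratio against the threshold before each mark-out.

```python
def may_mark_out(num_in: int, num_total: int, min_in_ratio: float) -> bool:
    """Assumed rule: refuse to mark out once the in ratio is below the threshold."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return num_in / num_total >= min_in_ratio


def max_marked_out(num_total: int, min_in_ratio: float) -> int:
    """Count how many OSDs could be marked out one by one, starting with all in."""
    num_in = num_total
    while num_in > 0 and may_mark_out(num_in, num_total, min_in_ratio):
        num_in -= 1
    return num_total - num_in


if __name__ == "__main__":
    # Hypothetical cluster: 3 racks of 10 OSDs each.
    for ratio in (0.66, 0.75):
        print(ratio, max_marked_out(30, ratio))
```

Under this model, a 30-OSD cluster with a threshold of 0.66 allows 11 mark-outs (a full rack of 10 plus one more OSD), while 0.75 stops after 8, before a whole rack can be marked out.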
Another idea would be to have OSDMonitor::can_mark_out() consider the fullness of the cluster as another criterion. If there are full OSDs, or if some other calculation based on losing a particular OSD indicates a problem, it could return false even if the cluster is above the mon_osd_min_in_ratio.
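A rough sketch of what such a fullness-aware check could look like follows. This is only an illustration of the idea, not the actual OSDMonitor logic: the Osd class, the full_ratio parameter, and the assumption that the data of the removed OSD spreads evenly over the remaining in OSDs are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Osd:
    used_bytes: int
    capacity_bytes: int
    is_in: bool = True


def can_mark_out(osds: List[Osd], target: int,
                 min_in_ratio: float, full_ratio: float) -> bool:
    """Return True if the OSD at index `target` may be marked out."""
    in_idx = [i for i, o in enumerate(osds) if o.is_in]
    # Existing criterion: too few OSDs are in already.
    if len(in_idx) / len(osds) < min_in_ratio:
        return False
    remaining = [osds[i] for i in in_idx if i != target]
    if not remaining:
        return False
    # Proposed criterion 1: refuse if any remaining OSD is already full.
    if any(o.used_bytes / o.capacity_bytes >= full_ratio for o in remaining):
        return False
    # Proposed criterion 2: refuse if the data held by all in OSDs would
    # push the remaining OSDs past full_ratio (assumes even spread).
    total_used = sum(osds[i].used_bytes for i in in_idx)
    total_capacity = sum(o.capacity_bytes for o in remaining)
    return total_used / total_capacity < full_ratio
```

For example, with four in OSDs of capacity 100 each and a full_ratio of 0.95, marking one out is allowed when each holds 60 (projected 240/300) but refused when each holds 80 (projected 320/300), even though the in ratio is well above the threshold.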
The general problem of running out of space due to OSD failures has been around for a long time.
Updated by David Zafman almost 8 years ago
- Related to Feature #2911: osd: Restrict recovery when the OSD full list is nonempty added
Updated by David Zafman about 7 years ago
- Related to Bug #15912: An OSD was seen getting ENOSPC even with osd_failsafe_full_ratio passed added
Updated by David Zafman about 7 years ago
- Status changed from New to Resolved
Fix included in the fix for #15912