Increase the default value of mon_osd_min_in_ratio
Why don't we default the mon_osd_min_in_ratio value to something larger than 30%? I would suggest somewhere around 70% - 85%.
If there are networking problems, marking too many OSDs out is going to wreak havoc. The resulting data movement and the remaining OSDs running out of space are just two examples.
When installing a Ceph cluster, the value of this option should be considered carefully. For example, a 3-rack installation may be able to tolerate 1 rack getting disconnected, but at that point further OSD failures should not cause any more OSDs to be marked out. In that case a value of about 66% would be appropriate.
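To make the arithmetic behind the 66% figure explicit, here is a rough sketch. It assumes every rack holds the same number of OSDs, and the function name is made up for illustration; it is not part of Ceph.

```python
def suggested_min_in_ratio(total_racks: int, tolerated_rack_failures: int) -> float:
    """Fraction of OSDs still "in" after losing the tolerated number of racks.

    Assumption: racks are equally sized, so each rack holds 1/total_racks
    of the OSDs in the cluster.
    """
    if total_racks <= 0:
        raise ValueError("total_racks must be positive")
    if not 0 <= tolerated_rack_failures < total_racks:
        raise ValueError("tolerated_rack_failures must be in [0, total_racks)")
    return (total_racks - tolerated_rack_failures) / total_racks


if __name__ == "__main__":
    # 3 racks, tolerate losing 1: two thirds of the OSDs remain in.
    ratio = suggested_min_in_ratio(3, 1)
    print(f"mon_osd_min_in_ratio ~ {ratio:.2f}")
```

With unevenly sized racks the same idea applies, but the ratio would have to be computed from OSD counts per rack rather than from the rack count.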
Another idea would be to have OSDMonitor::can_mark_out() consider the fullness of the cluster as another criterion. If there are full OSDs, or if some other calculation based on losing a particular OSD indicates a problem, it could return false even if the in ratio is above mon_osd_min_in_ratio.
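As a rough illustration of the fullness idea, here is a simplified Python model. It is not the real OSDMonitor::can_mark_out() code: the `Osd` class, the projected-usage calculation, and the even spread of data across all surviving OSDs (ignoring CRUSH placement) are all assumptions made for the sketch. The 0.95 threshold is meant to mirror a full-ratio setting.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Osd:
    """Hypothetical stand-in for per-OSD state; not a Ceph type."""
    used_bytes: int
    capacity_bytes: int  # assumed to be > 0
    is_in: bool = True


def can_mark_out(osds: List[Osd], candidate: int,
                 min_in_ratio: float = 0.3, full_ratio: float = 0.95) -> bool:
    """Return True if the OSD at index `candidate` may be marked out."""
    if not osds:
        return False

    # Existing criterion: refuse if too few OSDs are currently "in".
    in_osds = [o for o in osds if o.is_in]
    if len(in_osds) / len(osds) < min_in_ratio:
        return False

    # Proposed extra criterion (assumption): check fullness of the survivors.
    survivors = [o for i, o in enumerate(osds) if o.is_in and i != candidate]
    if not survivors:
        return False

    # Refuse if any surviving OSD is already at or above the full threshold.
    if any(o.used_bytes / o.capacity_bytes >= full_ratio for o in survivors):
        return False

    # Refuse if the data currently on "in" OSDs would not fit on the
    # survivors below the threshold (even-spread approximation).
    total_used = sum(o.used_bytes for o in in_osds)
    total_capacity = sum(o.capacity_bytes for o in survivors)
    return total_used / total_capacity < full_ratio
```

For example, with four OSDs that are each 80% used, this model would refuse to mark one out, because the three survivors could not absorb the data, even though the in ratio is well above the minimum.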
The general problem of running out of space due to OSD failures has been around for a long time.