Feature #15910

closed

Increase the default value of mon_osd_min_in_ratio

Added by David Zafman almost 8 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Assignee: David Zafman
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Why don't we default the mon_osd_min_in_ratio value to something larger than 30%? I would suggest somewhere around 70% - 85%.

If there are networking problems, marking too many OSDs out is just going to wreak havoc. Data movement and running out of space on the remaining OSDs are just some examples.

When a Ceph cluster is installed, the value of this option should be considered carefully. For example, a 3-rack installation may be able to tolerate 1 rack getting disconnected, but at that point further OSD failures should be ignored. In that case a setting of about 66% might be appropriate.
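The rack arithmetic above can be sketched in a few lines of Python. This is only an illustration, not Ceph code: the helper names are hypothetical, racks are assumed to hold equal numbers of OSDs, and the rule "refuse to mark out once the in fraction is below the threshold" is an assumption about how the check behaves; the exact comparison in OSDMonitor may differ.

```python
def rack_loss_min_in_ratio(total_racks: int, tolerated_racks: int) -> float:
    """Fraction of OSDs still 'in' after losing the tolerated number of racks.

    Hypothetical helper; assumes every rack holds the same number of OSDs.
    """
    if not 0 <= tolerated_racks < total_racks:
        raise ValueError("tolerated_racks must be in [0, total_racks)")
    return (total_racks - tolerated_racks) / total_racks


def can_mark_out(in_count: int, num_osds: int, min_in_ratio: float) -> bool:
    """Assumed rule: refuse a mark-out when the current 'in' fraction
    is already below min_in_ratio."""
    if num_osds <= 0:
        return False
    return in_count / num_osds >= min_in_ratio


# 3 racks, tolerate losing 1: roughly two thirds of the OSDs remain in.
ratio = rack_loss_min_in_ratio(3, 1)

# 30 OSDs, one rack (10 OSDs) already marked out.
# With a 30% threshold the monitor would keep marking OSDs out.
keeps_marking_out = can_mark_out(20, 30, 0.30)
# With a 66% threshold it stops once the in count falls to 19.
stops_after_rack = not can_mark_out(19, 30, 0.66)
```

Under these assumptions, a threshold close to the fraction left after the tolerated rack loss stops further mark-outs almost immediately, while 30% allows many more.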

Another idea would be to have OSDMonitor::can_mark_out() consider the fullness of the cluster as another criterion. If there are full OSDs, or if some other calculation based on losing a particular OSD indicates a problem, it could return false even when the cluster is above mon_osd_min_in_ratio.
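A minimal Python sketch of that idea follows. It is not the OSDMonitor implementation: the Osd record, the function name, the full_ratio parameter, and the projection (all data from the in OSDs must fit on the OSDs that remain in) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Osd:
    """Hypothetical per-OSD record used only for this sketch."""
    used_bytes: int
    capacity_bytes: int
    is_in: bool = True


def can_mark_out_with_fullness(osds: List[Osd], candidate: Osd,
                               min_in_ratio: float = 0.30,
                               full_ratio: float = 0.95) -> bool:
    """Return False if marking `candidate` out looks unsafe.

    Three checks, all assumptions about the proposed behaviour:
    1. the 'in' fraction must not already be below min_in_ratio;
    2. no 'in' OSD may already be at or above full_ratio;
    3. the data held by the 'in' OSDs must fit under full_ratio
       on the OSDs that would remain 'in'.
    """
    if not osds:
        return False
    in_osds = [o for o in osds if o.is_in]
    if len(in_osds) / len(osds) < min_in_ratio:
        return False
    if any(o.used_bytes / o.capacity_bytes >= full_ratio for o in in_osds):
        return False
    remaining = [o for o in in_osds if o is not candidate]
    if not remaining:
        return False
    total_used = sum(o.used_bytes for o in in_osds)
    remaining_capacity = sum(o.capacity_bytes for o in remaining)
    return total_used / remaining_capacity < full_ratio


# Four half-full OSDs: losing one leaves enough room.
roomy = [Osd(50, 100) for _ in range(4)]
roomy_ok = can_mark_out_with_fullness(roomy, roomy[0])

# Four OSDs at 80%: the data would no longer fit on three.
tight = [Osd(80, 100) for _ in range(4)]
tight_ok = can_mark_out_with_fullness(tight, tight[0])
```

The projection treats the cluster as one pool of capacity and ignores CRUSH placement, so it is at best a coarse estimate of whether the remaining OSDs would fill up.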

The general problem of running out of space due to OSD failures has been around for a long time.


Related issues 2 (0 open, 2 closed)

Related to Ceph - Feature #2911: osd: Restrict recovery when the OSD full list is nonempty (Duplicate, 08/06/2012)

Related to Ceph - Bug #15912: An OSD was seen getting ENOSPC even with osd_failsafe_full_ratio passed (Resolved, David Zafman, 05/17/2016)

Actions #1

Updated by David Zafman almost 8 years ago

  • Related to Feature #2911: osd: Restrict recovery when the OSD full list is nonempty added
Actions #2

Updated by David Zafman almost 8 years ago

  • Assignee set to David Zafman
Actions #3

Updated by David Zafman about 7 years ago

  • Related to Bug #15912: An OSD was seen getting ENOSPC even with osd_failsafe_full_ratio passed added
Actions #4

Updated by David Zafman about 7 years ago

  • Status changed from New to Resolved

The fix is included in the fix for #15912:

https://github.com/ceph/ceph/pull/13425
