Bug #50162

Backport to Nautilus of automatic lowering of min_size for repair tasks (osd_allow_recovery_below_min_size)

Added by Rainer Krienke almost 3 years ago. Updated almost 3 years ago.

Status:
Won't Fix
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Recently my Ceph cluster (Nautilus 14.2.16 with 9 hosts, 16 4 TB OSDs per host, erasure-coding 4+2 profile, redundancy defined at the host level) developed two failed disks overnight, on two different hosts.
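For reference, an erasure-coded setup like this is usually created roughly as follows (the profile name and PG counts here are only illustrative; the pool name is the one that appears in the health output below):

  # define a k=4, m=2 erasure-code profile with host-level failure domain
  ceph osd erasure-code-profile set pxa-ec-profile k=4 m=2 crush-failure-domain=host
  # create the erasure-coded pool using that profile (PG counts illustrative)
  ceph osd pool create pxa-ec 1024 1024 erasure pxa-ec-profile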

In the morning, when I first looked at the problem, I saw that Ceph had started rebalancing during the night. The client of this cluster, a Proxmox PVE system for host virtualisation, showed all VMs hanging completely on disk access. So at first I let the rebalancing do its work. After several hours the rebalancing stopped without the cluster becoming healthy.

ceph health detail still showed one PG as inactive, together with a hint about what to do:

HEALTH_WARN Reduced data availability: 1 pg inactive, 1 pg incomplete; 15 daemons have recently crashed; 150 slow ops, oldest one blocked for 26716 sec, daemons [osd.60,osd.67] have slow ops.
PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
pg 36.15b is remapped+incomplete, acting [60,2147483647,23,96,2147483647,36] (reducing pool pxa-ec min_size from 5 may help; search ceph.com/docs for 'incomplete')

I then followed the given advice and reduced min_size of the pool in question from 5 to 4. This helped, and after another round of rebalancing Ceph finally became "healthy" again.
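For completeness, the adjustment amounts to a single command (pool name taken from the health output above); raising min_size back to 5 after recovery works the same way:

  # temporarily lower min_size of the affected pool from 5 to 4
  ceph osd pool set pxa-ec min_size 4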

I learned that, for repair purposes, the default min_size of k+1 should not stop Ceph from repairing as long as the parameter "osd_allow_recovery_below_min_size" is set to true. I checked this parameter and it showed:

  # ceph daemon /var/run/ceph/ceph-mon.*.asok config show | grep osd_allow_recovery_below_min_size
    "osd_allow_recovery_below_min_size": "true",

So Ceph should actually have been able to rebalance completely, but was not, presumably because this feature has not been backported to Nautilus.

It would be great to have a backport of this "automatically lower min_size for repair" feature to Nautilus, because it could help save data stored in Ceph and avoid the potentially risky manual lowering of min_size that is currently needed in situations like mine.

Thanks a lot
Rainer

History

#1 Updated by Nathan Cutler almost 3 years ago

  • Tracker changed from Backport to Support
  • Target version deleted (v14.2.20)
  • % Done set to 0

This needs a pull request ID, or a list of master commits that are requested to be backported.

Now that Pacific is stable, Nautilus is nearing End Of Life, so you might consider upgrading to Octopus to get this feature.

#2 Updated by Greg Farnum almost 3 years ago

  • Tracker changed from Support to Bug
  • Project changed from Ceph to RADOS
  • Regression set to No
  • Severity set to 3 - minor

#3 Updated by Neha Ojha almost 3 years ago

  • Status changed from New to Won't Fix

Nathan Cutler wrote:

This needs a pull request ID, or a list of master commits that are requested to be backported.

Now that Pacific is stable, Nautilus is nearing End Of Life, so you might consider upgrading to Octopus to get this feature.

I agree; we are not backporting any features to Nautilus at this point.
