Bug #3716: recovery should take osd usage into account

Added by Corin Langosch over 11 years ago. Updated over 11 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development

Description

Using argonaut 0.48.2. Yesterday one osd crashed (disk io error) and recovery started as expected. All osds had a usage below 80% at this time. After some time my cluster suddenly stopped working and I noticed it reported HEALTH_ERR because one osd was full (96%). With the help of sjust (thanks!) I was able to bring the cluster back to a working state (imo ceph is really great at handling error conditions and recovery), but here are some observations I'd like to bring some interest/discussion to:
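
For context, getting out of such a full-osd situation typically involves commands along these lines (purely illustrative, not a record of what was actually run; the osd id and reweight value are made up):

    ceph health detail       # shows which osd is reported as full
    ceph osd reweight 3 0.8  # temporarily lower that osd's reweight so data moves off it
    ceph -w                  # watch recovery progress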

1. When an osd fails, the host keeps its weight as it was. So when a host has two osds, each with a weight of 0.5, and one osd fails, the host still has a weight of 1.0. This seems to give the remaining osd a weight of 1, overloading it quite fast (see the sketch after this list). Shouldn't the weight of the host be calculated automatically from only the active osds in it?

2. During recovery/rebalancing it can happen that an osd receives lots of new data before the data that should be moved to other nodes is removed. This can (and did for me) result in overloading an osd which was only about 60% full before rebalancing started. I think ceph should take the disk usage of the osds into account when deciding in which order objects are rebalanced (move data away from full osds first and only then copy new data to them).
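
To illustrate point 1, here is a minimal sketch of a host bucket in a decompiled crushmap (host name, id and osd numbers are made up). The bucket's weight is the sum of its items, so with two osds of weight 0.5 the host carries weight 1.0, and as described above it keeps that weight when one of the two osds fails:

    host storage1 {
            id -2
            alg straw
            hash 0
            item osd.0 weight 0.500
            item osd.1 weight 0.500
    }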

#1 - Updated by Sage Weil over 11 years ago

  • Status changed from New to Closed

#1: This is a matter of adjusting the crush tunables. See http://ceph.com/docs/master/rados/operations/crush-map/?highlight=tunable#tunables (a sketch of the workflow is included below).

#2: let's open a separate feature for this.
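
For reference, a minimal sketch of the usual workflow for applying such tunables (file paths are illustrative; the flags and values are the ones from the tunables documentation and the comment below):

    ceph osd getcrushmap -o /tmp/crush
    crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new
    ceph osd setcrushmap -i /tmp/crush.new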

#2 - Updated by Corin Langosch over 11 years ago

1. My cluster already uses the tuned crushmap ("crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new"). So what I described should not happen?

2. Will you open it, or shall I?
