Bug #3716
recovery should take osd usage into account
Status: Closed
Description
Using argonaut 0.48.2. Yesterday one osd crashed (disk I/O error) and recovery started as expected. All osds were below 80% usage at that time. After some time my cluster suddenly stopped working, and I noticed it reported HEALTH_ERR because one osd was full (96%). With the help of sjust (thanks!) I was able to bring the cluster back to a working state (imo ceph is really great at handling error conditions and recovery), but here are some observations I'd like to raise for discussion:
1. When an osd fails, the host keeps its weight as it was. So when a host has two osds, each with a weight of 0.5, and one osd fails, the host still has a weight of 1.0. This appears to give the remaining osd an effective weight of 1.0, overloading it quite fast. Perhaps the weight of the host should be calculated automatically from only the active osds in it?
2. During recovery/rebalancing an osd can receive lots of new data before the data that should be moved to other nodes is removed. This can (and in my case did) overload an osd that was only around 60% full before rebalancing started. I think ceph should take the disk usage of the osds into account when deciding the order in which objects are rebalanced (move data off full osds first, and only then copy new data to them).
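The effect described in observation #1 can be sketched with a toy weight model. To be clear, this is an illustrative simplification, not the actual CRUSH placement algorithm, and the host/osd names, weights, and cluster size below are made-up assumptions:

```python
# Toy model of observation #1: the host's weight is the sum of its osd
# weights and is NOT recalculated when an osd goes down, so the surviving
# osd inherits the host's full share of placements.

def host_share(host_weight, cluster_weight):
    """Fraction of the cluster's data directed at this host (simplified)."""
    return host_weight / cluster_weight

def osd_share_within_host(active_osd_weights, osd):
    """Fraction of the host's data that lands on one active osd."""
    return active_osd_weights[osd] / sum(active_osd_weights.values())

cluster_weight = 4.0   # assumed: four hosts of weight 1.0 each
host_weight = 1.0      # this host: two osds of weight 0.5 each

# Before the failure: each osd gets half of the host's quarter, i.e. 1/8.
before = host_share(host_weight, cluster_weight) * \
         osd_share_within_host({"osd.0": 0.5, "osd.1": 0.5}, "osd.0")

# After osd.1 fails the host weight stays 1.0, so osd.0 now absorbs the
# host's whole quarter of the cluster's data: its load doubles.
after = host_share(host_weight, cluster_weight) * \
        osd_share_within_host({"osd.0": 0.5}, "osd.0")

print(before, after)  # 0.125 0.25
```

In this toy model, recalculating the host weight from active osds only (0.5 instead of 1.0) would instead shift the failed osd's data across the whole cluster rather than dumping it all on the surviving sibling.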
Updated by Sage Weil over 11 years ago
- Status changed from New to Closed
#1: this is a matter of adjusting the crush tunables. see http://ceph.com/docs/master/rados/operations/crush-map/?highlight=tunable#tunables
#2: let's open a separate feature for this.
Updated by Corin Langosch over 11 years ago
1. My cluster already uses the tuned crushmap "crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new". So what I described should not happen?
2. Will you open it, or shall I?