Bug #4283 (closed): ceph weight of host not recalculated after taking osd out

Added by Corin Langosch about 11 years ago. Updated about 11 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Today I experienced an osd failure and marked that osd (osd.1) out. It was a big osd, so it had a weight of 2. Another, smaller disk on this host (r16439) only has a weight of 0.08. After marking osd.1 out, the weight of the host stayed at 2.08. This quickly caused the second osd to grow from 40% usage to 80%, at which point I marked it out too.

dumped osdmap tree epoch 11753
# id    weight      type name         up/down  reweight
-1      10.5349     pool default
-3      10.5349       rack rack1
-2      1.39999         host r15714
0       0.699997          osd.0       up       1
4       0.699997          osd.4       up       1
-4      2.31999         host r15717
7       0.319992          osd.7       up       1
9       2                 osd.9       up       1
-5      1.39999         host r15791
2       0.699997          osd.2       up       1
5       0.699997          osd.5       up       1
-6      0.639999        host r15836
3       0.319992          osd.3       up       1
6       0.319992          osd.6       up       1
-7      2.07999         host r16439
8       0.0799866         osd.8       up       1
1       2                 osd.1       up       0
-8      2.215           host r16440
12      0.0749969         osd.12      up       1
13      0.139999          osd.13      up       1
10      2                 osd.10      up       1
-9      0.479965        host r16441
14      0.0799866         osd.14      up       1
15      0.319992          osd.15      up       1
16      0.0799866         osd.16      up       1

I'd really like to suggest recalculating the weight of the host whenever the weight of one of its osds changes. Otherwise osds on the same host can easily get overloaded and cause the whole cluster to hang.
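For illustration, a minimal manual workaround along these lines, assuming the failed osd should stop receiving data entirely (the target weight of 0 is an assumption of this sketch): dropping the out osd's CRUSH weight makes the host bucket's weight shrink, so the displaced data spreads across the whole cluster instead of piling onto osd.8.

# Drop the failed osd's CRUSH weight to 0 so the host bucket (r16439)
# sums to roughly 0.08 instead of 2.08 and data rebalances cluster-wide.
ceph osd crush reweight osd.1 0

# Verify that the host weight was recalculated.
ceph osd tree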

Actions #1

Updated by Sage Weil about 11 years ago

  • Status changed from New to Closed

This is a known problem with the old (and unfortunately still default) CRUSH behavior. You can fix this with a command like

ceph osd crush tunables optimal

but be warned that a bunch of data will move around, and only very recent kernels will understand how to find the data. See http://ceph.com/docs/master/rados/operations/crush-map/#tunables

Note that this will be the default placement for new clusters very soon... probably starting with cuttlefish (~May).
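For reference, a short sketch of how to inspect the tunables currently in effect before switching; the /tmp file names are arbitrary placeholders.

# Export and decompile the current CRUSH map; any non-default tunable
# settings appear near the top of the decompiled text.
ceph osd getcrushmap -o /tmp/crushmap
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt

# Then apply the new behavior as described above.
ceph osd crush tunables optimal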

Actions #2

Updated by Corin Langosch about 11 years ago

Thanks for the quick info. Seems like I have to wait for bobtail to use them, right? I'm already using the other three tunables.

Actions #3

Updated by Greg Farnum about 11 years ago

@Sage Weil, does the CRUSH tunables work change whether host weights are recalculated? In particular, doesn't this change mean that more data is going to move around more often?

Actions #4

Updated by Sage Weil about 11 years ago

Greg: it doesn't, no... the weights are unchanged. But the behavior he is seeing where data from an out osd is shifted to other osds in the same subtree (host) is a result of the old (non-tuned) CRUSH behavior. It's the main thing the new tunables fix.
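A rough way to observe this offline, sketched under the assumption that the map was exported as in the earlier example and that rule 0 with two replicas matches the affected pool:

# Simulate placement with osd.1 forced to weight 0 (i.e. out) and show
# per-osd utilization; comparing the output before and after changing the
# tunables shows whether the displaced data stays inside the same host.
crushtool -i /tmp/crushmap --test --show-utilization --rule 0 --num-rep 2 --weight 1 0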

Actions #5

Updated by Greg Farnum about 11 years ago

Ah, right. I was thinking we'd have a problem where all new data maps to the small one, but of course as long as the big disk remains out, it'll get a bunch of the bucket-allocated data and then retry from higher up the tree.

Which leads me to ask whether we have a problem when removing nodes: I don't think the higher buckets get reweighted automatically. But maybe they do?
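For what it's worth, a fully removed item behaves differently here from an osd that is merely out: deleting the item from the CRUSH map does subtract its weight from the containing host bucket. A minimal sketch, assuming osd.1 is the node being removed:

# Remove the item from the CRUSH map entirely; the host bucket's weight
# drops by the removed item's weight, unlike a plain "ceph osd out".
ceph osd crush remove osd.1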
