Bug #3785

ceph: default crush rule does not suit multi-OSD deployments

Added by Ian Colle almost 7 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
Start date: 01/10/2013
Due date:
% Done: 0%
Spent time:
Source: Community (user)
Tags:
Backport:
Regression: No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Version: 0.48.2-0ubuntu2~cloud0

Our Ceph deployments typically involve multiple OSDs per host with no disk redundancy. However, the default crush rules appear to distribute by OSD, not by host, which I believe will not prevent replicas from landing on the same host.

I've been working around this by updating the crush rules as follows and installing the resulting crushmap in the cluster, but since we aim for fully automated deployment (using Juju and MaaS) this is suboptimal.

--- crushmap.txt	2013-01-10 20:33:21.265809301 +0000
+++ crushmap.new	2013-01-10 20:32:49.496745778 +0000
@@ -104,7 +104,7 @@
 	min_size 1
 	max_size 10
 	step take default
-	step choose firstn 0 type osd
+	step chooseleaf firstn 0 type host
 	step emit
 }
 rule metadata {
@@ -113,7 +113,7 @@
 	min_size 1
 	max_size 10
 	step take default
-	step choose firstn 0 type osd
+	step chooseleaf firstn 0 type host
 	step emit
 }
 rule rbd {
@@ -122,7 +122,7 @@
 	min_size 1
 	max_size 10
 	step take default
-	step choose firstn 0 type osd
+	step chooseleaf firstn 0 type host
 	step emit
 }

https://bugs.launchpad.net/cloud-archive/+bug/1098320

Associated revisions

Revision 7ea5d84f (diff)
Added by Sage Weil almost 7 years ago

osdmap: spread replicas across hosts with default crush map

This is more often the case than not, and we don't have a good way to
magically know what size of cluster the user will be creating. Better to
err on the side of doing the right thing for more people.

Fixes: #3785
Signed-off-by: Sage Weil <>
Reviewed-by: Greg Farnum <>

Revision 015a454a (diff)
Added by Sage Weil almost 7 years ago

osdmap: spread replicas across hosts with default crush map

This is more often the case than not, and we don't have a good way to
magically know what size of cluster the user will be creating. Better to
err on the side of doing the right thing for more people.

Fixes: #3785
Signed-off-by: Sage Weil <>
Reviewed-by: Greg Farnum <>
(cherry picked from commit 7ea5d84fa3d0ed3db61eea7eb9fa8dbee53244b6)

Revision c236a51a (diff)
Added by Sage Weil almost 7 years ago

osdmap: make replica separate in default crush map configurable

Add 'osd crush chooseleaf type' option to control what the default
CRUSH rule separates replicas across. Default to 1 (host), and set it
to 0 in vstart.sh.

Fixes: #3785
Signed-off-by: Sage Weil <>
Reviewed-by: Greg Farnum <>
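
A sketch of how this option would be set in ceph.conf (option name and semantics per the commit message above; the value shown mirrors what vstart.sh does):

[global]
    ; 1 = separate replicas across hosts (the default)
    ; 0 = separate replicas across OSDs (for single-host test clusters, e.g. vstart.sh)
    osd crush chooseleaf type = 0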

Revision 6008b1d8 (diff)
Added by Sage Weil over 6 years ago

osdmap: make replica separate in default crush map configurable

Add 'osd crush chooseleaf type' option to control what the default
CRUSH rule separates replicas across. Default to 1 (host), and set it
to 0 in vstart.sh.

Fixes: #3785
Signed-off-by: Sage Weil <>
Reviewed-by: Greg Farnum <>
(cherry picked from commit c236a51a8040508ee893e4c64b206e40f9459a62)

History

#1 Updated by Ian Colle almost 7 years ago

  • Assignee set to Sage Weil
  • Priority changed from Normal to High

#2 Updated by Greg Farnum almost 7 years ago

The issue here is that CRUSH maps which behave well on multi-host deployments behave quite poorly on one or two host deployments. The mkcephfs build path actually does handle this fairly politely, though, and I think (perhaps erroneously) that ceph-deploy is optimized for larger clusters.
Which deployment mechanism are you using?

#3 Updated by Anonymous almost 7 years ago

I agree with Ian; I have seen very bad things happen when CRUSH chooses two OSDs on one host rather than distributing to different hosts.

It is nice to know that mkcephfs has a mechanism to balance the load so this won't happen. But this is a scalable product; customers are supposed to use 'ceph osd add' to add more OSDs to the cluster.

Does 'ceph osd add' take crush host balancing into consideration when doing an add? Do we have instructions for handling that manually?

I think there should be a default rule that says data replicas cannot be written to the same host as the original, no matter how the OSD has been added.

just my 2 cents... :-)

#4 Updated by Anonymous almost 7 years ago

This comment should have been in bug 3789

upping the memory on these VMs from 512M to 2G

Since it appears it was a resource problem, I will close this bug.

Do we have any mechanism that I am missing that notifies the end user when crashes like this occur, so they can go in and fix their cluster before a critical number of resources have failed?

#5 Updated by Anonymous almost 7 years ago

  • Status changed from New to Won't Fix

This comment should have been in bug 3789

Caused by a lack of resources on the system. Have increased the memory from 512M to 2G; will retest.

#6 Updated by Dan Mick almost 7 years ago

I think maybe Deb's comments and closure were meant for another bug (perhaps 3789?)

#7 Updated by Anonymous almost 7 years ago

  • Status changed from Won't Fix to New

dang! wrong bug. opening this one back up.
sorry all!

#8 Updated by Sage Weil almost 7 years ago

  • Status changed from New to Need Review
  • Assignee changed from Sage Weil to Greg Farnum

wip-3785

#9 Updated by Greg Farnum almost 7 years ago

Looks good to me. What branches do we want to cherry-pick it onto?

#10 Updated by Sage Weil almost 7 years ago

good question. let's start with bobtail.

#11 Updated by Greg Farnum almost 7 years ago

  • Status changed from Need Review to Resolved

Merged to master in 7ea5d84fa3d0ed3db61eea7eb9fa8dbee53244b6 and cherry-picked to bobtail in commit:503917f0049d297218b1247dc0793980c39195b3.

#12 Updated by Sage Weil almost 7 years ago

  • Status changed from Resolved to Need Review

der, broke vstart. can you review wip-3785?

#13 Updated by Greg Farnum almost 7 years ago

sigh

This also looks good to me, and I like it better (should have suggested this the first time around). But now I've gotten scared again; have you run this outside of vstart? :)

#14 Updated by Sage Weil almost 7 years ago

Nope... which leads me to realize that the setting needs to go into teuthology's ceph.conf. Doing that now, and then I'll run it through the suite.

#15 Updated by Sage Weil almost 7 years ago

  • Status changed from Need Review to Resolved

commit:f358cb1d2b0a3a78bf59c4fd085906fcb5541bbe

#16 Updated by Greg Farnum almost 7 years ago

I presume we're planning to backport this to bobtail after it passes some nights of testing? Maybe we should leave the bug in "testing" until then (or we get our "Needs Backport" status!).
