Bug #3785
closed
ceph: default crush rule does not suit multi-OSD deployments
Added by Ian Colle over 11 years ago.
Updated over 11 years ago.
Description
Version: 0.48.2-0ubuntu2~cloud0
Our Ceph deployments typically involve multiple OSDs per host with no disk redundancy. However, the default CRUSH rules appear to distribute by OSD, not by host, which I believe does not prevent replicas from landing on the same host.
I've been working around this by updating the crush rules as follows and installing the resulting crushmap in the cluster, but since we aim for fully automated deployment (using Juju and MaaS) this is suboptimal.
--- crushmap.txt	2013-01-10 20:33:21.265809301 +0000
+++ crushmap.new	2013-01-10 20:32:49.496745778 +0000
@@ -104,7 +104,7 @@
 	min_size 1
 	max_size 10
 	step take default
-	step choose firstn 0 type osd
+	step chooseleaf firstn 0 type host
 	step emit
 }
 rule metadata {
@@ -113,7 +113,7 @@
 	min_size 1
 	max_size 10
 	step take default
-	step choose firstn 0 type osd
+	step chooseleaf firstn 0 type host
 	step emit
 }
 rule rbd {
@@ -122,7 +122,7 @@
 	min_size 1
 	max_size 10
 	step take default
-	step choose firstn 0 type osd
+	step chooseleaf firstn 0 type host
 	step emit
 }
https://bugs.launchpad.net/cloud-archive/+bug/1098320
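For reference, the manual workaround described above can be scripted with the standard `ceph` and `crushtool` commands. This is a sketch only: it assumes a running cluster with an admin keyring, and the file names are illustrative.

```shell
#!/bin/sh
# Sketch of the workaround: export, decompile, edit, recompile, and
# inject the crush map. Assumes admin access to a running cluster;
# file names are illustrative.
set -e

ceph osd getcrushmap -o crushmap.bin        # export the compiled map
crushtool -d crushmap.bin -o crushmap.txt   # decompile to editable text

# Switch per-OSD placement to per-host placement in every rule
sed -i 's/step choose firstn 0 type osd/step chooseleaf firstn 0 type host/' crushmap.txt

crushtool -c crushmap.txt -o crushmap.new   # recompile the edited map
ceph osd setcrushmap -i crushmap.new        # inject it into the cluster
```

Automating this from Juju/MaaS is still awkward, which is the point of the bug: the default map should not need this edit.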
- Assignee set to Sage Weil
- Priority changed from Normal to High
The issue here is that CRUSH maps which behave well on multi-host deployments behave quite poorly on one or two host deployments. The mkcephfs build path actually does handle this fairly politely, though, and I think (perhaps erroneously) that ceph-deploy is optimized for larger clusters.
Which deployment mechanism are you using?
I agree with Ian; I have seen very bad things happen when CRUSH chooses two OSDs on one host rather than distributing replicas to different hosts.
It is nice to know that mkcephfs has a mechanism to balance the load so this won't happen, but this is a scalable product. Customers are supposed to use 'ceph osd add' to add more OSDs to the cluster.
Does 'ceph osd add' take crush host balancing into consideration when doing an add? Do we have instructions for handling that manually?
I think there should be a default rule that says data replicas cannot be written to the same host as the original, no matter how the OSD has been added.
just my 2 cents... :-)
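The failure mode described in the comments can be shown with a toy simulation. This is plain Python, not CRUSH itself; the host/OSD layout and function names are made up for illustration. Sampling leaves (OSDs) directly, as `step choose firstn 0 type osd` does, can place both replicas on one host, while sampling distinct hosts first, as `step chooseleaf firstn 0 type host` does, cannot.

```python
import random

# Toy layout: 3 hosts, 2 OSDs each (illustrative, not a real cluster)
HOSTS = {"host-a": ["osd.0", "osd.1"],
         "host-b": ["osd.2", "osd.3"],
         "host-c": ["osd.4", "osd.5"]}

def host_of(osd):
    return next(h for h, osds in HOSTS.items() if osd in osds)

def pick_by_osd(n, rng):
    """Roughly 'step choose firstn 0 type osd': sample OSDs directly."""
    all_osds = [o for osds in HOSTS.values() for o in osds]
    return rng.sample(all_osds, n)

def pick_by_host(n, rng):
    """Roughly 'step chooseleaf firstn 0 type host': distinct hosts first."""
    hosts = rng.sample(list(HOSTS), n)
    return [rng.choice(HOSTS[h]) for h in hosts]

rng = random.Random(42)

# Count trials where both replicas of a 2-way placement share a host
same_host = sum(len({host_of(o) for o in pick_by_osd(2, rng)}) == 1
                for _ in range(10_000))
print(same_host)  # nonzero: per-OSD placement can co-locate replicas

# Per-host placement never co-locates replicas
assert all(len({host_of(o) for o in pick_by_host(2, rng)}) == 2
           for _ in range(10_000))
```

With this layout, 3 of the 15 possible OSD pairs share a host, so roughly a fifth of per-OSD placements lose both replicas if that host dies.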
This comment should have been in bug 3789
Upping the memory on these VMs from 512M to 2G.
Since it appears this was a resource problem, I will close this bug.
Do we have any mechanism that I am missing that notifies the end user when crashes like this occur, so they can go in and fix their cluster before a critical number of resources have failed?
- Status changed from New to Won't Fix
This comment should have been in bug 3789
Caused by a lack of resources on the system.
I have increased the memory from 512M to 2G and will retest.
I think maybe Deb's comments and closure were meant for another bug (perhaps 3789?)
- Status changed from Won't Fix to New
dang! wrong bug. opening this one back up.
sorry all!
- Status changed from New to Fix Under Review
- Assignee changed from Sage Weil to Greg Farnum
Looks good to me. Which branches do we want to cherry-pick it onto?
good question. let's start with bobtail.
- Status changed from Fix Under Review to Resolved
- Status changed from Resolved to Fix Under Review
der, broke vstart. can you review wip-3785?
sigh
This also looks good to me, and I like it better (should have suggested this the first time around). But now I've gotten scared again; have you run this outside of vstart? :)
Nope... which leads me to realize that the setting needs to go into teuthology's ceph.conf. Doing that now, and then I'll run it through the suite.
- Status changed from Fix Under Review to Resolved
commit:f358cb1d2b0a3a78bf59c4fd085906fcb5541bbe
I presume we're planning to backport this to bobtail after it passes some nights of testing? Maybe we should leave the bug in "testing" until then (or we get our "Needs Backport" status!).