Bug #2047
crush: with a rack->host->device hierarchy, several down devices are likely to cause bad mappings
Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
See http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/5166
Sage says the cause is that down devices only trigger local retries on the same host.
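A minimal sketch of the failure mode, assuming a toy rack->host->device layout (3 hosts of 3 devices, an illustrative hash, a local retry budget of 3, and a total budget of 20 to mirror the ftotal < 20 cap discussed below) rather than Ceph's actual crush_choose(): it contrasts retrying only among the chosen host's devices with restarting the whole descent when every device on one host is marked down.

/*
 * Toy model only, not Ceph code: host/device counts, the hash, and the
 * retry budgets are illustrative assumptions.
 */
#include <stdio.h>

#define HOSTS          3
#define DEVS_PER_HOST  3
#define LOCAL_TRIES    3   /* assumed local retry budget */
#define TOTAL_TRIES    20  /* assumed total budget; mirrors ftotal < 20 in the patch */

static int dev_down[HOSTS][DEVS_PER_HOST];  /* 1 = device is out */

/* Small deterministic integer hash standing in for CRUSH's hash. */
static unsigned toy_hash(unsigned x, unsigned r)
{
    x ^= r * 2654435761u;
    x ^= x >> 13;
    x *= 2246822519u;
    return x ^ (x >> 16);
}

/* Retry only among the devices of the first-chosen host. */
static int map_local_retry(unsigned pgid)
{
    unsigned host = toy_hash(pgid, 0) % HOSTS;
    for (unsigned r = 0; r < LOCAL_TRIES; r++) {
        unsigned dev = toy_hash(pgid, r + 1) % DEVS_PER_HOST;
        if (!dev_down[host][dev])
            return (int)(host * DEVS_PER_HOST + dev);
    }
    return -1;  /* bad mapping: never escaped the dead host */
}

/* Restart the whole descent (pick a new host) on every rejection. */
static int map_global_retry(unsigned pgid)
{
    for (unsigned r = 0; r < TOTAL_TRIES; r++) {
        unsigned host = toy_hash(pgid, 100 + r) % HOSTS;
        unsigned dev  = toy_hash(pgid, 200 + r) % DEVS_PER_HOST;
        if (!dev_down[host][dev])
            return (int)(host * DEVS_PER_HOST + dev);
    }
    return -1;
}

int main(void)
{
    /* Take every device on host 1 down. */
    for (int d = 0; d < DEVS_PER_HOST; d++)
        dev_down[1][d] = 1;

    int local_fail = 0, global_fail = 0;
    for (unsigned pg = 0; pg < 1000; pg++) {
        if (map_local_retry(pg) < 0)
            local_fail++;
        if (map_global_retry(pg) < 0)
            global_fail++;
    }
    printf("local-only retries: %d/1000 unmapped\n", local_fail);
    printf("global re-descent:  %d/1000 unmapped\n", global_fail);
    return 0;
}

In this toy model the local-only strategy leaves every input that hashes to the dead host unmapped, while the global re-descent recovers all of them within the 20-attempt budget.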
Related issues
History
#1 Updated by Sage Weil over 11 years ago
- Status changed from New to Duplicate
#2 Updated by Sage Weil over 11 years ago
- Status changed from Duplicate to 12
#3 Updated by Sage Weil over 11 years ago
- Priority changed from Normal to High
#4 Updated by Sage Weil over 11 years ago
FWIW, dropping the local search behavior fixes this bad behavior. The question is what the local search was originally there for, and how we address that now. The bit I'm worried about is the exhaustive search fall-back... probably we need something like that at the global level?
diff --git a/src/crush/mapper.c b/src/crush/mapper.c
index df195c7..77370a9 100644
--- a/src/crush/mapper.c
+++ b/src/crush/mapper.c
@@ -421,13 +421,7 @@
 reject:
 			ftotal++;
 			flocal++;
-			if (collide && flocal < 3)
-				/* retry locally a few times */
-				retry_bucket = 1;
-			else if (flocal <= in->size + orig_tries)
-				/* exhaustive bucket search */
-				retry_bucket = 1;
-			else if (ftotal < 20)
+			if (ftotal < 20)
 				/* then retry descent */
 				retry_descent = 1;
 			else
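As written, the patch removes both retry_bucket paths (the capped local retry on collisions and the exhaustive in-bucket search), so every rejection or collision falls through to a full re-descent bounded by ftotal < 20; the open question above is whether an analogue of the exhaustive fall-back is still needed at that global level.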
#5 Updated by Sage Weil over 11 years ago
- Priority changed from High to Normal
#6 Updated by Sage Weil over 11 years ago
- Status changed from 12 to Resolved
#7 Updated by Greg Farnum over 6 years ago
- Project changed from Ceph to RADOS
- Category deleted (10)