Bug #2047

crush: with a rack->host->device hierarchy, several down devices are likely to cause bad mappings

Added by Josh Durgin about 12 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

See http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/5166

Sage says the cause is that down devices trigger only local retries on the same host.
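
To make the failure mode concrete, here is a toy model of it. It is not CRUSH itself: the fixed 3-host, 2-device layout, the sequential probing, and the names `pick_in_host` and `pick_with_descent` are all assumptions made for illustration. It only shows why retrying inside one host cannot help when that host's devices are down, while re-descending to another host can.

```c
#include <stdbool.h>

/* Hypothetical toy layout: 3 hosts with 2 devices each (rack level omitted). */
#define HOSTS 3
#define DEVS_PER_HOST 2

/*
 * Local retry only: probe up to `tries` devices inside a single host.
 * Returns the index of the first device that is up, or -1 if none was found.
 * Sequential probing stands in for CRUSH's pseudo-random choice.
 */
int pick_in_host(bool down[HOSTS][DEVS_PER_HOST], int host, int start, int tries)
{
    for (int t = 0; t < tries; t++) {
        int dev = (start + t) % DEVS_PER_HOST;
        if (!down[host][dev])
            return dev;
    }
    return -1;
}

/*
 * Retry by descent: when a host yields no usable device, move on to the
 * next host instead of staying put. Returns the chosen host index and
 * stores the device index in *dev_out, or returns -1 if every device is down.
 */
int pick_with_descent(bool down[HOSTS][DEVS_PER_HOST], int first_host, int *dev_out)
{
    for (int h = 0; h < HOSTS; h++) {
        int host = (first_host + h) % HOSTS;
        int dev = pick_in_host(down, host, 0, DEVS_PER_HOST);
        if (dev >= 0) {
            *dev_out = dev;
            return host;
        }
    }
    return -1;
}
```

With both devices on host 0 marked down, `pick_in_host` on host 0 fails however many tries it is given, whereas `pick_with_descent` starting from host 0 moves on to host 1.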


Related issues

Related to RADOS - Bug #2214: crush: pgs only mapped to 2 devices with replication level 3 (Resolved, 03/26/2012)
Duplicated by RADOS - Bug #1738: bad crushmap behavior (Duplicate, 11/18/2011)
Duplicated by Ceph - Bug #2210: osd: some PGs remains remapped or degraded (Duplicate, 03/25/2012)

History

#1 Updated by Sage Weil about 12 years ago

  • Status changed from New to Duplicate

#2 Updated by Sage Weil about 12 years ago

  • Status changed from Duplicate to 12

#3 Updated by Sage Weil about 12 years ago

  • Priority changed from Normal to High

#4 Updated by Sage Weil about 12 years ago

fwiw dropping the local search behavior fixes this bad behavior. the question is what the local search was originally for, and how we address that now. the bit i'm worried about is the exhaustive search fall-back... we probably need something like that at the global level?


diff --git a/src/crush/mapper.c b/src/crush/mapper.c
index df195c7..77370a9 100644
--- a/src/crush/mapper.c
+++ b/src/crush/mapper.c
@@ -421,13 +421,7 @@ reject:
                                        ftotal++;
                                        flocal++;

-                                       if (collide && flocal < 3)
-                                               /* retry locally a few times */
-                                               retry_bucket = 1;
-                                       else if (flocal <= in->size + orig_tries)
-                                               /* exhaustive bucket search */
-                                               retry_bucket = 1;
-                                       else if (ftotal < 20)
+                                       if (ftotal < 20)
                                                /* then retry descent */
                                                retry_descent = 1;
                                        else

#5 Updated by Sage Weil almost 12 years ago

  • Priority changed from High to Normal

#6 Updated by Sage Weil over 11 years ago

  • Status changed from 12 to Resolved

#7 Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (10)
