Bug #62213
Updated by Samuel Just 10 months ago
The motivating example was:

<pre>
rule ecpool-86 {
	id 86
	type erasure
	step set_chooseleaf_tries 5
	step set_choose_tries 100
	step take default class hdd
	step choose indep 4 type host
	step chooseleaf indep 4 type osd
	step emit
}
</pre>

If all of the OSDs in a host are marked out, crush_choose_indep on a leaf bucket with recurse_to_leaf=1 (chooseleaf rather than choose) will populate out2 before the is_out check:

<pre>
			if (recurse_to_leaf) {
				if (item < 0) {
					crush_choose_indep(
						map, work,
						map->buckets[-1-item],
						weight, weight_max,
						x, 1, numrep, 0,
						out2, rep,
						recurse_tries, 0, 0,
						NULL, r, choose_args);
					if (out2 && out2[rep] == CRUSH_ITEM_NONE) {
						/* placed nothing; no leaf */
						break;
					}
				} else if (out2) {
					/* we already have a leaf! */
					out2[rep] = item;
				}
			}

			/* out? */
			if (itemtype == 0 &&
			    is_out(map, weight, weight_max, item, x))
				break;

			/* yay! */
			out[rep] = item;
			left--;
			break;
</pre>

If it exhausts its retries (ftotal >= tries), out OSDs placed into out2 will still be there upon return to the caller, resulting in out OSDs in the mapped set. chooseleaf with any type other than osd won't trigger this bug. This issue can be worked around by using choose rather than chooseleaf (the behavior should otherwise be the same).

Note, it's hard to predict the impact of this bug in a real cluster. There are relatively few clusters with multiple choose steps like this, and if the out OSDs are also down, they'll be filtered out later anyway. Nevertheless, it's probably worth avoiding, as CRUSH isn't supposed to be able to map out OSDs.

This is a bug in CRUSH itself, so we can't simply patch the behavior; it'll need a tunable gated on client capability, etc. A short-term mitigation might be to reject the creation of crush rules like this with an error message to use the choose variant instead, along with a warning for any pre-existing rules like this.

See <branchname> for a unit test reproducer.
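For reference, the choose-based workaround mentioned above would amount to replacing the final chooseleaf step of the motivating rule with its choose equivalent, e.g. (same rule otherwise; the name and id are carried over from the example):

<pre>
rule ecpool-86 {
	id 86
	type erasure
	step set_chooseleaf_tries 5
	step set_choose_tries 100
	step take default class hdd
	step choose indep 4 type host
	step choose indep 4 type osd
	step emit
}
</pre>

Since the chooseleaf step's type is already osd, the choose variant selects the same leaves without the recurse_to_leaf path that populates out2.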