Bug #62213
Updated by Samuel Just 10 months ago
The motivating example was:
rule ecpool-86 {
id 86
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 4 type host
step chooseleaf indep 4 type osd
step emit
}
If all of the OSDs in a host are marked out, crush_choose_indep on a leaf bucket with recurse_to_leaf=1 (chooseleaf rather than choose) will populate out2 before the is_out check:
<pre>
if (recurse_to_leaf) {
if (item < 0) {
crush_choose_indep(
map,
work,
map->buckets[-1-item],
weight, weight_max,
x, 1, numrep, 0,
out2, rep,
recurse_tries, 0,
0, NULL, r, choose_args);
if (out2 && out2[rep] == CRUSH_ITEM_NONE) {
/* placed nothing; no leaf */
break;
}
} else if (out2) {
/* we already have a leaf! */
out2[rep] = item;
}
}
/* out? */
if (itemtype == 0 &&
is_out(map, weight, weight_max, item, x))
break;
/* yay! */
out[rep] = item;
left--;
break;
</pre>
If it exhausts retries (ftotal >= tries), out osds placed into out2 will still be there upon return to the caller resulting in out osds in the mapped set.
chooseleaf with any type other than osd won't trigger this bug.
This issue can be worked around by using choose rather than chooseleaf (the behavior should otherwise be the same).
Note, this is a bug in CRUSH itself, so we can't simply patch the behavior. It'll need a tunable gated on client capability, etc.
A short term mitigation might be to reject the creation of crush rules like this with an error message to use the choose variant instead along with a warning for any pre-existing rules like this.
See <branchname> for a unit test reproducer.