Bug #62214
crush: a crush rule with multiple choose steps will not retry an earlier step if it chose a bucket with no in OSDs
Description
Debatably, this isn't a bug, but it does appear to be at least a counterintuitive behavior.
In the case of a crush rule like
rule replicated_rule_1 {
    ...
    step take default class hdd
    step chooseleaf firstn 3 type host
    step emit
}
We expect that if all of the OSDs on a particular host are marked out, mappings including those OSDs would end up on another host (provided that there are enough hosts). Indeed, that's how it works.
Consider instead a rule with two choose steps like
rule replicated_rule_1 {
    ...
    step take default class hdd
    step choose firstn 3 type host
    step choose firstn 1 type osd
    step emit
}
If we mark a single OSD out, PGs including that OSD would remap to another OSD on the same host. However, if all of the OSDs on a host are marked out, PGs mapped to that host will not be remapped and will be stuck degraded until the host is actually removed from the hierarchy or reweighted to 0.
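The difference between the two rules can be sketched with a toy model. This is not the CRUSH algorithm: the cluster layout, the function names, and the SHA-256 ordering (a stand-in for CRUSH's hashing and retry logic) are all assumptions made for illustration. It only models the one property at issue: whether a failure to find an in OSD under a host can cause a different host to be chosen.

```python
import hashlib

# Hypothetical toy cluster: host name -> OSD ids on that host.
HOSTS = {
    "host-a": [0, 1],
    "host-b": [2, 3],
    "host-c": [4, 5],
    "host-d": [6, 7],
}


def ordered(pg, items):
    """Return items in a per-PG pseudo-random order (stand-in for CRUSH retries)."""
    def score(item):
        digest = hashlib.sha256(f"{pg}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(items, key=score)


def host_of(osd):
    """Look up which host an OSD lives on."""
    return next(h for h, osds in HOSTS.items() if osd in osds)


def chooseleaf_firstn(pg, n, out):
    """Model of `chooseleaf firstn n type host`: a host with no in OSD is
    rejected, and the search moves on to the next candidate host."""
    result = []
    for host in ordered(pg, HOSTS):
        in_osds = [o for o in ordered(pg, HOSTS[host]) if o not in out]
        if in_osds and len(result) < n:
            result.append(in_osds[0])
    return result


def choose_then_choose_firstn(pg, n, out):
    """Model of `choose firstn n type host` + `choose firstn 1 type osd`:
    the first step fixes the hosts; the second step can only retry inside
    the host it was handed."""
    hosts = ordered(pg, HOSTS)[:n]  # step 1 never looks at OSD state
    result = []
    for host in hosts:
        in_osds = [o for o in ordered(pg, HOSTS[host]) if o not in out]
        if in_osds:
            result.append(in_osds[0])  # step 2 retried within the host
        # else: nothing is emitted for this host; step 1 is not retried
    return result


pg = 7
healthy = choose_then_choose_firstn(pg, 3, out=set())
first_host = host_of(healthy[0])

print(choose_then_choose_firstn(pg, 3, {healthy[0]}))            # still 3 OSDs
print(choose_then_choose_firstn(pg, 3, set(HOSTS[first_host])))  # only 2 OSDs
print(chooseleaf_firstn(pg, 3, set(HOSTS[first_host])))          # 3 OSDs, new host
```

In this model the two-step rule returns a short mapping once every OSD on a chosen host is out, even though a fourth host is available, which corresponds to the stuck-degraded PGs described above.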
The motivating example is actually wide EC codes on small clusters:
rule ecpool-86 {
    id 86
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step choose indep 4 type host
    step chooseleaf indep 4 type osd
    step emit
}
In the above case, once a PG has shards on a host, those positions won't remap to another host unless the host is removed or reweighted to 0. See https://github.com/athanatos/ceph/tree/sjust/wip-ec-86-test-62213-62214 for some unit test examples.
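A rough sketch of the indep case, again as a toy model rather than the real CRUSH code. The 5x5 cluster, the function names, and the SHA-256 ordering are assumptions for illustration. The model keeps each shard position stable, replaces an out OSD with a spare from the same host when one exists, and otherwise leaves a hole (None) because the host choice is never revisited.

```python
import hashlib

# Hypothetical toy cluster: 5 hosts with 5 OSDs each (ids 0..24).
HOSTS = {f"host-{i}": list(range(i * 5, i * 5 + 5)) for i in range(5)}


def ordered(pg, items):
    """Return items in a per-PG pseudo-random order (stand-in for CRUSH hashing)."""
    def score(item):
        digest = hashlib.sha256(f"{pg}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(items, key=score)


def indep_slots(pg, osds, n, out):
    """Fill n positions from one host. An out OSD is replaced in place by a
    spare from the same host, or by None (a hole) if no spare is in."""
    order = ordered(pg, osds)
    spares = [o for o in order[n:] if o not in out]
    slots = []
    for osd in order[:n]:
        if osd in out:
            osd = spares.pop(0) if spares else None
        slots.append(osd)
    return slots


def ec_two_step(pg, out):
    """Model of `choose indep 4 type host` followed by a 4-way indep OSD
    choice under each host. The host list is fixed before OSD state is
    consulted, so the fifth host is never used as a fallback."""
    hosts = ordered(pg, HOSTS)[:4]
    shards = []
    for host in hosts:
        shards.extend(indep_slots(pg, HOSTS[host], 4, out))
    return shards


pg = 1
healthy = ec_two_step(pg, set())
first_host = next(h for h, osds in HOSTS.items() if healthy[0] in osds)

print(ec_two_step(pg, {healthy[0]}))            # position 0 moves within the host
print(ec_two_step(pg, set(HOSTS[first_host])))  # first four positions are None
```

Here the unused fifth host plays the role of the spare capacity that the positions would be expected to remap to, but do not.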