Bug #62214

Updated by Samuel Just 10 months ago

Debatably, this isn't a bug, but it is at least counterintuitive behavior.

In the case of a CRUSH rule like

 <pre> 
 rule replicated_rule_1 { 
     ... 
     step take default class hdd 
     step chooseleaf firstn 3 type host 
     step emit 
 } 
 </pre> 

We expect that if all of the OSDs on a particular host are marked out, mappings including those OSDs would end up on another host (provided that there are enough hosts). Indeed, that's how it works.
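
The retry behavior can be illustrated with a toy model. This is a deterministic simplification for illustration only, not the real CRUSH hashing/straw2 algorithm; all host and OSD names below are hypothetical.

```python
# Toy model of "chooseleaf firstn 3 type host": if every OSD on a
# chosen host is out (weight 0), the step retries and descends into a
# different host instead. Deterministic simplification, not real CRUSH.

def chooseleaf_firstn(hosts, n):
    """Pick one in (weight > 0) OSD from each of up to n distinct hosts,
    skipping hosts whose OSDs are all out."""
    picked = []
    for host, osds in hosts.items():
        if len(picked) == n:
            break
        alive = [osd for osd, weight in osds if weight > 0]
        if alive:              # retry semantics: fully-out hosts are skipped
            picked.append(alive[0])
    return picked

hosts = {
    "host-a": [("osd.0", 1.0), ("osd.1", 1.0)],
    "host-b": [("osd.2", 1.0), ("osd.3", 1.0)],
    "host-c": [("osd.4", 1.0), ("osd.5", 1.0)],
    "host-d": [("osd.6", 1.0), ("osd.7", 1.0)],
}
before = chooseleaf_firstn(hosts, 3)   # ['osd.0', 'osd.2', 'osd.4']

# Mark every OSD on host-b out: the affected position moves to host-d.
hosts["host-b"] = [("osd.2", 0.0), ("osd.3", 0.0)]
after = chooseleaf_firstn(hosts, 3)    # ['osd.0', 'osd.4', 'osd.6']
```

Because `chooseleaf` performs the descent from host to OSD itself, a failure to find any usable OSD under a host triggers a retry with a different host.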

 Consider instead a rule with two choose steps like 

 <pre> 
 rule replicated_rule_1 { 
     ... 
     step take default class hdd 
     step choose firstn 3 type host 
     step choose firstn 1 type osd 
     step emit 
 } 
 </pre> 

If we mark a single OSD down, PGs including that OSD would remap to another OSD on the same host. However, if all of the OSDs on a host are marked down, PGs mapped to that host will not be remapped and will be stuck degraded until the host is actually removed from the hierarchy or reweighted to 0.
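
Extending the same toy model to the two-step rule shows the contrast. Again, this is a simplified sketch with hypothetical names, not the real CRUSH code: the point is that the outer step picks hosts without regard to OSD up/out state, and the inner step can only retry among that host's own OSDs.

```python
# Toy model of "choose firstn 3 type host / choose firstn 1 type osd".
# The outer step selects hosts from the hierarchy, and the inner step
# retries only within each selected host, so a fully-out host leaves
# its slot degraded. Deterministic simplification, not real CRUSH.

def choose_host_then_osd(hosts, n_hosts):
    mapping = []
    for host in list(hosts)[:n_hosts]:      # outer choose: hosts only
        alive = [osd for osd, weight in hosts[host] if weight > 0]
        # inner choose can only retry within this host's OSDs
        mapping.append(alive[0] if alive else None)  # None == degraded slot
    return mapping

hosts = {
    "host-a": [("osd.0", 1.0), ("osd.1", 1.0)],
    "host-b": [("osd.2", 1.0), ("osd.3", 1.0)],
    "host-c": [("osd.4", 1.0), ("osd.5", 1.0)],
}
healthy = choose_host_then_osd(hosts, 3)   # ['osd.0', 'osd.2', 'osd.4']

# One OSD out: the inner step remaps within the same host.
hosts["host-b"] = [("osd.2", 0.0), ("osd.3", 1.0)]
one_out = choose_host_then_osd(hosts, 3)   # ['osd.0', 'osd.3', 'osd.4']

# All of host-b's OSDs out: the slot stays degraded; no other host is tried.
hosts["host-b"] = [("osd.2", 0.0), ("osd.3", 0.0)]
all_out = choose_host_then_osd(hosts, 3)   # ['osd.0', None, 'osd.4']
```

The inner `choose` step has no way to signal the outer step to pick a replacement host, which is the counterintuitive part of the behavior described above.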

 The motivating example is actually wide EC codes on small clusters: 

 <pre> 
 rule ecpool-86 { 
     id 86 
     type erasure 
     step set_chooseleaf_tries 5 
     step set_choose_tries 100 
     step take default class hdd 
     step choose indep 4 type host 
     step chooseleaf indep 4 type osd 
     step emit 
 } 
 </pre> 

In the above case, once a PG has shards on a host, those positions won't remap to another host unless the host is removed or reweighted to 0. See https://github.com/athanatos/ceph/tree/sjust/wip-ec-86-test-62213-62214 for some unit test examples.
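
The same simplified model, applied to the 4-host by 4-OSD erasure-coded case, shows how the shard positions behave. This is a hypothetical sketch (the real CRUSH fills unfillable `indep` slots with CRUSH_ITEM_NONE rather than Python's None, and uses hashing rather than ordered selection):

```python
# Toy sketch of "choose indep 4 type host / chooseleaf indep 4 type osd":
# the outer step fixes four hosts, and each host must supply four shard
# positions. If a host's OSDs are all out, its four positions stay
# unmapped. Simplified model with hypothetical names, not real CRUSH.

NONE = None  # stand-in for CRUSH_ITEM_NONE

def ec_86_map(hosts):
    shards = []
    for host in list(hosts)[:4]:            # outer choose: hosts only
        alive = [osd for osd, weight in hosts[host] if weight > 0]
        for i in range(4):                  # inner chooseleaf: 4 OSDs per host
            shards.append(alive[i] if i < len(alive) else NONE)
    return shards

hosts = {h: [("osd.%d" % (4 * j + i), 1.0) for i in range(4)]
         for j, h in enumerate(["host-a", "host-b", "host-c", "host-d"])}

full = ec_86_map(hosts)        # healthy: all 16 shard positions mapped

# Mark all of host-b's OSDs out: positions 4..7 stay unmapped until the
# host is removed from the hierarchy or reweighted to 0.
hosts["host-b"] = [(osd, 0.0) for osd, _ in hosts["host-b"]]
degraded = ec_86_map(hosts)    # shards 4..7 are NONE
```

With `indep`, positional stability is the point of the mode: each shard keeps its slot, so a fully-out host pins four degraded slots in place rather than shifting shards between hosts.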
