Project

General

Profile

Bug #62213

Updated by Samuel Just 10 months ago

The motivating example was: 

 rule ecpool-86 { 
     id 86 
     type erasure 
     step set_chooseleaf_tries 5 
     step set_choose_tries 100 
     step take default class hdd 
     step choose indep 4 type host 
     step chooseleaf indep 4 type osd 
     step emit 
 } 

 If all of the OSDs in a host are marked out, crush_choose_indep on a leaf bucket with recurse_to_leaf=1 (chooseleaf rather than choose) will populate out2 before the is_out check: 

 <pre> 
				 if (recurse_to_leaf) { 
					 if (item < 0) { 
						 crush_choose_indep( 
							 map, 
							 work, 
							 map->buckets[-1-item], 
							 weight, weight_max, 
							 x, 1, numrep, 0, 
							 out2, rep, 
							 recurse_tries, 0, 
							 0, NULL, r, choose_args); 
						 if (out2 && out2[rep] == CRUSH_ITEM_NONE) { 
							 /* placed nothing; no leaf */ 
							 break; 
						 } 
					 } else if (out2) { 
						 /* we already have a leaf! */ 
						 out2[rep] = item; 
					 } 
				 } 

				 /* out? */ 
				 if (itemtype == 0 && 
				     is_out(map, weight, weight_max, item, x)) 
					 break; 

				 /* yay! */ 
				 out[rep] = item; 
				 left--; 
				 break; 
 </pre> 

 If it exhausts retries (ftotal >= tries), out osds placed into out2 will still be there upon return to the caller resulting in out osds in the mapped set. 

 chooseleaf with any type other than osd won't trigger this bug. 

 This issue can be worked around by using choose rather than chooseleaf (the behavior should otherwise be the same). 

 Note, this is a bug in CRUSH itself, so we can't simply patch the behavior.    It'll need a tunable gated on client capability, etc. 

 A short term mitigation might be to reject the creation of crush rules like this with an error message to use the choose variant instead along with a warning for any pre-existing rules like this. 

 See <branchname> for a unit test reproducer.

Back