Bug #62213
Updated by Samuel Just 10 months ago
The motivating example was:

<pre>
rule ecpool-86 {
	id 86
	type erasure
	step set_chooseleaf_tries 5
	step set_choose_tries 100
	step take default class hdd
	step choose indep 4 type host
	step chooseleaf indep 4 type osd
	step emit
}
</pre>

If all of the OSDs in a host are marked out, crush_choose_indep on a leaf bucket with recurse_to_leaf=1 (chooseleaf rather than choose) will populate out2 before the is_out check:

<pre>
			if (recurse_to_leaf) {
				if (item < 0) {
					crush_choose_indep(
						map, work,
						map->buckets[-1-item],
						weight, weight_max,
						x, 1, numrep, 0,
						out2, rep,
						recurse_tries, 0, 0,
						NULL, r, choose_args);
					if (out2 && out2[rep] == CRUSH_ITEM_NONE) {
						/* placed nothing; no leaf */
						break;
					}
				} else if (out2) {
					/* we already have a leaf! */
					out2[rep] = item;
				}
			}

			/* out? */
			if (itemtype == 0 &&
			    is_out(map, weight, weight_max, item, x))
				break;

			/* yay! */
			out[rep] = item;
			left--;
			break;
</pre>

If it exhausts its retries (ftotal >= tries), out OSDs placed into out2 will still be there upon return to the caller, resulting in out OSDs in the mapped set. chooseleaf with any type other than osd won't trigger this bug. This issue can be worked around by using choose rather than chooseleaf (the behavior should otherwise be the same).

Note, it's hard to predict the impact of this bug in a real cluster. There are relatively few clusters with multiple choose steps like this, and if the out OSDs are also down, they'll be filtered out later anyway. Nevertheless, it's probably worth avoiding, as CRUSH isn't supposed to be able to map out OSDs.

This is a bug in CRUSH itself, so we can't simply patch the behavior; it'll need a tunable gated on client capability, etc. A short-term mitigation might be to reject the creation of crush rules like this with an error message to use the choose variant instead, along with a warning for any pre-existing rules like this.

See <branchname> for a unit test reproducer.
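For reference, the choose-based workaround mentioned above would amount to replacing the final chooseleaf step of the motivating rule with its choose equivalent, e.g. (same rule otherwise; the name and id are carried over from the example):

<pre>
rule ecpool-86 {
	id 86
	type erasure
	step set_chooseleaf_tries 5
	step set_choose_tries 100
	step take default class hdd
	step choose indep 4 type host
	step choose indep 4 type osd
	step emit
}
</pre>

Since the chooseleaf step's type is already osd, the choose variant selects the same leaves without the recurse_to_leaf path that populates out2.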