Bug #15653: crush: low weight devices get too many objects for num_rep > 1 - RADOS - Ceph

Bug #15653

Updated by Loïc Dachary about 7 years ago

h3. description, with example 

 CRUSH will correctly choose items with relative weights with the right probabilities for each independent choice.    However, when choosing multiple replicas, each choice is *not* indepent, since it  
 has to be unique.    The result is that low-weighted devices get too many items. 

 Simple example: 
 <pre> 

 maetl:src (master) 03:20 PM $ cat cm.txt 
 # begin crush map 

 # devices 
 device 0 device0 
 device 1 device1 
 device 2 device2 
 device 3 device3 
 device 4 device4 

 # types 
 type 0 osd 
 type 1 domain 
 type 2 pool 

 # buckets 
 domain root { 
	 id -1 		 # do not change unnecessarily 
	 # weight 5.000 
	 alg straw2 
	 hash 0 	 # rjenkins1 
	 item device0 weight 10.00 
	 item device1 weight 10.0 
	 item device2 weight 10.0 
	 item device3 weight 10.0 
	 item device4 weight 1.000 
 } 

 # rules 
 rule data { 
	 ruleset 0 
	 type replicated 
	 min_size 1 
	 max_size 10 
	 step take root 
	 step choose firstn 0 type osd 
	 step emit 
 } 

 # end crush map 
 maetl:src (master) 03:20 PM $ ./crushtool -c cm.txt    -o cm 
 maetl:src (master) 03:20 PM $ ./crushtool -i cm --test --show-utilization --num-rep 1 --min-x 1 --max-x 1000000 --num-rep 1 
 rule 0 (data), x = 1..1000000, numrep = 1..1 
 rule 0 (data) num_rep 1 result size == 1: 	 1000000/1000000 
   device 0: 		  stored : 243456 	  expected : 200000 
   device 1: 		  stored : 243624 	  expected : 200000 
   device 2: 		  stored : 244486 	  expected : 200000 
   device 3: 		  stored : 243881 	  expected : 200000 
   device 4: 		  stored : 24553 	  expected : 200000 
 maetl:src (master) 03:20 PM $ ./crushtool -i cm --test --show-utilization --num-rep 1 --min-x 1 --max-x 1000000 --num-rep 3 
 rule 0 (data), x = 1..1000000, numrep = 3..3 
 rule 0 (data) num_rep 3 result size == 3: 	 1000000/1000000 
   device 0: 		  stored : 723984 	  expected : 600000 
   device 1: 		  stored : 722923 	  expected : 600000 
   device 2: 		  stored : 723153 	  expected : 600000 
   device 3: 		  stored : 723394 	  expected : 600000 
   device 4: 		  stored : 106546 	  expected : 600000 
 </pre> 

 Note that in the 1x case, we get 1/10th the items on device 4, as expected. For 3x, it grows to 1/7th.    For lower weights the amplification is more pronounced. 

 h3. detailed explanation 

 The chances of getting a particular device during the first draw is the weight of the device divided by the sum of the weight of all devices. For example let say there are 5 devices in a bucket, with the following weights a = 10, b = 10, c = 10, d = 10, e = 1. The chances of getting e is 1/41 and the chances of getting a is 10/41. 

 Things get more complicated for the second draw because we have to account for a first draw that does not include a given device: it is the sum of the weight of all devices except the one we're interested in, divided by the weight of all devices. So, if we want to know the chances of e showing up in the second draw, the first draw must not include it and this has a 40/41 chance of happening. Also, during the second draw, the chance of getting e is increased because there is one less device to chose from (the one that was picked during the first draw): 1/31 (i.e. 41 - the weight of the device that was chosen). Because the second draw depends on the first draw, the probability must be multiplied: 40/41 * 1/31. 

 Since the chance of getting the device e in a first draw and getting the device e in a second draw are independent, the chances of getting the device e in both situations is the sum of their probability: 1/41 (first draw) + (40/41 * 1/31) (second draw). 

 This is a special case because all devices have the same weight except for e. If we are to calculate the probability of a being selected in the second draw, we have to sum the case where e is selected and the case where b, c or d is selected during the first draw, because they do not have the same weight. If e is selected during the first draw, a will be selected during the second draw with a probability of (1/41 * 10/40). If b, c, or d is selected during the first draw, a will be selected during the second draw with a probability of (30/41 * 10/31). 

 The chances of getting the device a in the first draw and the second draw is therefore: 10/41+(30/41 * 10/31)+(1/41 * 10/40) 

 To summarize: 

 * probability of getting e : 1/41 * (40/41 * 1/31) = .05586 
 * probability of getting a : 10/41+(30/41 * 10/31)+(1/41 * 10/40) = .48603 

 We are therefore 8.7 ( 0.48603/0.05586 ) more likely to get e than to get a. 

 From the point of view of the users, this is counter intuitive because they expect that the weight reflects the probability, which is only true for a single draw. With just one draw a is (10/41)/(1/41) = 10 times more likely to be selected than e. With two draws, a is only 8.7 times more likely to be selected than e, as shown above.

Back

Project

General

Profile

Ceph » RADOS

Bug #15653