## Bug #15653

Updated by Loic Dachary 2 months ago

h3. description, with example

CRUSH will correctly choose items with relative weights with the right probabilities for each independent choice. However, when choosing multiple replicas, each choice is *not* indepent, since it

has to be unique. The result is that low-weighted devices get too many items.

Simple example:

<pre>

maetl:src (master) 03:20 PM $ cat cm.txt

# begin crush map

# devices

device 0 device0

device 1 device1

device 2 device2

device 3 device3

device 4 device4

# types

type 0 osd

type 1 domain

type 2 pool

# buckets

domain root {

id -1 # do not change unnecessarily

# weight 5.000

alg straw2

hash 0 # rjenkins1

item device0 weight 10.00

item device1 weight 10.0

item device2 weight 10.0

item device3 weight 10.0

item device4 weight 1.000

}

# rules

rule data {

ruleset 0

type replicated

min_size 1

max_size 10

step take root

step choose firstn 0 type osd

step emit

}

# end crush map

maetl:src (master) 03:20 PM $ ./crushtool -c cm.txt -o cm

maetl:src (master) 03:20 PM $ ./crushtool -i cm --test --show-utilization --num-rep 1 --min-x 1 --max-x 1000000 --num-rep 1

rule 0 (data), x = 1..1000000, numrep = 1..1

rule 0 (data) num_rep 1 result size == 1: 1000000/1000000

device 0: stored : 243456 expected : 200000

device 1: stored : 243624 expected : 200000

device 2: stored : 244486 expected : 200000

device 3: stored : 243881 expected : 200000

device 4: stored : 24553 expected : 200000

maetl:src (master) 03:20 PM $ ./crushtool -i cm --test --show-utilization --num-rep 1 --min-x 1 --max-x 1000000 --num-rep 3

rule 0 (data), x = 1..1000000, numrep = 3..3

rule 0 (data) num_rep 3 result size == 3: 1000000/1000000

device 0: stored : 723984 expected : 600000

device 1: stored : 722923 expected : 600000

device 2: stored : 723153 expected : 600000

device 3: stored : 723394 expected : 600000

device 4: stored : 106546 expected : 600000

</pre>

Note that in the 1x case, we get 1/10th the items on device 4, as expected. For 3x, it grows to 1/7th. For lower weights the amplification is more pronounced.

h3. detailed explanation

The chances of getting a particular device during the first draw is the weight of the device divided by the sum of the weight of all devices. For example let say there are 5 devices in a bucket, with the following weights a = 10, b = 10, c = 10, d = 10, e = 1. The chances of getting e is 1/41 and the chances of getting a is 10/41.

Things get more complicated for the second draw because we have to account for a first draw that does not include a given device: it is the sum of the weight of all devices except the one we're interested in, divided by the weight of all devices. So, if we want to know the chances of e showing up in the second draw, the first draw must not include it and this has a 40/41 chance of happening. Also, during the second draw, the chance of getting e is increased because there is one less device to chose from (the one that was picked during the first draw): 1/31 (i.e. 41 - the weight of the device that was chosen). Because the second draw depends on the first draw, the probability must be multiplied: 40/41 * 1/31.

Since the chance of getting the device e in a first draw and getting the device e in a second draw are independent, the chances of getting the device e in both situations is the sum of their probability: 1/41 (first draw) + (40/41 * 1/31) (second draw).

This is a special case because all devices have the same weight except for e. If we are to calculate the probability of a being selected in the second draw, we have to sum the case where e is selected and the case where b, c or d is selected during the first draw, because they do not have the same weight. If e is selected during the first draw, a will be selected during the second draw with a probability of (1/41 * 10/40). If b, c, or d is selected during the first draw, a will be selected during the second draw with a probability of (30/41 * 10/31).

The chances of getting the device a in the first draw and the second draw is therefore: 10/41+(30/41 * 10/31)+(1/41 * 10/40)

To summarize:

* probability of getting e : 1/41 * (40/41 * 1/31) = .05586

* probability of getting a : 10/41+(30/41 * 10/31)+(1/41 * 10/40) = .48603

We are therefore 8.7 ( 0.48603/0.05586 ) more likely to get e than to get a.

From the point of view of the users, this is counter intuitive because they expect that the weight reflects the probability, which is only true for a single draw. With just one draw a is (10/41)/(1/41) = 10 times more likely to be selected than e. With two draws, a is only 8.7 times more likely to be selected than e, as shown above.

CRUSH will correctly choose items with relative weights with the right probabilities for each independent choice. However, when choosing multiple replicas, each choice is *not* indepent, since it

has to be unique. The result is that low-weighted devices get too many items.

Simple example:

<pre>

maetl:src (master) 03:20 PM $ cat cm.txt

# begin crush map

# devices

device 0 device0

device 1 device1

device 2 device2

device 3 device3

device 4 device4

# types

type 0 osd

type 1 domain

type 2 pool

# buckets

domain root {

id -1 # do not change unnecessarily

# weight 5.000

alg straw2

hash 0 # rjenkins1

item device0 weight 10.00

item device1 weight 10.0

item device2 weight 10.0

item device3 weight 10.0

item device4 weight 1.000

}

# rules

rule data {

ruleset 0

type replicated

min_size 1

max_size 10

step take root

step choose firstn 0 type osd

step emit

}

# end crush map

maetl:src (master) 03:20 PM $ ./crushtool -c cm.txt -o cm

maetl:src (master) 03:20 PM $ ./crushtool -i cm --test --show-utilization --num-rep 1 --min-x 1 --max-x 1000000 --num-rep 1

rule 0 (data), x = 1..1000000, numrep = 1..1

rule 0 (data) num_rep 1 result size == 1: 1000000/1000000

device 0: stored : 243456 expected : 200000

device 1: stored : 243624 expected : 200000

device 2: stored : 244486 expected : 200000

device 3: stored : 243881 expected : 200000

device 4: stored : 24553 expected : 200000

maetl:src (master) 03:20 PM $ ./crushtool -i cm --test --show-utilization --num-rep 1 --min-x 1 --max-x 1000000 --num-rep 3

rule 0 (data), x = 1..1000000, numrep = 3..3

rule 0 (data) num_rep 3 result size == 3: 1000000/1000000

device 0: stored : 723984 expected : 600000

device 1: stored : 722923 expected : 600000

device 2: stored : 723153 expected : 600000

device 3: stored : 723394 expected : 600000

device 4: stored : 106546 expected : 600000

</pre>

Note that in the 1x case, we get 1/10th the items on device 4, as expected. For 3x, it grows to 1/7th. For lower weights the amplification is more pronounced.

h3. detailed explanation

The chances of getting a particular device during the first draw is the weight of the device divided by the sum of the weight of all devices. For example let say there are 5 devices in a bucket, with the following weights a = 10, b = 10, c = 10, d = 10, e = 1. The chances of getting e is 1/41 and the chances of getting a is 10/41.

Things get more complicated for the second draw because we have to account for a first draw that does not include a given device: it is the sum of the weight of all devices except the one we're interested in, divided by the weight of all devices. So, if we want to know the chances of e showing up in the second draw, the first draw must not include it and this has a 40/41 chance of happening. Also, during the second draw, the chance of getting e is increased because there is one less device to chose from (the one that was picked during the first draw): 1/31 (i.e. 41 - the weight of the device that was chosen). Because the second draw depends on the first draw, the probability must be multiplied: 40/41 * 1/31.

Since the chance of getting the device e in a first draw and getting the device e in a second draw are independent, the chances of getting the device e in both situations is the sum of their probability: 1/41 (first draw) + (40/41 * 1/31) (second draw).

This is a special case because all devices have the same weight except for e. If we are to calculate the probability of a being selected in the second draw, we have to sum the case where e is selected and the case where b, c or d is selected during the first draw, because they do not have the same weight. If e is selected during the first draw, a will be selected during the second draw with a probability of (1/41 * 10/40). If b, c, or d is selected during the first draw, a will be selected during the second draw with a probability of (30/41 * 10/31).

The chances of getting the device a in the first draw and the second draw is therefore: 10/41+(30/41 * 10/31)+(1/41 * 10/40)

To summarize:

* probability of getting e : 1/41 * (40/41 * 1/31) = .05586

* probability of getting a : 10/41+(30/41 * 10/31)+(1/41 * 10/40) = .48603

We are therefore 8.7 ( 0.48603/0.05586 ) more likely to get e than to get a.

From the point of view of the users, this is counter intuitive because they expect that the weight reflects the probability, which is only true for a single draw. With just one draw a is (10/41)/(1/41) = 10 times more likely to be selected than e. With two draws, a is only 8.7 times more likely to be selected than e, as shown above.