Karthik Ranganathan commented on HDFS-1094:

I think there is a slight change in the way probability should be calculated if the block
placement policy enforces that certain blocks reside on a subset of machines. Nevertheless,
I went with the probability of losing data as opposed to the expected number of block losses.
Scheme 1 - pick any machine and put blocks there. Further, assume that f = r in your example.
P(of losing data given r failures)
= P(of losing at least 1 block)
= 1 - P(of not losing any block)
= 1 - (P(of not losing a specific block) ^ B)
= 1 - ((1 - 1/C(N,r)) ^ B)
Scheme 2 - assume that you have a fixed pool of machines that you replicate blocks to. For
simplicity, I am going to assume what this means is that there are K machines that contain
a set of blocks and all their replicas. So there are (N/K) such sets of machines. Further,
assuming an even distribution, there are only B/(N/K) blocks in this set of K machines.
P(of losing data given r failures)
= P(r failures being in one set of K machines) * P(of losing at least 1 block in that set)
P(r failures being in one set of K machines) = C(N/K,1)*C(K,r)/C(N,r)
P(of losing at least 1 block in that set) = 1 - ((1 - 1/C(K,r)) ^ (B/(N/K))) -> this
follows from the fact that there are K nodes and B/(N/K) blocks.
Plugging in B = 30M, N = 1000, f = 3, r = 3 and K = 60 (replicate all blocks in the previous and
next rack, 20 machines per rack):
Scheme 1 : P(data loss) = 1 - ((1 - 1/C(1000,3)) ^ 30M) = 0.165
Scheme 2 : P(data loss) = P(r failures being in one set of K machines)*P(of losing at least
1 block in that set) = 0.0034 * 1 = 0.0034
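For anyone who wants to re-check the arithmetic, here is a rough Python sketch of the two formulas above. The function names are mine and purely illustrative, and I am assuming Python 3.8+ for math.comb. It also keeps N/K as a fraction (1000/60 is not a whole number), which is the same simplification made in the formulas.

```python
from math import comb, expm1, log1p


def p_at_least_one_lost(combos, blocks):
    """Return 1 - (1 - 1/combos) ** blocks.

    Uses log1p/expm1 so the result stays accurate when 1/combos is tiny
    and blocks is in the millions.
    """
    return -expm1(blocks * log1p(-1.0 / combos))


def scheme1(N, r, B):
    # Blocks placed on any r of the N machines.
    return p_at_least_one_lost(comb(N, r), B)


def scheme2(N, r, B, K):
    # Blocks confined to sets of K machines; N/K such sets (assumed even split).
    sets = N / K
    p_same_set = sets * comb(K, r) / comb(N, r)
    return p_same_set * p_at_least_one_lost(comb(K, r), B / sets)


# Values from the comment: B = 30M, N = 1000, r = 3, K = 60.
N, r, B, K = 1000, 3, 30_000_000, 60
p1 = scheme1(N, r, B)
p2 = scheme2(N, r, B, K)
print(f"Scheme 1: {p1:.4f}")
print(f"Scheme 2: {p2:.4f}")
```

With these inputs the two values should round to about 0.165 and 0.0034, matching the figures above.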
Am I doing something wrong?
