[ https://issues.apache.org/jira/browse/HDFS-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856257#action_12856257 ]

Karthik Ranganathan commented on HDFS-1094:

I think the probability should be calculated slightly differently when the block placement
policy enforces that certain blocks reside only on a subset of machines. Also, I went with the
probability of losing data rather than the expected number of lost blocks.
Scheme 1 - pick any machine and put blocks there. Further, assume that f = r in your example.

P(of losing data given r failures)
= P(of losing at least 1 block)
= 1 - P(of not losing any block)
= 1 - (P(of not losing a specific block) ^ B)
= 1 - ((1 - 1/C(N,r)) ^ B)
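As a quick sanity check, the Scheme 1 formula can be evaluated directly (a sketch in Python; B, N, r are the block count, cluster size, and number of simultaneous failures defined above):

```python
from math import comb

def p_loss_scheme1(B, N, r):
    """P(losing at least one block | r simultaneous node failures),
    when each block's r replicas sit on a uniformly random r-subset
    of the N nodes: 1 - (1 - 1/C(N,r)) ^ B."""
    return 1 - (1 - 1 / comb(N, r)) ** B

# Numbers from the example below: B = 30M blocks, N = 1000 nodes, r = 3
print(p_loss_scheme1(30_000_000, 1000, 3))  # ~0.165
```

This reproduces the 0.165 figure used later in the comment.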
Scheme 2 - assume that you have a fixed pool of machines that you replicate blocks to. For
simplicity, I am going to assume this means there are K machines that contain a set of blocks
and all their replicas, so there are (N/K) such sets of machines. Further, assuming an even
distribution, there are only B/(N/K) blocks in any one set of K machines.

P(of losing data given r failures)
= P(r failures being in one set of K machines) * P(of losing at least 1 block in that set)

P(r failures being in one set of K machines) = C(N/K,1) * C(K,r) / C(N,r)
P(of losing at least 1 block in that set) = 1 - ((1 - 1/C(K,r)) ^ (B/(N/K))) --> this
follows from the fact that there are K nodes and B/(N/K) blocks in the set.
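The Scheme 2 expression can likewise be checked numerically (a sketch; note that N/K need not be an integer, so the number of sets is kept as a float, matching the C(N/K,1) term above):

```python
from math import comb

def p_loss_scheme2(B, N, r, K):
    """P(data loss | r failures) when every block is confined to one of
    N/K disjoint sets of K machines:
    P(all r failures land in one set) * P(>=1 block lost in that set)."""
    sets = N / K                        # C(N/K, 1); may be fractional
    p_colocated = sets * comb(K, r) / comb(N, r)
    p_block_loss = 1 - (1 - 1 / comb(K, r)) ** (B / sets)
    return p_colocated * p_block_loss

# Numbers from the example below: B = 30M, N = 1000, r = 3, K = 60
print(p_loss_scheme2(30_000_000, 1000, 3, 60))  # ~0.0034
```

With these inputs the second factor is numerically 1 (each set holds 1.8M blocks, so losing a whole set of K machines almost surely loses some block), which is why the final number is dominated by the co-location probability.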
Plugging in B = 30M, N = 1000, f = r = 3, and K = 60 (replicate all blocks in the previous and
next rack, 20 machines per rack):

Scheme 1: P(data loss) = 1 - ((1 - 1/C(1000,3)) ^ 30M) = 0.165
Scheme 2: P(data loss) = P(r failures being in one set of K machines) * P(of losing at least
1 block in that set) = 0.0034 * 1 = 0.0034
Am I doing something wrong?
> Intelligent block placement policy to decrease probability of block loss
> -------------------------------------------------------------------------
>
> Key: HDFS-1094
> URL: https://issues.apache.org/jira/browse/HDFS-1094
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: dhruba borthakur
> Assignee: dhruba borthakur
>
> The current HDFS implementation specifies that the first replica is local and the other
> two replicas are on any two random nodes on a random remote rack. This means that if any three
> datanodes die together, there is a non-trivial probability of losing at least one block
> in the cluster. This JIRA is to discuss whether there is a better algorithm that can lower
> the probability of losing a block.

