[ https://issues.apache.org/jira/browse/HDFS-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=12890000#action_12890000 ]
Scott Carey commented on HDFS-1094:

bq. I think we are more interested in permanent failures - those that cannot be recovered
That simplifies things. We can ignore rack failure, which is predominantly an availability
problem, not a data loss problem.
Then the TTR issue is primarily about how many nodes per rack are in a group. So the short
answer for keeping TTR from growing too large is to limit the number of nodes a group places
in each rack.
R = replication count.
Let's say that p_0 is our baseline probability that R nodes will simultaneously fail (R = 3,
p_0 = 0.001 in the calculations above).
Now let's find p, the adjusted probability taking TTR into account.
N = machines per rack in a group.
RB = rack bandwidth.
NB = node bandwidth.
if (NB * N) / RB >= (R - 1), then p = p_0.
else, p = p_0 * ((RB * (R - 1)) / (NB * N)) ^ (R - 1).
Assuming p_0 = 0.001 and R = 3 this is more clearly presented as:
IF (NB * N) >= (RB * 2)
p = 0.001
ELSE
p = 0.001 * ((2* RB)/(NB * N)) ^2
If Rack Bandwidth is 10x node bandwidth, then this is:
IF N >= 20, then p = 0.001
ELSE p = 0.001 * (20/N) ^ 2
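The adjustment above can be sketched in a few lines of Python (function and parameter names are mine, not from the attached scripts):

```python
def adjusted_loss_probability(p0, r, n, node_bw, rack_bw):
    """TTR-adjusted loss probability, per the formula above.

    p0      -- baseline probability that R nodes fail simultaneously
    r       -- replication count
    n       -- machines per rack in a group
    node_bw -- node bandwidth (NB)
    rack_bw -- rack bandwidth (RB)
    """
    # If the group's nodes can saturate (R - 1) rack links, TTR is
    # unchanged and so is the loss probability.
    if node_bw * n >= rack_bw * (r - 1):
        return p0
    # Otherwise replication is (RB * (R - 1)) / (NB * N) times slower,
    # and the loss probability scales with slowdown ^ (R - 1).
    slowdown = (rack_bw * (r - 1)) / (node_bw * n)
    return p0 * slowdown ** (r - 1)

# With R = 3 and rack bandwidth 10x node bandwidth (NB = 1, RB = 10):
#   N = 5  -> 4x slower TTR, p = 0.001 * 16 = 0.016
#   N = 10 -> 2x slower TTR, p = 0.001 * 4  = 0.004
#   N >= 20 -> p = p_0 = 0.001
```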
In the 100-rack, 2000-node example you have the below as a subsection:
{noformat}
RING GROUPS (window = 10 racks, 5 machines) => 0.000352741
DISJOINT GROUPS (window = 10 racks, 5 machines) => 0.000177175
RING GROUPS (window = 10 racks, 10 machines) => 0.00151145
DISJOINT GROUPS (window = 10 racks, 10 machines) => 0.000768873
RING GROUPS (window = 10 racks, 20 machines) => 0.00550483
DISJOINT GROUPS (window = 10 racks, 20 machines) => 0.00307498
{noformat}
I'm not sure whether RING GROUPS handle TTR better than DISJOINT GROUPS; I haven't thought
that one through. But assuming it's the same, adjusting for TTR gives these adjusted data
loss odds for the above:
{noformat}
(TTR is 4x longer, so probability of loss is 16x)
RING GROUPS (window = 10 racks, 5 machines) => 0.005643856
DISJOINT GROUPS (window = 10 racks, 5 machines) => 0.0028348
(TTR is 2x longer so probability of loss is 4x)
RING GROUPS (window = 10 racks, 10 machines) => 0.0060458
DISJOINT GROUPS (window = 10 racks, 10 machines) => 0.003075492
RING GROUPS (window = 10 racks, 20 machines) => 0.00550483
DISJOINT GROUPS (window = 10 racks, 20 machines) => 0.00307498
{noformat}
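For anyone who wants to re-derive the adjusted table, it is just the baseline (simulated) probabilities scaled by the multiplier (20/N)^2 under the RB = 10 * NB assumption (variable names below are mine):

```python
# Baseline loss probabilities from the 100-rack, 2000-node example above.
baseline = {
    ("RING", 5): 0.000352741, ("DISJOINT", 5): 0.000177175,
    ("RING", 10): 0.00151145, ("DISJOINT", 10): 0.000768873,
    ("RING", 20): 0.00550483, ("DISJOINT", 20): 0.00307498,
}

def ttr_adjust(p, n, threshold=20):
    """Apply the TTR penalty (threshold/N)^2 while the group's N nodes
    cannot saturate the rack links; no penalty once N >= threshold."""
    return p * max(threshold / n, 1.0) ** 2

adjusted = {(group, n): ttr_adjust(p, n) for (group, n), p in baseline.items()}
# N = 5 gets the 16x multiplier, N = 10 gets 4x, N = 20 is unchanged.
```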
Interestingly, this almost exactly compensates for the shrinking of group size once TTR is
limited by the intra-rack bandwidth in the group.
Note that p_0 itself increases with the cluster size. The more nodes, the higher the likelihood
of co-occurrence of node failures.
My conclusion is that this is useful especially for large clusters with large numbers of
nodes per rack, or larger ratios of intra-rack bandwidth to inter-rack bandwidth. For clusters
at the other end of those spectrums, it's hard to beat the current placement algorithm as long
as replication of missing blocks is done at maximum pace.
Of course in the real world, the Namenode does not issue block replication requests at the
maximum pace. I did some tests a few days ago with three scenarios (decommission, node error,
missing node at startup) and found that the bottleneck in replication is how fast the NN schedules
block replication, not the network or the data nodes. It schedules block replication in batches
that are too small to saturate the network or disks, and does not schedule batches aggressively
enough. So one way to increase data reliability in the cluster is to work on that and therefore
reduce TTR.
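A back-of-the-envelope model of that bottleneck (all numbers below are illustrative assumptions, not measured HDFS defaults): if the NN schedules at most a fixed batch of replications per heartbeat interval, TTR is bounded by scheduling throughput rather than by the network.

```python
def ttr_seconds(blocks_to_replicate, batch_per_interval, interval_s,
                network_blocks_per_s):
    """TTR when replication is the slower of (a) how fast the NN can
    schedule block replications and (b) how fast the network/disks can
    actually move blocks. Purely illustrative model."""
    sched_rate = batch_per_interval / interval_s       # blocks/s the NN schedules
    effective = min(sched_rate, network_blocks_per_s)  # the slower rate wins
    return blocks_to_replicate / effective

# e.g. 100,000 blocks to re-replicate, 50 blocks scheduled per 3-second
# heartbeat, network able to move 500 blocks/s:
#   scheduler-bound TTR ~= 6000 s, versus ~200 s if network-bound.
```

This matches the observation above: raising the scheduling rate (larger batches, more aggressive scheduling) lowers TTR and, by the earlier formula, the data loss probability.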
> Intelligent block placement policy to decrease probability of block loss
> 
>
> Key: HDFS-1094
> URL: https://issues.apache.org/jira/browse/HDFS-1094
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: dhruba borthakur
> Assignee: Rodrigo Schmidt
> Attachments: calculate_probs.py, failure_rate.py, prob.pdf, prob.pdf
>
>
> The current HDFS implementation specifies that the first replica is local and the other
two replicas are on any two random nodes on a random remote rack. This means that if any three
datanodes die together, then there is a nontrivial probability of losing at least one block
in the cluster. This JIRA is to discuss if there is a better algorithm that can lower probability
of losing a block.

This message is automatically generated by JIRA.

You can reply to this email to add a comment to the issue online.
