##### Site index · List index
Message view
Top
From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1094) Intelligent block placement policy to decrease probability of block loss
Date Mon, 19 Jul 2010 20:20:56 GMT
```
[ https://issues.apache.org/jira/browse/HDFS-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890000#action_12890000
]

Scott Carey commented on HDFS-1094:
-----------------------------------

bq. I think we are more interested in permanent failures - those that cannot be recovered

That simplifies things.  We can ignore rack failure, which is predominantly an availability
problem not a data loss problem.

Then, that makes the TTR issue primarily about how many nodes in a group per rack.  So the
short answer for keeping TTR from growing too large is that the number of nodes in a rack.

R = replication count.
Lets say that p_0 is our baseline probability that R nodes will simultaneously fail.  (R=3,
p_0 = 0.001 in the calculations above)

Now lets find p, the adjusted probability taking TTR into account.

N = machines per rack in a group.
RB = rack bandwidth.
NB = node bandwidth.

if (NB * N) / RB >= (R-1), then p = p_0.
else, p = p_0 * ((RB * (R-1)) / (NB * N)) ^ (R-1).

Assuming p_0 = 0.001 and R = 3 this is more clearly presented as:

IF (NB * N) >= (RB * 2)
p = 0.001
ELSE
p = 0.001 * ((2* RB)/(NB * N)) ^2

If Rack Bandwidth is 10x node bandwidth, then this is:
IF N >= 20, then p = 0.001
ELSE p = 0.001 * (20/N) ^ 2

In the 100 rack, 2000 node example you have the below as a subsection:
{noformat}
RING GROUPS (window = 10 racks, 5 machines) => 0.000352741
DISJOINT GROUPS (window = 10 racks, 5 machines) => 0.000177175

RING GROUPS (window = 10 racks, 10 machines) => 0.00151145
DISJOINT GROUPS (window = 10 racks, 10 machines) => 0.000768873

RING GROUPS (window = 10 racks, 20 machines) => 0.00550483
DISJOINT GROUPS (window = 10 racks, 20 machines) => 0.00307498
{noformat}

I'm not sure if RING GROUPS are better at TTR issues than DISJOINT GROUPS, I haven't thought
through that one.  But assuming its the same then adjusting for TTR gives these adjusted data
loss odds for the above:

{noformat}
(TTR is 4x longer, so probability of loss is 16x)
RING GROUPS (window = 10 racks, 5 machines) => 0.005643856
DISJOINT GROUPS (window = 10 racks, 5 machines) => 0.0028348

(TTR is 2x longer so probability of loss is 4x)
RING GROUPS (window = 10 racks, 10 machines) => 0.0060458
DISJOINT GROUPS (window = 10 racks, 10 machines) => 0.003075492

RING GROUPS (window = 10 racks, 20 machines) => 0.00550483
DISJOINT GROUPS (window = 10 racks, 20 machines) => 0.00307498
{noformat}

Interestingly, this almost exactly compensates for the shrinking of group size once  TTR is
limited to the intra-rack bandwidth in the group.

Note that p_0 itself increases with the cluster size.  The more nodes, the higher the likelihood
of co-ocurrance of node failure.

My conclusion is that this is useful especially for large clusters with  large numbers of
nodes per rack, or larger ratios of intra-rack bandwidth to inter-rack bandwidth.  For clusters
on the other end of those spectrums, its hard to beat the current placement algorithm as long
as replication of missing blocks is done at maximum pace.
Of course in the real world, the Namenode does not issue block replication requests at the
maximum pace.  I did some tests a few days ago with three scenarios (decommission, node error,
missing node at startup) and found that the bottleneck in replication is how fast the NN schedules
block replication, not the network or the data nodes.  It schedules block replication in batches
that are too small to saturate the network or disks, and  does not schedule batches aggressively
enough. So one way to increase data reliability in the cluster is to work on that and therefore
reduce TTR.

> Intelligent block placement policy to decrease probability of block loss
> ------------------------------------------------------------------------
>
>                 Key: HDFS-1094
>                 URL: https://issues.apache.org/jira/browse/HDFS-1094
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: dhruba borthakur
>            Assignee: Rodrigo Schmidt
>         Attachments: calculate_probs.py, failure_rate.py, prob.pdf, prob.pdf
>
>
> The current HDFS implementation specifies that the first replica is local and the other
two replicas are on any two random nodes on a random remote rack. This means that if any three
datanodes die together, then there is a non-trivial probability of losing at least one block
in the cluster. This JIRA is to discuss if there is a better algorithm that can lower probability
of losing a block.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

```
Mime
View raw message