[ https://issues.apache.org/jira/browse/HDFS-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=12887136#action_12887136 ]
Scott Carey commented on HDFS-1094:

The "node group" versus Rodrigo's definition:
Node groups: yes, overlap is bad. But no overlap is also bad. Consider:
If there is no overlap, and the policy is to write a 3-replica block to two racks, one with
two replicas and the other with one, then the minimal group size is four: you must have two
racks and at least two nodes per rack. A larger group must always have two
nodes per rack. With N racks in the node group, a rack failure reduces 1/N of the blocks
to 1 replica.
It is possible to define overlapping groups with a better worst-case scenario here, with the
trade-off being increased risk due to overlap.
However, this whole enterprise needs to be built on a better foundation.
Some problems:
h5. Disks versus nodes:
Start considering disk failure, not just node failure. Newer Hadoop versions don't fail
the whole node when one of its disks fails. There are several strategies based on disk placement
rather than node placement that reduce data loss without lengthening the time to replicate. This
is because the dominant factor in time to replicate is network bandwidth, which can only be
improved for replication if node groups or the placement policy span more racks and nodes.
What we should really be talking about here is "disk groups" and "disk placement policy", not
"node groups" and "node placement policy".
Disk is to Node as Node is to Rack.
Limiting the combinatorics at the rack level is the most dangerous and increases time to recovery
the most.
Limiting the combinatorics at the node level is moderately dangerous because it increases
time to recovery somewhat.
Limiting them at the disk level is a 'free' win, since it barely increases the time to recovery
at all: that is limited mostly by network bandwidth.
Consider this: in a 400-node cluster, each node with 4 disks, there are C(400,3) ways for a
3-node failure to occur.
But there are C(1600,3) ways for a 3-disk failure to occur. If you restrict the way replicas
are placed not by node or rack, but simply by disk (say, take the block # mod the # of disks
in a node to choose which disk), then this reduces the number of disk failure combinations
that can cause data loss from C(1600,3) to 4*C(400,3). This is a factor of 16 improvement
(the number of disks per node, squared) without any loss in TTR as long as disk speed >= network speed.
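Those combinatorics can be sanity-checked with a quick sketch (cluster sizes as assumed above: 400 nodes, 4 disks each):

```python
from math import comb

nodes, disks_per_node = 400, 4
disks = nodes * disks_per_node  # 1600 disks total

# Unrestricted: any 3 disks in the cluster can jointly hold a block's replicas.
unrestricted = comb(disks, 3)

# Restricted: each block is pinned to one disk index per node (block # mod 4),
# so only 3-disk failures sharing a disk index can lose a block.
restricted = disks_per_node * comb(nodes, 3)

print(unrestricted // restricted)  # 16 == disks_per_node ** 2
```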
h5. Time to recovery
Since the odds of 3 simultaneous failures depend highly on TTR, this is a huge factor.
Here is a simple example of how time to recovery is impacted. The time to recover is roughly
equal to the maximum amount of data on a single 'survivor' node that must be replicated. If
we simplify this to situations where replication has dropped to 1, this is easy to analyze:
you must get the data off that worst-case node.
A similar analysis has to be done for a rack as well: 'sole survivor' blocks
in a rack have to be moved off-rack or data loss will be a problem.
For a node, the limit is the network, usually 1Gbit/sec or ~100MB/sec. Moving 1TB of data
takes ~10,000 seconds, or 2 hours and 47 minutes! Brand new Hadoop hardware tends to
have 8TB on a node these days (though network bonding to 2Gbps is becoming more common).
Bandwidth between racks tends to be between 1/16 and 1/4 of the aggregate bandwidth within
a rack. So the worst-case time to replicate tends to be longer for a rack than for a node, if the goal
is to make sure all lost data exists on at least two racks.
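The node-drain arithmetic above is easy to verify (a sketch using the ~100MB/sec figure from the text):

```python
TB = 10**12
MB = 10**6

node_bw = 100 * MB           # ~1 Gbit/sec expressed in bytes/sec
seconds = 1 * TB / node_bw   # time to drain 1 TB off a single node
print(seconds)               # 10000.0 seconds, about 2 hours 47 minutes
```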
So how does that affect things?
Let's assume a node has a 1 in 10,000 chance of failure in a given hour (there are 8,760 hours
in a year), and consider only 3 nodes for now.
If one node fails, the odds of the other two failing within an hour are (1/10000)^2, one in 100,000,000.
If the time to recover is 3 hours instead of 1, the odds are (3/10000)^2, one in ~11,111,111.
Increasing the time to recover by a factor of 3 increases the chance of data loss by a factor
of 9. If this were a 4-replica situation, the relation would be even stronger.
The relationship between TTR and the chance of data loss is
chance of data loss ∝ TTR^(R-1)
where TTR is the time to recover and R is the number of replicas.
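A minimal sketch of that relationship, using the 1-in-10,000 per-hour failure odds assumed above:

```python
# Chance a given node fails in any one hour (assumption from the example above).
p_hour = 1 / 10000

def loss_odds(ttr_hours, replicas=3):
    # After one node fails, data is lost only if the remaining
    # (replicas - 1) copies all fail within the recovery window.
    return (ttr_hours * p_hour) ** (replicas - 1)

print(1 / loss_odds(1))             # ~1e8: one in 100,000,000
print(1 / loss_odds(3))             # ~1.1e7: one in ~11,111,111
print(loss_odds(3) / loss_odds(1))  # 9.0: tripling TTR means 9x the risk
```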
For node failure, we consider the number of replicas = 3. For rack failure, it is 2 (blocks
are only on 2 racks). So node failure is more sensitive to TTR than rack failure, but the
latter is obviously more dangerous.
h5. "node groups" or "restricted placement" and TTR
If the replication algorithm prioritizes the lowest replication count blocks first, then the
TTR is proportional to the size of the 'sole survivor' data on the node with the most such
data.
Assume M racks, N nodes per rack, 4TB of data per node, 1Gbps node bandwidth, 10Gbps rack bandwidth.
Analysis of the current (random) policy: write local, then choose a random rack and node for the second
replica, and for the third replica choose a random node in the same rack as the second.
* Rack Failure: When a rack fails, it leaves behind many 'sole survivor' blocks. In general,
2/3 of its blocks have only one replica outside the rack and require inter-rack transfer,
so (2/3) * N * 4TB has to be transferred across racks. The other 1/3 can be transferred in-rack, which
is not the limiting factor here.
These replicas are evenly spread over the remaining M-1 racks, and copying immediately starts,
replicating blocks to another rack and then to another node in that rack. The rack
bandwidth is the limiting factor:
(2/3)(N * 4TB)/(M-1) is transferred off of each rack to another. At 10Gbps rack bandwidth this
is ~1000MB/sec. If we assume 40 servers in a rack and 10 racks, this is ~11.8TB at 1000MB/sec,
or ~11,800 seconds, about 3.3 hours.
* Node Failure: A single node, given a perfect placement policy by the NN, balancer, and
writers, should not hold two replicas of one block. Its TTR to get the replica count back
to the pre-failure state is as follows:
All 4TB of its blocks have to move, but these are spread out over many racks. 1/3 of the node's
data was written by it, and those replicas are all off-rack in pairs. Only one node of each pair
has to generate a copy, but in order to maintain good block
placement this replication has to go off-rack. So replication goes between racks for
1/3 of the data. The other 2/3 of the data is 'paired' data, where one remaining replica is already
in-rack and the third is in another rack. This data can be replicated within the rack. If scheduled
optimally, 2/3 of the data is replicated in-rack evenly, and 1/3 is replicated cross-rack evenly.
In-rack replication is limited by node bandwidth, and the 2/3 of the data that needs this
sort of replication is spread across the remaining nodes. It is not spread evenly (there
is more in the rack where the node failed) but can be anywhere on the cluster. So this
is 2/3 * 4TB / (N*M - 1) = ~6.7GB per node @ ~100MB/sec = ~67 seconds.
The 1/3 of the data that must flow between racks takes longer: it is evenly distributed across
nodes on other racks, so each rack must transfer 1/3 * 4TB / M off itself, in this example
133GB at 10Gbps = ~133 seconds.
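The rack- and node-failure TTR figures above can be reproduced with a short sketch (parameters as assumed in the text: M=10 racks, N=40 nodes, 4TB per node):

```python
TB, GB, MB = 10**12, 10**9, 10**6
M, N = 10, 40            # racks, nodes per rack
data_per_node = 4 * TB
node_bw = 100 * MB       # ~1 Gbit/sec per node, in bytes/sec
rack_bw = 1000 * MB      # ~10 Gbit/sec per rack, in bytes/sec

# Rack failure: 2/3 of the dead rack's data needs an off-rack copy,
# spread over the M-1 surviving racks, limited by rack bandwidth.
rack_ttr = (2/3) * N * data_per_node / (M - 1) / rack_bw
print(rack_ttr / 3600)   # ~3.3 hours

# Node failure: 2/3 replicates in-rack across the N*M-1 survivors,
# 1/3 must cross racks, evenly spread over M racks.
in_rack = (2/3) * data_per_node / (N * M - 1) / node_bw
cross_rack = (1/3) * data_per_node / M / rack_bw
print(in_rack, cross_rack)  # ~67 seconds and ~133 seconds
```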
Now, what about a policy that creates the minimal 4-node 'node group', two nodes in one rack and two
in another? If a rack dies, the calculation is similar to the above. Assuming node groups pair
racks randomly with no limitations, the "survivor" halves of the groups must
choose new partner node pairs and replicate to them, off-rack. 2/3 of the blocks will have
only one remaining replica, 1/3 will have two. None of the surviving replicas is off-rack, so all of
the data must be transferred across racks.
This means the time to replicate is 3/2 that of the policy above, about 4.9 hours. This directly
translates into a 50% increased chance of data loss or availability loss from rack failure.
When a node dies out of a 'set' of four, all of the blocks it wrote (1/3 of its data) are on the two
nodes in the other rack. The other 2/3 of its data was shared with its 'peer' in the same
rack, which now must replicate all of that data to a new peer in the same rack or elsewhere.
Let's give it the benefit of the doubt and assume it can choose a peer in the same rack.
Recovery is limited by this replication: 2/3 of 4TB from one node to another, which takes ~7.4
hours. The nodes in the other rack must also write to this new peer to raise their replication
from 2 to 3, which increases this time even further, because the peer can only write so fast. Those
blocks could be replicated to the other node instead, but then the pair would be very unbalanced.
The absolute minimum time is the 7.4 hours above, and the 'balanced' time is 1.5x that.
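A quick check of that 7.4-hour figure (2/3 of 4TB pushed over a single ~100MB/sec node link):

```python
TB, MB = 10**12, 10**6
node_bw = 100 * MB            # ~1 Gbit/sec, in bytes/sec

# Data the surviving same-rack peer must push to a replacement peer.
peer_data = (2/3) * 4 * TB

ttr = peer_data / node_bw
print(ttr / 3600)             # ~7.4 hours; the 'balanced' case is 1.5x this
```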
h4. Factoring in TTR
The odds that a given 3-node failure loses a block's data on a 400-node cluster with
uniform block placement and replication = 3 is one in C(400,3), i.e. one in ~10.6 million.
With 100 groups of 4 nodes, the likelihood is the same for a given block. However, across all
blocks it is reduced, with the trade-off that when a loss does occur, more blocks are lost at once.
There were C(400,3) different combinations that could lead to data loss, and now there are
only 100 * C(4,3) = 400 that will.
This is a reduction in data loss probability of C(400,3)/(100*C(4,3)), a factor of ~26,500.
However, the time to replicate is now not 133 seconds but 26,667 to 40,000 seconds,
200 to 300 times longer. This increases the likelihood of such a 3-node failure window dramatically:
at _best_ the chance of such a condition rises by a factor of 40,000 (200^2).
And for blocks with higher replication it is even worse: 8,000,000+ (200^3) at replication = 4.
So, while this strategy will likely help with replication = 2 loss, and sometimes help replication
= 3, it is disastrous for replication = 4 or more.
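Putting the combinatorics and the TTR penalty together (a sketch; with these counts the reduction factor works out to roughly 26,500):

```python
from math import comb

# 3-node combinations that can lose a block:
fatal_random = comb(400, 3)        # random placement: any 3 of 400 nodes
fatal_groups = 100 * comb(4, 3)    # 100 disjoint groups of 4: 3 nodes within one group
print(fatal_random // fatal_groups)  # ~26,467x fewer fatal combinations

# But recovery stretches from ~133 seconds to ~26,667 seconds:
ttr_factor = 26667 / 133           # ~200x longer window
print(ttr_factor ** 2)             # ~40,000x higher chance per window (R = 3)
print(ttr_factor ** 3)             # ~8,000,000x for R = 4
```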
Furthermore, it complicates block placement, removal, and emergency replication significantly.
To have higher data safety, I feel that working on reducing TTR and preventing normal cluster
activity from causing bad placement and distribution is most important.
However, a disk-based strategy can be of use without impacting TTR. Disk failure, node failure,
and rack failure are all different beasts here.
> Intelligent block placement policy to decrease probability of block loss
> ------------------------------------------------------------------------
>
>                 Key: HDFS-1094
>                 URL: https://issues.apache.org/jira/browse/HDFS-1094
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: dhruba borthakur
>            Assignee: Rodrigo Schmidt
>        Attachments: prob.pdf, prob.pdf
>
>
> The current HDFS implementation specifies that the first replica is local and the other
> two replicas are on any two random nodes on a random remote rack. This means that if any three
> datanodes die together, then there is a non-trivial probability of losing at least one block
> in the cluster. This JIRA is to discuss if there is a better algorithm that can lower the
> probability of losing a block.

This message is automatically generated by JIRA.

You can reply to this email to add a comment to the issue online.
