[ https://issues.apache.org/jira/browse/HDFS-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=12887012#action_12887012
]
Joydeep Sen Sarma commented on HDFS-1094:

(this is the closest i will get to writing a book. dedicated to you, Rodrigo)
what else is there to say? i have said it all:
- a node group is a fixed (at a given point in time) set of nodes
- all the node groups together span the entire cluster
- a given block is replicated only within the member nodes of a single node group
this is the definition. there are no 'invalid' node groups. but node groups have certain
properties that we know to be highly desirable:
- node groups should ideally, but not necessarily, be disjoint. the probability of data loss
is positively correlated with the number of node groups and inversely correlated with the size
of the node groups. overlapping node groups produce an excessive number of node groups for the
same group size and hence are not desirable.
- should all node groups be of equal size? ideally yes. this reduces the loss probability
to a minimum (holding other factors constant).
- all node groups should span multiple racks: we know that racks are a unit of failure,
so to accommodate a replication policy that spans racks, the node group must span racks.
how many racks it spans should ideally be configurable.
- similarly, we know that hdfs wants the writer to make one node-local and one rack-local
copy. that pretty much means that each node group needs at least 2 nodes on each rack it spans.
- because replication does not cross a node group boundary, recovery from a disk/node failure
is constrained by the re-replication bandwidth available inside the node group. this bandwidth
is proportional to the size of the node group. if the node group is too small, the time
to recovery (from a bad disk/node) goes up, so one cannot have an extremely small node group
(like a 2-node group).
i cannot suggest the optimal size of a node group without more data on this front. if
MTTR goes up, the odds of data loss go up. others have said the same thing and
this is data storage 101. so unless someone plots the time to recover from a single node failure
(for a reasonably sized node) vs. the size of the node group, it's hard to say what's big
enough.
if re-replication bandwidth were infinite, one could have really small node groups.
this is what was done in the old days: all disks were mirrored in pairs, and disks were small
enough that a mirrored disk pair had a very small MTTR. of course, disks have grown 1000-fold
in size since then and their bandwidth hasn't increased much, so this strategy doesn't work
anymore.
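to make the bandwidth argument concrete, here is a back-of-envelope sketch. all the numbers (12 TB per node, 50 MB/s of spare re-replication bandwidth per surviving node) are illustrative assumptions, not measurements from any real cluster:

```python
# Back-of-envelope MTTR for re-replicating one failed node's data.
# Assumption: surviving group members share the work evenly, so the
# aggregate re-replication bandwidth scales with (group_size - 1).

def recovery_time_hours(data_per_node_tb: float,
                        spare_mb_s_per_node: float,
                        group_size: int) -> float:
    """Hours to re-replicate one node's data inside its node group."""
    data_mb = data_per_node_tb * 1024 * 1024
    aggregate_mb_s = spare_mb_s_per_node * (group_size - 1)
    return data_mb / aggregate_mb_s / 3600

# 12 TB per node, 50 MB/s spare bandwidth per surviving node:
for g in (2, 5, 10, 20):
    print(g, round(recovery_time_hours(12, 50, g), 1))
# a 2-node group takes ~70 hours to recover; a 20-node group, under 4
```

the cubic dependence from the next section pulls the other way, which is exactly the tension: small groups lose fewer ways, but recover slower.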
- it would be foolhardy to define node groups in a way that puts correlated units of failure
in the same node group. this is pretty obvious, which is why choosing contiguous racks for
a node group may not be a good approach at all (given common rack-based rollout strategies
for new hardware, as well as likely correlation in temperature).
i would love to learn how i am oversimplifying things. it's really simple combinatorics
for me:
- when considering simultaneous failures, the data loss probability will be dominated by
what happens when (number of failed nodes) = (degree of replication) = 3 (say).
- for a 3-node failure to cause data loss, all 3 failed nodes must lie in the same
node group. the number of ways one can do this is:
a. directly proportional to the number of node groups (obviously)
b. very strongly (roughly cubically, via C(size, 3)) increasing with the size of the node group
as a simple example, if i were to divide 100 nodes into:
* 10 groups of 10 each => 10 * C(10,3) = 1200
* 5 groups of 20 each => 5 * C(20,3) = 5700
as you can see, a 2-fold reduction in group size leads to a nearly 5-fold reduction in the number
of ways of hitting failure (and hence in the odds of failure).
- even if you consider higher numbers of failures, similar combinatorics still apply.
i hope the arguments are crystal clear now: all else being equal, it's a bad idea to have
overlapping node groups. we would be holding factor (b) constant while raising factor (a).
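the counting above is easy to check in a few lines of python (`loss_ways` is just a name i am using here; the arithmetic is straight from the argument above):

```python
from math import comb

def loss_ways(num_groups: int, group_size: int, failures: int = 3) -> int:
    """Number of ways `failures` simultaneous node deaths all land in
    one node group, which is the only way to lose a block when
    replication never crosses a node-group boundary."""
    return num_groups * comb(group_size, failures)

# 100 nodes, replication factor 3:
print(loss_ways(10, 10))   # 10 * C(10,3) = 1200
print(loss_ways(5, 20))    # 5  * C(20,3) = 5700
# overlapping groups: same group size, more groups, strictly worse:
print(loss_ways(12, 10))   # 1440 > 1200
```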
beyond this, as i said, i don't have a precise layout strategy right now. the naivest approach
is what you said you may have already discussed. we don't have heterogeneous clusters. even if
we did, we may have (at worst) a mix of 1U and 2U nodes, and one could just consider a 2U
node to be = 2 x 1U. so in the naive strategy, i would:
- treat the nodes and racks as a 2D grid.
- randomly hash the rack ids into a different id space for purposes of labelling on this 2D
grid (so logically contiguous is not physically contiguous). one could do the same with node
ids as well.
- divide the grid into NxM-sized disjoint node groups (rectangular subgrids), where N = racks
and M = nodes per rack. the admin decides:
* NxM based on re-replication bandwidth
* N based on desired rack fault tolerance
* N >= 2 (3 at least, i imagine), M >= 2
- on rack addition, the new rack will likely land in the middle of the existing racks in
the rack-id space. in the naive strategy:
- only newly written blocks would be placed as per the revised node group definitions
- the system would have more than the minimum possible number of node groups (it would
have the new groups as well as the old groups). but that's ok.
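a minimal sketch of the naive grid strategy, assuming nodes_per_rack is divisible by M and using md5 purely as a stable shuffle of rack ids. all names and shapes here are illustrative, not HDFS code:

```python
import hashlib

def nodegroups(racks, nodes_per_rack, n, m):
    """Partition a racks x nodes 2D grid into disjoint N x M node groups.

    `racks` is a list of rack ids; each rack holds `nodes_per_rack` node
    slots. Rack ids are ordered by a hash so that logically adjacent
    groups do not sit on physically contiguous racks.
    Assumes nodes_per_rack % m == 0.
    """
    # stable pseudo-random ordering of racks
    shuffled = sorted(racks,
                      key=lambda r: hashlib.md5(str(r).encode()).hexdigest())
    groups = []
    for r0 in range(0, len(shuffled), n):
        rack_band = shuffled[r0:r0 + n]
        if len(rack_band) < n:
            break  # leftover racks would need a separate policy
        for s0 in range(0, nodes_per_rack, m):
            groups.append([(rack, slot)
                           for rack in rack_band
                           for slot in range(s0, s0 + m)])
    return groups

gs = nodegroups(racks=list(range(8)), nodes_per_rack=4, n=2, m=2)
# 8 racks x 4 slots = 32 nodes -> 8 disjoint groups of 2x2 = 4 nodes each
print(len(gs), len(gs[0]))
```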
i think smarter strategies would differ primarily in how elegantly they handle rack
insertions. the other way to handle this stuff is to not hard-wire node groups to
topology, but just have software generate node groups and their members. it's much easier
to adjust for reconfigurations: one can easily come up with protocols that remap node groups
to make full use of a new rack while minimizing changes to existing groups. this sort of control
data is rarely mutated, and no one dies if everyone's not in sync, so it's an easy thing to
share in a cluster.
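as one illustration of such a software-generated approach (purely a sketch of the idea, not anything hdfs implements), here is a greedy remap that touches only the groups actually receiving new nodes, leaving all other groups unchanged:

```python
import heapq

def absorb_new_rack(groups, new_nodes, max_size):
    """Fold nodes from a newly added rack into existing node groups.

    Each new node goes into the smallest existing group that still has
    room; overflow starts fresh groups. Only the receiving groups
    change, so churn to the existing membership is minimal.
    Illustrative protocol only; ignores the rack-spanning constraint.
    """
    heap = [(len(g), i) for i, g in enumerate(groups)]
    heapq.heapify(heap)
    fresh = []
    for node in new_nodes:
        # discard groups that are already full
        while heap and heap[0][0] >= max_size:
            heapq.heappop(heap)
        if heap:
            size, i = heapq.heappop(heap)
            groups[i].append(node)
            heapq.heappush(heap, (size + 1, i))
        else:
            if not fresh or len(fresh[-1]) >= max_size:
                fresh.append([])
            fresh[-1].append(node)
    return groups + fresh
```

a real protocol would also keep each group spanning its minimum number of racks; the point is only that group membership is cheap, rarely-mutated control data.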
> Intelligent block placement policy to decrease probability of block loss
> 
>
> Key: HDFS-1094
> URL: https://issues.apache.org/jira/browse/HDFS-1094
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: dhruba borthakur
> Assignee: Rodrigo Schmidt
> Attachments: prob.pdf, prob.pdf
>
>
> The current HDFS implementation specifies that the first replica is local and the other
> two replicas are on two random nodes on a random remote rack. This means that if any three
> datanodes die together, there is a non-trivial probability of losing at least one block
> in the cluster. This JIRA is to discuss whether there is a better algorithm that can lower
> the probability of losing a block.

