hadoop-hdfs-issues mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1094) Intelligent block placement policy to decrease probability of block loss
Date Sat, 10 Jul 2010 08:01:55 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887012#action_12887012 ]

Joydeep Sen Sarma commented on HDFS-1094:

(this is the closest i will get to writing a book. dedicated to you, Rodrigo)

what else is there to say? i have said it all:

- a node group is a fixed (at a given point in time) set of nodes
- all the node groups together span the entire cluster
- a given block is replicated only within member nodes of a single node group

this is the definition. there are no 'invalid' node groups. but node-groups have certain properties
that we know to be highly desirable:
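the definition above can be sketched in a few lines of python (all names here are illustrative, not actual hdfs classes):

```python
# sketch of the node-group invariant: node groups partition the cluster,
# and all replicas of a block stay inside one group. names are illustrative.
import random

node_groups = {
    "g0": ["n0", "n1", "n2", "n3"],
    "g1": ["n4", "n5", "n6", "n7"],
}
# every node belongs to exactly one group (groups are disjoint and span the cluster)
group_of = {n: g for g, nodes in node_groups.items() for n in nodes}

def choose_replica_targets(writer_node, replication=3):
    """first replica is local; the rest are random nodes from the same group."""
    group = node_groups[group_of[writer_node]]
    others = [n for n in group if n != writer_node]
    return [writer_node] + random.sample(others, replication - 1)

targets = choose_replica_targets("n5")
assert all(group_of[t] == "g1" for t in targets)
```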

- node groups should ideally, but not necessarily, be disjoint. the probability of data loss
grows with the number of node groups and grows even faster with the size of each node group
- so for a fixed cluster, smaller disjoint groups give the lowest loss probability. overlapping
node groups create an excessive number of node groups for the same group size and hence are
not desirable.

- should all node groups be of equal size?: ideally yes. this will reduce the loss probability
to a minimum (holding other factors constant)

- all node groups should span multiple racks: this is because we know that racks are a unit
of failure. so to accommodate a replication policy that spans racks - we need the node group
to span racks. how many racks it should span should ideally be configurable.

- similarly, we know that hdfs wants the writer to make one node-local and one rack-local
copy. that pretty much means that each node group needs at least 2 nodes on each rack it spans.

- because replication does not cross a node-group boundary - recovery from a disk/node failure
is constrained by the re-replication bandwidth available inside the node group, which grows
with the size of the group. recovery time is therefore inversely proportional to node-group
size: if the node-group is too small - then the time to recover (from a bad disk/node) goes
up. so one cannot have an extremely small node group (like a 2-node group).

    i cannot suggest the optimal size of a node-group without more data on this front. if
one increases MTTR - then the odds of data loss go up. others have said the same thing and
this is data storage 101. so unless someone plots the time to recover from a single node failure
(for a reasonably sized node) against the size of a node-group - it's hard to say what's too
small

    if re-replication bandwidth were infinite - then one could have really small node-groups.
this is what was done in the old days - all disks were mirrored in pairs. and disks were small
enough that a mirrored disk pair had a very small MTTR. of course, disks have grown roughly
1000-fold in size since then while their bandwidth hasn't increased much. so this strategy
no longer works
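a back-of-the-envelope python sketch of this trade-off - re-replication of a failed node's data is spread across the surviving group members, so recovery time shrinks as the group grows (data size and per-node bandwidth here are assumed numbers, purely illustrative):

```python
# toy model: time to re-replicate one failed node's data inside its group.
# the 10 TB / 1 Gbps figures are illustrative assumptions, not measurements.
def recovery_time_hours(group_size, data_per_node_tb=10.0,
                        rerep_bandwidth_gbps_per_node=1.0):
    # data on the failed node is re-replicated by the other members in parallel
    survivors = group_size - 1
    aggregate_gbps = survivors * rerep_bandwidth_gbps_per_node
    seconds = (data_per_node_tb * 8e3) / aggregate_gbps  # 1 TB = 8000 Gb
    return seconds / 3600

# a 2-node "mirrored pair" has only 1 survivor doing the work,
# so it recovers 9x slower than a 10-node group (9 survivors)
assert recovery_time_hours(2) / recovery_time_hours(10) == 9.0
```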

- it would be foolhardy to define node groups in a way that puts correlated units of failure
in the same node-group. this is pretty obvious - which is why choosing contiguous racks for
a node group may not be a good approach at all (given common rack-based rollout strategies
for new hardware, as well as likely correlation in temperature)

i would love to learn about how i am oversimplifying things. it's really simple combinatorics
for me:

- when considering simultaneous failures - the data loss probability will be dominated by
what happens when the number of failed nodes equals the degree of replication (say, 3).

- if one considers the 3-node failure scenario - to cause data loss, all 3 failed nodes must
fall in the same node-group. the number of ways one can do this is:
  a. directly proportional to the number of node groups (obviously)
  b. very strongly correlated with the size of the node group (it grows roughly as its cube)

as a simple example - if i were to divide 100 nodes into:
  * 10 groups of 10 each => 10 × C(10,3) = 1200
  * 5 groups of 20 each => 5 × C(20,3) = 5700

as you can see - a 2-fold reduction in group size leads to a nearly 5-fold reduction in the
number of ways of hitting failure (and hence in the odds of failure).
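the arithmetic above is easy to check in python:

```python
from math import comb

def loss_configurations(num_groups, group_size, replication=3):
    # number of ways `replication` simultaneous node failures can all land
    # inside one node group (the configurations that lose a block)
    return num_groups * comb(group_size, replication)

assert loss_configurations(10, 10) == 1200   # 10 x C(10,3) = 10 x 120
assert loss_configurations(5, 20) == 5700    # 5 x C(20,3) = 5 x 1140
# halving the group size cuts the loss configurations by 5700/1200 = 4.75x
assert loss_configurations(5, 20) / loss_configurations(10, 10) == 4.75
```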

- even if you consider higher numbers of simultaneous failures - similar combinatorics still apply.

i hope the arguments are crystal clear now - all else being equal - it's a bad idea to have
overlapping node-groups. we will be holding factor (b) constant - but raising factor (a).

beyond this - as i said - i don't have a precise layout strategy right now. the naivest approach
is what you said you may have already discussed. we don't have heterogeneous clusters. even if
we had - we may have (at worst) a mix of 1U and 2U nodes - one could just treat a 2U node
as 2 x 1U. so in the naive strategy, i would:

- treat the nodes and racks as a 2-D grid.

- random-hash the rackids into a different id space for purposes of labelling on this 2-D
grid (logically contiguous is not physically contiguous). one could do that with nodes as
well.

- divide the grid into NxM-sized disjoint node-groups (rectangular sub-grids), where N = racks
spanned and M = nodes per rack. the admin decides:
  * NxM based on re-replication bandwidth
  * N based on desired rack fault tolerance
  * N >= 2 (at least 3, i imagine), M >= 2
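a python sketch of this naive grid partitioning (all names and parameters are illustrative; for simplicity it assumes every rack has the same number of nodes, a multiple of M):

```python
# sketch: hash rackids into a logical order, lay the cluster out as a
# racks-x-nodes grid, and carve it into disjoint N x M node groups.
import hashlib

def logical_rack_order(rack_ids):
    # hashing decorrelates logical adjacency from physical adjacency,
    # so a node group rarely lands on physically contiguous racks
    return sorted(rack_ids, key=lambda r: hashlib.md5(r.encode()).hexdigest())

def build_node_groups(rack_to_nodes, n_racks=2, m_nodes=2):
    """carve the grid into disjoint groups spanning n_racks racks,
    m_nodes per rack (assumes equal, divisible rack sizes)."""
    racks = logical_rack_order(list(rack_to_nodes))
    nodes_per_rack = len(rack_to_nodes[racks[0]])
    groups = []
    for r in range(0, len(racks), n_racks):
        for c in range(0, nodes_per_rack, m_nodes):
            group = []
            for rack in racks[r:r + n_racks]:
                group.extend(rack_to_nodes[rack][c:c + m_nodes])
            groups.append(group)
    return groups

cluster = {f"rack{i}": [f"r{i}n{j}" for j in range(4)] for i in range(4)}
groups = build_node_groups(cluster, n_racks=2, m_nodes=2)
# 4 racks x 4 nodes = 16 nodes -> four disjoint 2x2 groups of 4 nodes each
assert len(groups) == 4 and all(len(g) == 4 for g in groups)
```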

- on rack addition - in the rackid space - the new rack will likely be inserted in the middle
of existing racks. in the naive strategy:
   - only newly written blocks would be written as per the revised node-group definitions
   - the system would have more than the minimum possible number of node-groups (it would
have the new groups as well as the old groups). but that's ok.

i think smarter strategies would differ primarily in how elegantly they handle rack
insertions. the other way to handle this is to have node groups not be hard-wired to
topology - but to have software generate node groups and their members. that makes it much
easier to adjust for reconfigurations - one can easily come up with protocols that remap
node-groups to make full use of a new rack while minimizing changes to existing groups. this
sort of control data is rarely mutated and no one dies if everyone's not in sync. so it's an
easy thing to share in a cluster.

> Intelligent block placement policy to decrease probability of block loss
> ------------------------------------------------------------------------
>                 Key: HDFS-1094
>                 URL: https://issues.apache.org/jira/browse/HDFS-1094
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: dhruba borthakur
>            Assignee: Rodrigo Schmidt
>         Attachments: prob.pdf, prob.pdf
> The current HDFS implementation specifies that the first replica is local and the other
two replicas are on any two random nodes on a random remote rack. This means that if any three
datanodes die together, then there is a non-trivial probability of losing at least one block
in the cluster. This JIRA is to discuss if there is a better algorithm that can lower probability
of losing a block.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
