hadoop-common-dev mailing list archives

From "Sameer Paranjpye (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-692) Rack-aware Replica Placement
Date Tue, 14 Nov 2006 23:49:38 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-692?page=comments#action_12449865 ] 
Sameer Paranjpye commented on HADOOP-692:

> Should we consider a mobile network ...

No, that is not the concern. The concern is that in a large enough installation (1000s of
nodes) there are going to be several instances a week of nodes going down and getting repaired.
When the nodes return, they may not be in the same racks or on the same switches. This probably
does not happen by design, but mistakes are always possible.

> I would simply update the network topology when a datanode registers or exits.

Certainly, that seems like the right approach. The question is how the topology is updated.
Is it by updating a central configuration file and having the namenode read it? This implies
potentially updating configuration every time anything changes in a datacenter.

Is it by running timing experiments every time a datanode registers? This can be biased by
transient network conditions.

Neither of the above seem like a productive use of admin or namenode time.

Instead of a network topology interface, why not have a network location interface on the
Datanode?

public interface NetworkLocation {
  String[] getLocation();
}
This returns an array of hubs that characterize a node's location on the network. This is probably
an array of strings of the form <key>=<value>. So a possible output could be:

rack=r1, switch=s2, datacenter=d3, ... 

as many levels as are desirable. Also, network distances between nodes aren't the only things
that are interesting. I think it's useful to distinguish between a rack and a switch because
a rack is commonly a physical power domain.
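As a sketch, a trivial datanode-side implementation might just parse a configured location string into the hub array (the class name and config format here are illustrative assumptions, not an existing Hadoop API):

```java
// Hypothetical sketch; ConfiguredLocation is an illustrative name,
// not an existing Hadoop class.
interface NetworkLocation {
  String[] getLocation();
}

class ConfiguredLocation implements NetworkLocation {
  private final String[] hubs;

  // Parses a configured string such as "rack=r1, switch=s2, datacenter=d3"
  // into an array of <key>=<value> hubs, most specific first.
  ConfiguredLocation(String configValue) {
    this.hubs = configValue.split("\\s*,\\s*");
  }

  public String[] getLocation() {
    return hubs;
  }
}
```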

Given this output from each Datanode we can then have a concrete implementation of NetworkTopology
that simply tracks the membership of each hub. Finding the distance between two nodes is done
by comparing their arrays of hubs and stopping where they differ.
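The comparison could be sketched as follows (an illustrative helper, not actual Hadoop code; it assumes both arrays list hubs from most specific, e.g. rack, to least specific, e.g. datacenter):

```java
// Hypothetical sketch of the hub-array comparison described above;
// the class and method names are illustrative, not a Hadoop API.
public class NetworkDistance {
  // Both arrays list hubs from most specific to least specific,
  // e.g. {"rack=r1", "switch=s2", "datacenter=d3"}.
  public static int distance(String[] a, String[] b) {
    int i = a.length - 1;
    int j = b.length - 1;
    int shared = 0;
    // Walk inward from the broadest hub, counting the levels shared.
    while (i >= 0 && j >= 0 && a[i].equals(b[j])) {
      shared++;
      i--;
      j--;
    }
    // The distance is the number of hub levels at which the nodes differ.
    return Math.max(a.length, b.length) - shared;
  }
}
```

Two nodes in the same rack get distance 0; nodes in different racks behind the same switch get distance 1, and so on.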

> Allocating all blocks of a file to the same 3 racks limits the aggregate read bandwidth.

It limits the aggregate read bandwidth to that particular file only. Every file will have
its own set of 3 racks, so I don't think it affects overall filesystem bandwidth. On the
other hand, it potentially gives a client that is reading an entire file better locality.

> Users may specify their own replica placement policy

I agree with Doug. This seems like a subsequent feature.

> Rack-aware Replica Placement
> ----------------------------
>                 Key: HADOOP-692
>                 URL: http://issues.apache.org/jira/browse/HADOOP-692
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.8.0
>            Reporter: Hairong Kuang
>         Assigned To: Hairong Kuang
>             Fix For: 0.9.0
> This issue assumes that HDFS runs on a cluster of computers spread across many racks.
> Communication between two nodes on different racks needs to go through switches. Bandwidth
> in/out of a rack may be less than the total bandwidth of the machines in the rack. The purpose
> of rack-aware replica placement is to improve data reliability, availability, and network
> bandwidth utilization. The basic idea is that each data node determines which rack it belongs
> to at startup and notifies the name node of its rack id upon registration. The name
> node maintains a rackid-to-datanode map and tries to place replicas across racks.
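The rackid-to-datanode map described in the issue could look roughly like this (an illustrative sketch only, with invented names; not the actual HDFS implementation):

```java
import java.util.*;

// Illustrative sketch, not HDFS code: the namenode tracks which
// datanodes live on which rack and prefers replica targets on
// distinct racks.
class RackMap {
  private final Map<String, List<String>> rackToNodes =
      new LinkedHashMap<String, List<String>>();

  // Called when a datanode registers and reports its rack id.
  void register(String node, String rackId) {
    List<String> nodes = rackToNodes.get(rackId);
    if (nodes == null) {
      nodes = new ArrayList<String>();
      rackToNodes.put(rackId, nodes);
    }
    nodes.add(node);
  }

  // Choose up to 'replicas' targets, one per rack where possible.
  List<String> chooseTargets(int replicas) {
    List<String> targets = new ArrayList<String>();
    for (List<String> nodes : rackToNodes.values()) {
      if (targets.size() == replicas) {
        break;
      }
      targets.add(nodes.get(0));
    }
    return targets;
  }
}
```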

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

