hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: Hadoop Cluster Multi-datacenter
Date Tue, 07 Jun 2011 12:57:29 GMT

On Jun 7, 2011, at 12:07 AM, sanjeev.taran@us.pwc.com wrote:

> Hello,
> 
> I wanted to know if anyone has any tips or tutorials on how to install a
> Hadoop cluster across multiple datacenters.
> 

Generally, this is a bad idea.  Why?
1) Inter-datacenter bandwidth is expensive compared to bandwidth within a cluster.
2) This extra topological constraint is not currently well-modeled in the Hadoop architecture.
This means that you will likely find assumptions in the software that do not hold in the
inter-datacenter case.
3) None of the biggest users currently do this.  Unless you plan on putting serious money into
the effort, stick with what is well-established to work.
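
To put a rough shape on point 1, here is a back-of-envelope sketch in Python.  The link speeds
are assumptions picked for illustration, not measurements from any real site, and
transfer_seconds is a hypothetical helper.  The 64 MB figure is the default HDFS block size.

```python
def transfer_seconds(size_bytes: float, link_bits_per_sec: float) -> float:
    """Ideal time to push size_bytes over a link, ignoring latency and protocol overhead."""
    return size_bytes * 8 / link_bits_per_sec


BLOCK_BYTES = 64 * 1024 * 1024   # one HDFS block at the default 64 MB block size
LAN_BPS = 1e9                    # assumed: 1 Gbit/s between nodes in one cluster
WAN_BPS = 100e6                  # assumed: 100 Mbit/s share of an inter-datacenter link

lan = transfer_seconds(BLOCK_BYTES, LAN_BPS)
wan = transfer_seconds(BLOCK_BYTES, WAN_BPS)
print(f"one block inside the cluster:  {lan:.2f} s")
print(f"one block between datacenters: {wan:.2f} s")
```

Every block replica and every non-local read that crosses the slower link pays that difference,
and the software has no way to know which link is the expensive one.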

I would note that, in my other life, I work with a batch-oriented distributed computing system
called Condor (http://www.cs.wisc.edu/condor/).  Condor is designed to naturally span the
globe (I've seen it spanning around 50 clusters).  However, it is batch-job oriented, not
data oriented.  If you would have to wedge your problem to fit into the MapReduce paradigm
anyway, Condor might be a good alternative.

> Do you need ssh connectivity between the nodes across these data centers?
> 


Definitely not.  SSH is only used in the wrapper scripts to start the HDFS daemons.  It's
a usability crutch for smaller clusters that don't have proper management.
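
As a minimal sketch of what those wrapper scripts amount to: read a list of hosts, then run
the per-node daemon script on each one over ssh.  The Python below only builds the commands
and does not run them.  build_start_commands, the host names, and /opt/hadoop are
hypothetical; bin/hadoop-daemon.sh is the per-node script the stock wrappers invoke.

```python
import shlex


def build_start_commands(slaves_text, hadoop_home, daemon):
    """Return one ssh argv per host listed in a slaves-style file.

    Blank lines and '#' comments are skipped, as in the stock conf/slaves file.
    """
    hosts = [line.split("#", 1)[0].strip() for line in slaves_text.splitlines()]
    hosts = [h for h in hosts if h]
    remote = (
        f"cd {shlex.quote(hadoop_home)} && "
        f"bin/hadoop-daemon.sh start {shlex.quote(daemon)}"
    )
    return [["ssh", host, remote] for host in hosts]


# Hypothetical host list and install path, for illustration only.
for argv in build_start_commands("node1\n# spare\nnode2\n", "/opt/hadoop", "datanode"):
    print(" ".join(argv))
```

Once a daemon is up it talks to the NameNode over Hadoop's own RPC, so anything that can start
the process on each node (init scripts, a configuration management system) replaces the ssh
loop entirely.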

If your ops folks don't have a better way to manage what is running on your cluster, fire
them.

Brian