hadoop-general mailing list archives

From "Day, Phil" <philip....@hp.com>
Subject RE: Running Hadoop across data centers
Date Mon, 11 Jan 2010 18:34:53 GMT
Aside from the "because it might be possible" - I'm curious as to why you would want to do
this?

With no real namenode resilience until 0.21 (and even then you'd have to be quite smart with
the data placement) I can't see a good availability reason for wanting to do this.

Is it because you want/need a cluster that you can't physically fit into one data centre?

I'd have thought that you might do better to split your data set into two, run identical jobs
in each DC, and then process the combined results on one of the clusters.


-----Original Message-----
From: Eric Sammer [mailto:eric@lifeless.net] 
Sent: 11 January 2010 18:16
To: general@hadoop.apache.org
Subject: Re: Running Hadoop across data centers

On 1/11/10 6:10 AM, Antonio Goncalves wrote:
> Hi all,
> I'm reading the excellent Tom White Hadoop guide and there is this sentence
> in it:
> "At the time of this writing, Hadoop is not suited for running across data
> centers"
> Do you know if that's still the case? Do you know what the technical
> reasons behind it are? We are planning to test MapReduce in several
> scenarios, one being multiple data centers (one in France and the other one
> in Germany with dedicated network lines). Wouldn't that be possible?

If the network connection between said data centers has low enough latency
that the heartbeat connections don't time out, it should work, but
performance will almost certainly not be good.

There are a few (major) concerns with doing this. On the performance side,
the intermediate files produced by mappers could potentially be
shuffled between data centers. If this is a lot of data (likely), you'll
be killing your connections between facilities. This is one of the
primary reasons it's recommended against. Also, you'll (potentially)
have Hadoop (specifically HDFS) making data block replicas across data
centers, which is going to be a lot of traffic.
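(For what it's worth, HDFS's replica placement is driven by its rack-awareness mechanism, which you point at a topology script via configuration. A minimal sketch of the relevant property as it was named in the 0.20-era core-site.xml - the script path here is an invented example, so check your version's docs:

```xml
<!-- core-site.xml: tell Hadoop which script maps each datanode
     host/IP to a network path such as /dc-france/rack1. -->
<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/topology.py</value>
</property>
```

Without such a script, every node lands in /default-rack and HDFS has no idea two "racks" are actually separate countries.)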

From a security perspective, Hadoop is assumed to be running in a
trusted environment. It's not hardened or designed to be exposed to
untrusted networks *at all*. If you did this over a public network of any
kind, it would absolutely have to be over a VPN, which is going to
further exacerbate the performance issues. You'll undoubtedly also want
to encrypt this communication if your data is remotely sensitive...

Another significant reason not to do this (right now) has to do with
block assignment scheduling and the potential for getting yourself into
a situation where the VPN or inter-data center connection becomes a
significant scaling issue as well as a single point of failure for the
whole cluster.

I'm sure one could attempt to address these issues with very fancy
network topology scripts for HDFS, redundant data center interconnects,
and a ton of money for high speed connections, but I can't see it being
feasible in terms of cost or maintenance.
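(To give a sense of what "fancy network topology scripts" means here: a topology script is just an executable that takes host names or IPs as arguments and prints one network path per argument. A minimal sketch in Python - the subnet prefixes and path names are invented assumptions for illustration, not anything Hadoop ships with:

```python
#!/usr/bin/env python
# Hypothetical HDFS topology script sketch. Hadoop invokes the
# configured script with one or more host/IP arguments and expects
# one rack path per argument on stdout, in order.
import sys

# Invented example mapping: subnet prefix -> /datacenter/rack path.
SUBNET_TO_PATH = {
    "10.1.": "/dc-france/rack1",
    "10.2.": "/dc-germany/rack1",
}
DEFAULT_PATH = "/default-rack"

def resolve(host):
    # Return the path for the first matching subnet prefix,
    # falling back to the default rack for unknown hosts.
    for prefix, path in SUBNET_TO_PATH.items():
        if host.startswith(prefix):
            return path
    return DEFAULT_PATH

if __name__ == "__main__":
    for arg in sys.argv[1:]:
        print(resolve(arg))
```

Even with something like this telling HDFS which nodes live in which facility, you've only influenced replica placement; the shuffle and heartbeat traffic problems above remain.)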

...but that's just my opinion. ;)

I'm certainly interested to hear if anyone is actually brave enough to
run this in production, though.

Hope this helps.
Eric Sammer
