hadoop-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eric Sammer <e...@lifeless.net>
Subject Re: Running Hadoop across data centers
Date Mon, 11 Jan 2010 18:15:35 GMT
On 1/11/10 6:10 AM, Antonio Goncalves wrote:
> Hi all,
> 
> I'm reading the excellent Tom White Hadoop guide and there is this sentence
> written :
> 
> "At the time of this writing, Hadoop is not suited for running across data
> centers"
> 
> Do you know if it's still the case ? Do you know what are the technical
> reasons behind that ? We are planning to test MapReduce in several
> scenarios, one being multiple data centers (one in France and the other one
> in Germany with special network lines). Wouldn't that be possible ?

If the network connection between said data centers is low latency
enough to not prevent the heartbeat connections, it should work, but
performance will almost definitely not be good.

There are few (major) concerns with doing this. The performance reason
is that the intermediary files produced by mappers could be potentially
shuffled between data centers. If this is a lot of data (likely) you'll
be killing your connections between facilities. This is one of the
primary reasons it's recommended against. Also, you'll (potentially)
have Hadoop (specifically HDFS) making data block replicas across data
centers which is going to be a lot of traffic.

>From a security perspective, Hadoop is assumed to be running in a
trusted environment. It's not hardened or designed to be exposed to
untrusted networks *at all*. If you did over a public network of any
kind, it would absolutely have to be over a VPN which is going to
further exacerbate the performance issues. You'll undoubtedly also want
to encrypt this communication if your data is remotely sensitive...

Another significant reason not to do this (right now) has to do with the
block assignment scheduling and the potential for getting yourself into
a situation where the VPN or inter-data center connection becomes a
significant scaling issue as well as a single point of failure for the
cluster.

I'm sure one could attempt to address these issues with very fancy
network topology scripts for HDFS, redundant data center interconnects,
and a ton of money for high speed connections, but I can't see it being
feasible in terms of cost or maintenance.

...but that's just my opinion. ;)

I'm certainly interested to hear if anyone is actually brave enough to
do this in production, though.

Hope this helps.
-- 
Eric Sammer
eric@lifeless.net
http://esammer.blogspot.com

Mime
View raw message