hadoop-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Antonio Goncalves <antonio.mail...@gmail.com>
Subject Re: Running Hadoop across data centers
Date Tue, 12 Jan 2010 11:01:50 GMT
Thanks Eric and Phil for your inputs.

We have 80% of our calculation that can be done in one datacenter, but the
rest is heavy calculation. We are using some time consuming algorithm (such
as Monte Carlo for example) that would take too much time in one datacenter.
For this kind of computation we are thinking of using the second datacenter
based in Germany. We haven't done all the study about the data, but I guess
that for the 80% the data will be local to one datacenter, and for the 20%
it would have to be distributed across datacenter. What we haven't worked on
yet, is the size of this distributed data. If looks it would not be that big
(maybe less than a 1Tb but it could grow when doing some calculation based
on archived data).


2010/1/11 Day, Phil <philip.day@hp.com>

> Aside from the "Because it might be possible" - I'm curious as to why would
> you want to do this ?
> With no real namenode resilience until 0.21 (and even then you'd have to be
> quiet smart with the data placement) I can't see a good availability reason
> for wanting to do this.
> Is it because you want/need a cluster that you can't physically fit into
> one data centre ?
> I'd of thought that you might do better to split your data set into two,
> run identical jobs in each DC and then process the combined results on one
> of the clusters.
> Phil
> -----Original Message-----
> From: Eric Sammer [mailto:eric@lifeless.net]
> Sent: 11 January 2010 18:16
> To: general@hadoop.apache.org
> Subject: Re: Running Hadoop across data centers
> On 1/11/10 6:10 AM, Antonio Goncalves wrote:
> > Hi all,
> >
> > I'm reading the excellent Tom White Hadoop guide and there is this
> sentence
> > written :
> >
> > "At the time of this writing, Hadoop is not suited for running across
> data
> > centers"
> >
> > Do you know if it's still the case ? Do you know what are the technical
> > reasons behind that ? We are planning to test MapReduce in several
> > scenarios, one being multiple data centers (one in France and the other
> one
> > in Germany with special network lines). Wouldn't that be possible ?
> If the network connection between said data centers is low latency
> enough to not prevent the heartbeat connections, it should work, but
> performance will almost definitely not be good.
> There are few (major) concerns with doing this. The performance reason
> is that the intermediary files produced by mappers could be potentially
> shuffled between data centers. If this is a lot of data (likely) you'll
> be killing your connections between facilities. This is one of the
> primary reasons it's recommended against. Also, you'll (potentially)
> have Hadoop (specifically HDFS) making data block replicas across data
> centers which is going to be a lot of traffic.
> From a security perspective, Hadoop is assumed to be running in a
> trusted environment. It's not hardened or designed to be exposed to
> untrusted networks *at all*. If you did over a public network of any
> kind, it would absolutely have to be over a VPN which is going to
> further exacerbate the performance issues. You'll undoubtedly also want
> to encrypt this communication if your data is remotely sensitive...
> Another significant reason not to do this (right now) has to do with the
> block assignment scheduling and the potential for getting yourself into
> a situation where the VPN or inter-data center connection becomes a
> significant scaling issue as well as a single point of failure for the
> cluster.
> I'm sure one could attempt to address these issues with very fancy
> network topology scripts for HDFS, redundant data center interconnects,
> and a ton of money for high speed connections, but I can't see it being
> feasible in terms of cost or maintenance.
> ...but that's just my opinion. ;)
> I'm certainly interested to hear if anyone is actually brave enough to
> do this in production, though.
> Hope this helps.
> --
> Eric Sammer
> eric@lifeless.net
> http://esammer.blogspot.com

Antonio Goncalves (antonio.goncalves@gmail.com)
Software architect

Web site : www.antoniogoncalves.org
Blog: agoncal.wordpress.com
Feed: feeds2.feedburner.com/AntonioGoncalves
Paris JUG leader : www.parisjug.org
LinkedIn: www.linkedin.com/in/agoncal

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message