hadoop-general mailing list archives

From Eric Sammer <e...@lifeless.net>
Subject Re: Running Hadoop across data centers
Date Tue, 12 Jan 2010 16:03:25 GMT
On 1/12/10 6:01 AM, Antonio Goncalves wrote:
> Thanks Eric and Phil for your inputs.
> 
> We have 80% of our calculation that can be done in one datacenter, but the
> rest is heavy calculation. We are using some time-consuming algorithms (such
> as Monte Carlo) that would take too much time in one datacenter.
> For this kind of computation we are thinking of using the second datacenter
> based in Germany. We haven't done all the study about the data, but I guess
> that for the 80% the data will be local to one datacenter, and for the 20%
> it would have to be distributed across datacenters. What we haven't worked on
> yet is the size of this distributed data. It looks like it would not be that
> big (maybe less than 1 TB, but it could grow when doing some calculations
> based on archived data).

Antonio:

The point here is that if you build one logical Hadoop cluster across
two data centers, then Hadoop will consider all nodes as candidates for
receiving work regardless of which data center the job is started in.
This means that there are no guarantees about how much data will be
transferred between data centers during the shuffle phase. The same is
true for the HDFS layer: there's no promise that blocks won't be
copied between data centers for replication; in fact, it's likely they
will be.
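
To put a rough number on the shuffle part, here is a back-of-the-envelope
sketch in Python. It assumes, purely for illustration, that map and reduce
tasks are spread uniformly over all nodes and that every map sends an equal
share of its output to every reducer. Real scheduling, data locality, and key
skew will change the figure. The function name and node counts are made up
for the example.

```python
def cross_dc_shuffle_fraction(nodes_a: int, nodes_b: int) -> float:
    """Expected fraction of shuffle traffic that crosses data centers.

    Assumption (a simplification, not Hadoop's actual scheduler): a map
    task runs in data center A with probability p = nodes_a / total, and
    so does a reduce task, independently of each other.
    """
    total = nodes_a + nodes_b
    if nodes_a < 0 or nodes_b < 0 or total == 0:
        raise ValueError("node counts must be non-negative and not both zero")
    p = nodes_a / total
    # Traffic crosses the link when map and reduce sit in different
    # data centers: A -> B with p * (1 - p), B -> A with (1 - p) * p.
    return 2 * p * (1 - p)


if __name__ == "__main__":
    # Hypothetical cluster: 30 nodes in data center A, 10 in B.
    print(cross_dc_shuffle_fraction(30, 10))
```

Under this model, two equal-sized data centers put about half of all shuffle
traffic on the link between them, which is why a single stretched cluster is
hard to reason about.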

If you want to make sure compute-intensive jobs run in data center A,
you'll probably have better luck creating two logical Hadoop clusters,
each confined to a data center, and then simply making sure the proper
data sets are available in each Hadoop cluster. This way, there will be
no unbounded, uncontrolled data transfer between data centers and job
performance will not suffer.

The downside is that you may have to copy data from one data center to
another if the data for a job running in data center A is produced in data
center B.
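
That kind of copy between clusters is usually done with distcp, which ships
with Hadoop. The Python sketch below only builds the command line and gives a
rough estimate of how long the transfer would take. The NameNode host names,
port, paths, and link speed are all hypothetical placeholders, and the
estimate ignores protocol overhead unless you lower the efficiency factor.

```python
from typing import List


def distcp_command(src_uri: str, dst_uri: str) -> List[str]:
    """Build the argv for a basic 'hadoop distcp <src> <dst>' invocation."""
    return ["hadoop", "distcp", src_uri, dst_uri]


def transfer_hours(size_bytes: float, link_bits_per_sec: float,
                   efficiency: float = 1.0) -> float:
    """Rough wall-clock hours to move size_bytes over a link.

    efficiency is an assumed fraction (0 < efficiency <= 1) of the nominal
    link speed that the copy actually achieves.
    """
    if size_bytes < 0 or link_bits_per_sec <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid size, link speed, or efficiency")
    seconds = size_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical NameNodes in data centers B (source) and A (destination).
    cmd = distcp_command("hdfs://nn-b.example.com:8020/data/input",
                         "hdfs://nn-a.example.com:8020/data/input")
    print(" ".join(cmd))
    # About 1 TB over a hypothetical 1 Gbit/s link at full efficiency.
    print(round(transfer_hours(1e12, 1e9), 2), "hours")
```

The point of the estimate is that the transfer is bounded and planned: you
know what is being copied and when, instead of leaving it to the scheduler.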

I would definitely watch the Cloudera Hadoop MapReduce and HDFS training
video[1] and look over the information on the Hadoop wiki and web site
prior to doing anything at all. It may clear a few things up.

Also, I highly recommend the book Hadoop: The Definitive Guide by Tom
White / O'Reilly[2] if you haven't read it. It has tons of excellent
information about the MapReduce process and how it works under the hood.

[1] - http://www.cloudera.com/hadoop-training-mapreduce-hdfs
[2] - http://oreilly.com/catalog/9780596521981

-- 
Eric Sammer
eric@lifeless.net
http://esammer.blogspot.com
