hadoop-general mailing list archives

From Andrey Kuzmin <andrey.v.kuz...@gmail.com>
Subject Re: Running Hadoop across data centers
Date Wed, 13 Jan 2010 19:09:05 GMT
On Tue, Jan 12, 2010 at 7:03 PM, Eric Sammer <eric@lifeless.net> wrote:
> On 1/12/10 6:01 AM, Antonio Goncalves wrote:
>> Thanks Eric and Phil for your inputs.
>>
>> About 80% of our calculation can be done in one datacenter, but the
>> rest is heavy computation. We use some time-consuming algorithms
>> (Monte Carlo, for example) that would take too long in a single
>> datacenter, so for that kind of computation we are thinking of using
>> the second datacenter, based in Germany. We haven't fully studied the
>> data yet, but I guess that for the 80% the data will be local to one
>> datacenter, and for the 20% it will have to be distributed across
>> datacenters. What we haven't worked out yet is the size of this
>> distributed data. It looks like it would not be that big (maybe less
>> than 1 TB, but it could grow when doing calculations based on
>> archived data).
>
> Antonio:
>
> The point here is that if you build one logical Hadoop cluster across
> two data centers, then Hadoop will consider all nodes as candidates for
> receiving work regardless of which data center the job is started in.

Is Hadoop's job scheduler totally topology-unaware? The question holds
both for the cross-datacenter scenario being discussed and for a single
datacenter: just consider the usual within-rack vs. between-racks
scheduling decision.
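
(For reference, the one topology hook I know of is the user-supplied
rack script that HDFS and the scheduler consult for block placement and
task locality. A minimal sketch, using the property name from the
0.20 line; the script path, subnets, and rack names below are purely
hypothetical:

  <!-- core-site.xml -->
  <property>
    <name>topology.script.file.name</name>
    <value>/etc/hadoop/topology.sh</value>
  </property>

  #!/bin/sh
  # Hypothetical /etc/hadoop/topology.sh: Hadoop passes one or more
  # hostnames/IPs as arguments and expects one rack path per argument
  # on stdout, e.g. /dc-paris/rack1.
  for host in "$@"; do
    case "$host" in
      10.1.*) echo /dc-paris/rack1 ;;
      10.2.*) echo /dc-berlin/rack1 ;;
      *)      echo /default-rack ;;
    esac
  done

So it is rack-aware in that sense, but I don't know how much further
the scheduler goes beyond preferring node-local and rack-local splits.)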

Regards,
Andrey

> This means there are no guarantees about how much data will be moved
> between data centers during the shuffle phase. The same is true at the
> HDFS layer: there is no promise that data won't be shipped between
> data centers for replication; in fact, it's likely it will be.
>
> If you want to make sure compute intensive jobs run in data center A
> you'll probably have better luck creating two logical Hadoop clusters,
> each confined to a data center, and then simply making sure the proper
> data sets are available in each Hadoop cluster. This way, there will be
> no unbounded, uncontrolled data transfer between data centers and job
> performance will not suffer.
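>
> As a minimal sketch of what the two client configurations might look
> like (0.20-era property names; all hostnames here are hypothetical):
>
>   <!-- conf-dc-a/core-site.xml -->
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://namenode-a.example.com:8020</value>
>   </property>
>
>   <!-- conf-dc-a/mapred-site.xml -->
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>jobtracker-a.example.com:8021</value>
>   </property>
>
> You would then submit compute-heavy jobs explicitly against cluster A
> with something like: hadoop --config conf-dc-a jar my-job.jar ...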
>
> The downside is that you may have to copy data from one data center to
> another if the data for a job running in data center A is produced in
> data center B.
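>
> The standard tool for that copy is distcp, which runs the transfer as
> a MapReduce job. A sketch, with hypothetical hosts and paths:
>
>   hadoop distcp hdfs://namenode-b.example.com:8020/data/input \
>                 hdfs://namenode-a.example.com:8020/data/input
>
> That transfer is at least explicit and scheduled, rather than
> happening implicitly during replication or the shuffle.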
>
> I would definitely watch the Cloudera Hadoop MapReduce and HDFS training
> video[1] and look over the information on the Hadoop wiki and web site
> prior to doing anything at all. It may clear a few things up.
>
> Also, I highly recommend the book Hadoop: The Definitive Guide by Tom
> White / O'Reilly[2] if you haven't read it. It has tons of excellent
> information about the MapReduce process and how it works under the hood.
>
> [1] - http://www.cloudera.com/hadoop-training-mapreduce-hdfs
> [2] - http://oreilly.com/catalog/9780596521981
>
> --
> Eric Sammer
> eric@lifeless.net
> http://esammer.blogspot.com
>
