hadoop-general mailing list archives

From Eric Sammer <e...@lifeless.net>
Subject Re: Running Hadoop across data centers
Date Wed, 13 Jan 2010 19:31:06 GMT
On 1/13/10 2:09 PM, Andrey Kuzmin wrote:
> On Tue, Jan 12, 2010 at 7:03 PM, Eric Sammer <eric@lifeless.net> wrote:
>> On 1/12/10 6:01 AM, Antonio Goncalves wrote:
>>> Thanks Eric and Phil for your inputs.
>>>
>>> About 80% of our calculation can be done in one datacenter, but the
>>> rest is heavy calculation. We are using some time-consuming algorithms (such
>>> as Monte Carlo) that would take too much time in one datacenter.
>>> For this kind of computation we are thinking of using the second datacenter,
>>> based in Germany. We haven't finished studying the data, but I guess
>>> that for the 80% the data will be local to one datacenter, and for the 20%
>>> it would have to be distributed across datacenters. What we haven't worked out
>>> yet is the size of this distributed data. It looks like it would not be that big
>>> (maybe less than 1 TB, but it could grow when doing some calculations based
>>> on archived data).
>>
>> Antonio:
>>
>> The point here is that if you build one logical Hadoop cluster across
>> two data centers, then Hadoop will consider all nodes as candidates for
>> receiving work regardless of which data center the job is started in.
> 
> Is Hadoop's job scheduler totally NUMA-unaware? The question holds for
> both the cross-data center scenario being discussed and a single data
> center: just imagine the usual within-rack or between-racks
> scheduling decision.

Andrey:

Take a look at how replicas are assigned to data nodes[1] to see how
blocks are distributed. During M/R the job tracker will try to assign each
map task to a task tracker that is "as close to the data as possible" for
that task's input split. Close is either data local (the task tracker is
running on the same machine as the data), rack local (the task tracker is on
a machine in the same rack as the data), or not local at all. What I was
saying is that the third case is always possible (and undesirable), and in a
cluster that spans two data centers it can mean reading the input over the
link between them.
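
To make the three cases concrete, here is a rough Python model of the
preference order. This is only a sketch, not Hadoop's scheduler code: the
Node class, the function names, and the host/rack labels are all made up
for illustration.

```python
from dataclasses import dataclass

# Hypothetical model of a machine; not a Hadoop class.
@dataclass(frozen=True)
class Node:
    host: str
    rack: str

# Lower rank means closer to the data.
RANK = {"data-local": 0, "rack-local": 1, "off-rack": 2}

def locality(tracker, replicas):
    """Classify a task tracker relative to the nodes holding a split's block."""
    if any(r.host == tracker.host for r in replicas):
        return "data-local"
    if any(r.rack == tracker.rack for r in replicas):
        return "rack-local"
    return "off-rack"

def pick_tracker(free_trackers, replicas):
    """Pick the closest free tracker; off-rack is used if nothing else is free."""
    return min(free_trackers, key=lambda t: RANK[locality(t, replicas)])

# Assumed layout: all replicas of the block live in data center 1.
replicas = [
    Node("dc1-r1-n1", "dc1/rack1"),
    Node("dc1-r1-n2", "dc1/rack1"),
    Node("dc1-r2-n1", "dc1/rack2"),
]
same_rack = Node("dc1-r2-n9", "dc1/rack2")
other_dc = Node("dc2-r1-n1", "dc2/rack1")

print(locality(pick_tracker([other_dc, same_rack], replicas), replicas))
print(locality(pick_tracker([other_dc], replicas), replicas))
```

The second call is the problem case: when only trackers in the other data
center have free slots, the task still gets placed there and has to pull
its input remotely.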

[1]
http://hadoop.apache.org/common/docs/current/hdfs_design.html#Replica+Placement%3A+The+First+Baby+Steps

-- 
Eric Sammer
eric@lifeless.net
http://esammer.blogspot.com
