hadoop-mapreduce-user mailing list archives

From: Arun C Murthy <...@hortonworks.com>
Subject: Re: Lack of data locality in Hadoop-0.20.2
Date: Tue, 12 Jul 2011 14:30:38 GMT

Why are you running with replication factor of 1?

Also, it depends on the scheduler you are using. The CapacityScheduler in 0.20.203 (not 0.20.2)
has much better locality for jobs, as does the FairScheduler.

In any case, running on a single rack with a replication factor of 1 implies rack-locality for
all tasks, which in most cases is good enough.


On Jul 12, 2011, at 5:45 AM, Virajith Jalaparti wrote:

> Hi,
> I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input data using
> a 20-node cluster. HDFS is configured to use a 128MB block size (so 1600 maps are created)
> and a replication factor of 1. All 20 nodes are also HDFS datanodes. I was
> using a bandwidth of 50Mbps between the nodes (configured using Linux
> "tc"). I see that around 90% of the map tasks are reading data over the network, i.e. most
> of the map tasks are not being scheduled on the nodes where the data they process
> is located.
> My understanding was that Hadoop tries to schedule as many data-local maps as possible.
> But in this situation, that does not seem to happen. Is there a reason why, and
> is there a way to configure Hadoop to ensure the maximum possible node locality?
> Any help regarding this is very much appreciated.
> Thanks,
> Virajith
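
The figures in the quoted message can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope illustration, not something from the thread: it assumes binary units (1GB = 1024MB), one map task per HDFS block, perfectly even block placement across the datanodes, and no protocol overhead on the shaped link. The variable names are made up for this example.

```python
# Back-of-the-envelope check of the figures quoted in the message above.
# Assumptions (not stated in the thread): binary units, one map per block,
# even block placement, and no protocol overhead on the network link.

INPUT_GB = 200    # size of the Sort input
BLOCK_MB = 128    # configured HDFS block size
NODES = 20        # datanodes / tasktrackers in the cluster
LINK_MBPS = 50    # megabits per second, as shaped with tc

# One map task per HDFS block of input.
num_maps = (INPUT_GB * 1024) // BLOCK_MB

# Blocks held by each datanode if placement were perfectly even (an idealisation).
blocks_per_node = num_maps // NODES

# Time for a non-local map to pull one full block over a 50 Mbps link.
block_bits = BLOCK_MB * 1024 * 1024 * 8
seconds_per_remote_block = block_bits / (LINK_MBPS * 1_000_000)

print(num_maps, blocks_per_node, round(seconds_per_remote_block, 1))
# prints: 1600 80 21.5
```

Under these assumptions each non-local map spends roughly 21 seconds just fetching its block, which gives a sense of why having about 90% of maps read over the network is costly at this bandwidth.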
