hadoop-mapreduce-user mailing list archives

From Virajith Jalaparti <virajit...@gmail.com>
Subject Re: Lack of data locality in Hadoop-0.20.2
Date Tue, 12 Jul 2011 17:02:42 GMT
I am attaching the config files I was using for these runs with this email.
I am not sure if something in them is causing this non-data locality of


On Tue, Jul 12, 2011 at 3:36 PM, Virajith Jalaparti <virajith.j@gmail.com> wrote:

> I am using a replication factor of 1 since I don't want to incur the overhead
> of replication and I am not much worried about reliability.
> I am just using the default Hadoop scheduler (FIFO, I think!). In the case of
> a single rack, rack-locality doesn't really have any meaning. Obviously
> everything will run in the same rack. I am concerned about data-local maps.
> I assumed that Hadoop would do a much better job of ensuring data-local maps,
> but that doesn't seem to be the case here.
> -Virajith
> On Tue, Jul 12, 2011 at 3:30 PM, Arun C Murthy <acm@hortonworks.com> wrote:
>> Why are you running with replication factor of 1?
>> Also, it depends on the scheduler you are using. The CapacityScheduler in
>> 0.20.203 (not 0.20.2) has much better locality for jobs, similarly with
>> FairScheduler.
>> In any case, running on a single rack with replication of 1 implies
>> rack-locality for all tasks which, in most cases, is good enough.
>> Arun
>> On Jul 12, 2011, at 5:45 AM, Virajith Jalaparti wrote:
>> > Hi,
>> >
>> > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of
>> > input data using a 20-node cluster. HDFS is configured to use a 128MB
>> > block size (so 1600 maps are created) and a replication factor of 1 is
>> > being used. All 20 nodes are also HDFS datanodes. I was using a bandwidth
>> > value of 50Mbps between the nodes (this was configured using Linux
>> > "tc"). I see that around 90% of the map tasks are reading data over the
>> > network, i.e. most of the map tasks are not being scheduled on the nodes
>> > where the data to be processed by them is located.
>> > My understanding was that Hadoop tries to schedule as many data-local
>> > maps as possible. But in this situation, this does not seem to happen. Is
>> > there any reason why this is happening? And is there a way to configure
>> > Hadoop to ensure the maximum possible node locality?
>> > Any help regarding this is very much appreciated.
>> >
>> > Thanks,
>> > Virajith
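
As a rough sanity check on the figures quoted in the original post, the sketch below recomputes the number of map tasks and estimates two quantities that are not stated in the thread: the fraction of maps that would be data-local if tasks were assigned to nodes with no regard to locality, and the time to pull one block over a 50Mbps link. The function names, the binary-unit interpretation of GB/MB, and the "uniformly random assignment" model are assumptions made for illustration only; they are not a description of how the Hadoop 0.20.2 scheduler actually behaves.

```python
# Back-of-the-envelope check of the numbers quoted in the thread.
# Assumptions (not from the original post): GB and MB are binary units,
# each block lives on exactly `replication` distinct nodes, and a
# locality-blind scheduler hands each map to a uniformly random node.

INPUT_GB = 200      # input size from the post
BLOCK_MB = 128      # HDFS block size from the post
NODES = 20          # cluster size from the post
REPLICATION = 1     # replication factor from the post
LINK_MBPS = 50      # megabits per second, as limited with tc


def num_map_tasks(input_gb, block_mb):
    """One map per block; round up for a partial final block."""
    total_mb = input_gb * 1024
    return -(-total_mb // block_mb)  # ceiling division


def random_local_fraction(nodes, replication):
    """Chance that a randomly chosen node holds a replica of the block."""
    return replication / nodes


def block_transfer_seconds(block_mb, link_mbps):
    """Time to move one full block across a link of the given rate."""
    bits = block_mb * 1024 * 1024 * 8
    return bits / (link_mbps * 1_000_000)


maps = num_map_tasks(INPUT_GB, BLOCK_MB)
local_fraction = random_local_fraction(NODES, REPLICATION)
transfer_s = block_transfer_seconds(BLOCK_MB, LINK_MBPS)

print("map tasks:", maps)
print("data-local fraction under random assignment:", local_fraction)
print("seconds to fetch one block remotely:", round(transfer_s, 1))
```

Under these assumptions the block count matches the 1600 maps mentioned in the post, and a locality-blind assignment would leave only about 5% of maps data-local. The reported figure of roughly 90% remote reads is therefore not far from what random placement would give, which may be a useful baseline when comparing schedulers.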
