hadoop-mapreduce-user mailing list archives

From: Joey Echeverria <j...@cloudera.com>
Subject: Re: The location of the map execution
Date: Sun, 04 Mar 2012 12:15:33 GMT
I misspoke in my previous e-mail. The default scheduler does do data-local
scheduling, but it's not perfect. When using the default scheduler, tasks are
assigned to TaskTrackers on every heartbeat. When a TaskTracker checks in, the
JobTracker will assign it any pending tasks that are node-local or rack-local.
When you run a job with a single map task, it's very likely that a rack-local
TaskTracker will check in before a node-local one does. This means that for
jobs with a small task count, you're less likely to get data locality. For jobs
with a task count close to or greater than the number of TaskTrackers, you're
much more likely to get node-local assignments.
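
To make that concrete, here is a toy sketch (plain Java, not the real scheduler
code) of the per-heartbeat decision described above; the host and rack names
are made up:

// Illustrative simulation only -- not the actual scheduler source.
import java.util.Arrays;
import java.util.List;

public class LocalitySketch {

    // Return which TaskTracker gets the single pending map task, following
    // the rule above: the first tracker to heartbeat that is node-local or
    // rack-local to the task's input split wins.
    static String assign(String splitHost, String splitRack,
                         List<String[]> heartbeatOrder) {
        for (String[] tracker : heartbeatOrder) {
            String host = tracker[0];
            String rack = tracker[1];
            if (host.equals(splitHost)) {
                return host + " (node-local)";
            }
            if (rack.equals(splitRack)) {
                return host + " (rack-local)";
            }
        }
        return "unassigned";
    }

    public static void main(String[] args) {
        // One map task whose only block lives on nodeA in rack1. nodeB sits
        // in the same rack and happens to heartbeat first, so it gets the task.
        List<String[]> order = Arrays.asList(
                new String[]{"nodeB", "rack1"},
                new String[]{"nodeA", "rack1"});
        System.out.println(assign("nodeA", "rack1", order)); // nodeB (rack-local)
    }
}

With more pending maps than TaskTrackers, nodeA would almost certainly be
handed a node-local map on its own heartbeat.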

-Joey

On Sat, Mar 3, 2012 at 10:44 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
> On Sat, Mar 3, 2012 at 7:41 PM, Joey Echeverria <joey@cloudera.com> wrote:
>>
>> Sorry, I meant have you set the mapred.jobtracker.taskScheduler
>> property in your mapred-site.xml file. If not, you're using the
>> standard, FIFO scheduler. The default scheduler doesn't do data-local
>> scheduling, but the fair scheduler and capacity scheduler do. You want
>> to set mapred.jobtracker.taskScheduler to either
>> org.apache.hadoop.mapred.FairScheduler (for the fair scheduler) or
>> org.apache.hadoop.mapred.CapacityTaskScheduler (for the capacity
>> scheduler) and then restart the JobTracker. You can read about the two
>> schedulers here:
>>
>> http://hadoop.apache.org/common/docs/current/fair_scheduler.html
>> http://hadoop.apache.org/common/docs/current/capacity_scheduler.html
>>
>
> I thought that by default tasks are scheduled on the nodes that hold the data
> blocks. I thought that was inherent. In the fair scheduler link I don't see
> anything about data-local scheduling.
>
>> -Joey
>>
>> On Sat, Mar 3, 2012 at 6:32 PM, Hassen Riahi <hassen.riahi@cern.ch> wrote:
>> > The jobtracker is running on another machine (node C).
>> >
>> > Hassen
>> >
>> >
>> >> Which scheduler are you using?
>> >>
>> >> -Joey
>> >>
>> >> On Mar 3, 2012, at 18:52, Hassen Riahi <hassen.riahi@cern.ch> wrote:
>> >>
>> >>> Hi all,
>> >>>
>> >>> We tried using MapReduce to execute a simple map that reads a txt file
>> >>> stored in HDFS and then writes the output.
>> >>> The file to read is very small: it was not split and was written entirely
>> >>> to a single datanode (node A). This node is also configured as a
>> >>> tasktracker node.
>> >>> While we were expecting the map to execute on node A (since the input is
>> >>> stored there), the log files show that the map was executed on another
>> >>> tasktracker (node B) of the cluster.
>> >>> Am I missing something?
>> >>>
>> >>> Thanks for the help!
>> >>> Hassen
>> >>>
>> >
>>
>>
>>
>> --
>> Joseph Echeverria
>> Cloudera, Inc.
>> 443.305.9434
>
>
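
For reference, the mapred-site.xml change suggested earlier in the thread would
look roughly like this on the JobTracker (substitute CapacityTaskScheduler for
the capacity scheduler), followed by a JobTracker restart:

<configuration>
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
    <!-- or org.apache.hadoop.mapred.CapacityTaskScheduler -->
  </property>
</configuration>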



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434
