hadoop-common-dev mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3136) Assign multiple tasks per TaskTracker heartbeat
Date Fri, 11 Jul 2008 18:56:31 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12612961#action_12612961 ]

Arun C Murthy commented on HADOOP-3136:
---------------------------------------

The problem with today's behaviour is threefold:
1. It is a utilization bottleneck, especially when the TaskTracker has just started up. We
should be assigning at least 50% of its capacity.
2. If the individual tasks are very short, i.e. they run for less than the heartbeat interval,
the TaskTracker ends up running one task at a time, serially (see the sketch after this list).
3. For jobs with small maps, the TaskTracker never gets a chance to schedule reduces till
all maps are complete. This means the shuffle doesn't overlap with the maps at all, another
sore point.
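
To make the serial-execution point concrete, here is a back-of-envelope sketch in Java. The
numbers (a 10-second heartbeat, 4 map slots, 5-second tasks) are illustrative assumptions,
not measured values:

    // Back-of-envelope sketch (not Hadoop code): TaskTracker throughput under
    // one-task-per-heartbeat vs. batch assignment. All numbers are assumptions.
    public class HeartbeatRampUp {
        public static void main(String[] args) {
            double heartbeatSecs = 10.0; // assumed heartbeat interval
            int mapSlots = 4;            // assumed map slots on the tracker
            double taskSecs = 5.0;       // assumed task duration
            assert taskSecs < heartbeatSecs; // the short-task regime at issue

            // One task per heartbeat: a short task finishes before the next
            // heartbeat arrives, so the tracker completes one task per interval.
            double serialTasksPerMin = 60.0 / heartbeatSecs;

            // Batch assignment: every heartbeat refills all free slots, so up
            // to mapSlots tasks complete per interval.
            double batchTasksPerMin = mapSlots * (60.0 / heartbeatSecs);

            System.out.printf("serial: %.0f tasks/min, batch: %.0f tasks/min%n",
                    serialTasksPerMin, batchTasksPerMin);
        }
    }

With these assumed numbers the tracker runs at a quarter of its map capacity, which is exactly
the utilization bottleneck in point 1.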

Overall, the right approach is to let the TaskTracker advertise the number of available map
and reduce slots in each heartbeat, and to let the JobTracker (i.e., the Scheduler -
HADOOP-3412/HADOOP-3445) decide how many tasks, and which maps/reduces, the TaskTracker
should be assigned. We should also ensure that the TaskTracker doesn't run to the JobTracker
every time a task completes - maybe we should hard-limit reporting to the heartbeat interval,
or only report back early when more than one task has completed within a given heartbeat
interval.
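
As a rough illustration of that shape, here is a hedged Java sketch of a heartbeat that
advertises free slots and a scheduler that hands back a batch of tasks. Every name in it
(SlotReport, Task, assignTasks, the pick* stubs) is hypothetical, for illustration only;
none of these are the actual Hadoop interfaces:

    // Hypothetical sketch of the proposed heartbeat protocol; the types and
    // method names below are illustrative, not the real Hadoop APIs.
    import java.util.ArrayList;
    import java.util.List;

    public class BatchAssignmentSketch {

        // What the TaskTracker would advertise in each heartbeat.
        static class SlotReport {
            final int freeMapSlots, freeReduceSlots;
            SlotReport(int m, int r) { freeMapSlots = m; freeReduceSlots = r; }
        }

        // Placeholder for a map or reduce task attempt.
        static class Task {
            final String id;
            Task(String id) { this.id = id; }
        }

        // The scheduler fills map slots first (where a real implementation
        // would prefer data-local maps), then reduce slots, and returns the
        // whole batch in a single heartbeat response.
        static List<Task> assignTasks(SlotReport report) {
            List<Task> batch = new ArrayList<>();
            for (int i = 0; i < report.freeMapSlots; i++) {
                Task t = pickMap();   // locality-aware choice would go here
                if (t == null) break; // no runnable maps left
                batch.add(t);
            }
            for (int i = 0; i < report.freeReduceSlots; i++) {
                Task t = pickReduce();
                if (t == null) break;
                batch.add(t);
            }
            return batch;
        }

        // Stubs standing in for the scheduler's real selection logic.
        static int nextMap = 0, nextReduce = 0;
        static Task pickMap()    { return nextMap < 3 ? new Task("m_" + nextMap++) : null; }
        static Task pickReduce() { return nextReduce < 1 ? new Task("r_" + nextReduce++) : null; }

        public static void main(String[] args) {
            for (Task t : assignTasks(new SlotReport(4, 2))) {
                System.out.println("assigned " + t.id);
            }
        }
    }

The point of the shape is that the policy (how many, which ones) lives entirely in the
scheduler, so pluggable schedulers like those in HADOOP-3412/HADOOP-3445 can vary it.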

> Assign multiple tasks per TaskTracker heartbeat
> -----------------------------------------------
>
>                 Key: HADOOP-3136
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3136
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Devaraj Das
>
> In today's logic of finding a new task, we assign only one task per heartbeat.
> We could probably give the TaskTracker multiple tasks, subject to the number of free
> slots it has - for maps we could assign it data-local tasks. We could run some logic
> to decide what to give it if we run out of data-local tasks (e.g., tasks from overloaded
> racks, tasks with the least locality, etc.). In addition to maps, if it has reduce slots
> free, we could give it reduce task(s) as well. Again, for reduces we could run some logic
> to give more tasks to nodes that are closer to the nodes running most of the maps
> (assuming the data generated is proportional to the number of maps). For example, if
> rack1 has 70% of the input splits, and we know that most maps are data/rack local, we
> try to schedule ~70% of the reducers there.
> Thoughts?
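
The ~70% figure above is plain proportional allocation. A minimal sketch of that arithmetic,
where rack1's 70% share comes from the example in the description and the remaining shares
are made up:

    // Sketch of proportional reduce placement: target reducers per rack in
    // proportion to the rack's share of input splits, a proxy for where the
    // map output will live. The shares and the reduce count are illustrative.
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class ReducePlacementSketch {
        public static void main(String[] args) {
            Map<String, Double> splitShare = new LinkedHashMap<>();
            splitShare.put("rack1", 0.70); // the 70% rack from the example
            splitShare.put("rack2", 0.20); // made-up shares for the rest
            splitShare.put("rack3", 0.10);
            int totalReduces = 20;

            // Round each rack's target; a real scheduler would reconcile any
            // rounding drift (e.g., largest remainder) and respect slot limits.
            for (Map.Entry<String, Double> e : splitShare.entrySet()) {
                long target = Math.round(e.getValue() * totalReduces);
                System.out.printf("%s -> ~%d reducers%n", e.getKey(), target);
            }
        }
    }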

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

