hadoop-common-dev mailing list archives

From "Amar Kamat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2014) Job Tracker should prefer input-splits from overloaded racks
Date Fri, 08 Feb 2008 12:24:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567015#action_12567015 ]

Amar Kamat commented on HADOOP-2014:
------------------------------------

I assume that this priority holds across all task types, i.e. we should also prefer data-local
speculative tasks over remote ones whenever there is a cached task that requires speculation,
and fall back to the current strategy otherwise.
_Scenario_
{noformat}
Hosts : Cached Tasks
H1 : T1, T2
H2 : T2, T3
H3 : T3, T4
H4 : T4
{noformat}
_Stages_
{noformat}
1. H1, H2, H3, H4 ask for tasks and get T1, T2, T3, T4 respectively.
2. H2 and H4 are slow, so T2 and T4 require speculation.
3. H3 finishes and asks for more, gets T2.
4. H1 finishes and asks for more, gets T4.
{noformat}
Ideally H3 should get T4 and H1 should get T2, no?
So, the algorithm would be
{code}
1. Find a task that is runnable && !running
    1.1 Scan the cache, keeping track of the runnable local tasks.
    1.2 Scan all the tasks to find one whose split is local to the lowest
        number of trackers (with some load/rack/io/map-slot considerations).
2. Find a task that has failed on all machines // fail early
3. Find a task for speculation
    3.1 Check if there is a local task that can be speculated.
    3.2 Scan all the tasks for one whose split is local to the lowest
        number of trackers (with some load/rack/io/map-slot considerations).
{code}
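As a rough sketch of the selection order above (in Python for brevity; the function name,
the cache/tasks representation, and the per-split locality counts are all hypothetical
stand-ins, not the actual JobTracker data structures):

```python
# Sketch of the proposed selection order. All names and data shapes here
# are illustrative; the real JobTracker keeps this state differently.

def pick_task(tracker, cache, tasks, running, speculating, failed_everywhere):
    """cache: tracker -> list of tasks whose split is local to it.
    tasks[t]['local_trackers']: number of trackers the task's split is local to."""
    local = cache.get(tracker, [])

    # 1. Runnable && not running: prefer a cached (data-local) task, else
    #    the task whose split is local to the fewest trackers overall.
    runnable = [t for t in tasks if t not in running]
    for t in local:
        if t in runnable:
            return t
    if runnable:
        return min(runnable, key=lambda t: tasks[t]['local_trackers'])

    # 2. Fail early: a task that has already failed on all machines.
    if failed_everywhere:
        return failed_everywhere[0]

    # 3. Speculation: again prefer a data-local candidate before falling
    #    back to the lowest-locality speculatable task.
    for t in local:
        if t in speculating:
            return t
    if speculating:
        return min(speculating, key=lambda t: tasks[t]['local_trackers'])
    return None
```

With the scenario above, this gives H3 -> T4 and H1 -> T2 instead of the cache-scan order.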

Thoughts?

> Job Tracker should prefer input-splits from overloaded racks
> ------------------------------------------------------------
>
>                 Key: HADOOP-2014
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2014
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Devaraj Das
>
> Currently, when the Job Tracker assigns a mapper task to a task tracker and there is no split
> local to the task tracker, the job tracker will find the first runnable task in the map task
> list and assign that task to the task tracker.
> The split for the task is not local to the task tracker, of course. However, the split may be
> local to other task trackers. Assigning that task to that task tracker may decrease the
> potential number of mapper attempts with data locality.
> The desired behavior in this situation is to choose a task whose split is not local to any
> task tracker, and resort to the current behavior only if no such task is found.
> In general, it will be useful to know the number of task trackers to which each split is local.
> To assign a task to a task tracker, the job tracker should first try to pick a task that is
> local to the task tracker and that has the minimal number of task trackers to which it is
> local. If no task is local to the task tracker, the job tracker should try to pick a task
> that has the minimal number of task trackers to which it is local.
> It is worthwhile to instrument the job tracker code to report the number of splits that are
> local to some task trackers. That is the maximum possible number of data-local tasks.
> By comparing that number with the actual number of data-local mappers launched, we can gauge
> the effectiveness of the job tracker scheduling.
> When we introduce rack locality, we should apply the same principle.
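The quoted locality metric, counting for each split how many task trackers it is local to
and preferring rarest-first, could be sketched like this (Python for brevity; the function
names and the split-to-tracker mapping are hypothetical, not actual Hadoop APIs):

```python
# Illustrative sketch: rank splits for assignment to one tracker, local
# splits first, and within each group the split with the fewest local
# trackers (the hardest to place) first.

def locality_counts(splits):
    """splits: split_id -> set of trackers holding a replica of that split."""
    return {s: len(trackers) for s, trackers in splits.items()}

def order_for_assignment(tracker, splits):
    counts = locality_counts(splits)
    local = [s for s, trackers in splits.items() if tracker in trackers]
    remote = [s for s in splits if s not in local]
    # Local splits first (rarest first), then remote splits (rarest first).
    return sorted(local, key=counts.get) + sorted(remote, key=counts.get)
```

The sum of the local-count map's nonzero entries' keys gives the maximum possible number of
data-local tasks mentioned in the description, against which launched data-local mappers can
be compared.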

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

