hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast
Date Fri, 08 Feb 2008 03:43:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566901#action_12566901 ]

Owen O'Malley commented on HADOOP-2119:

I think we can do better than that, by using a special data structure that isn't that complicated.

I propose that we use a 2-d sparse matrix, where each row is a location (node or rack) and each
column corresponds to a task in progress (TIP) that is currently runnable, but not running.
I'd make the rows doubly linked circular lists and the columns singly linked circular
lists. So let's say the operations are:

interface LocationTable {
  // add to the front of the lists for all of the locations
  public void addToFront(TaskInProgress tip, String[] locations);
  // add it to the back of the lists at all of the locations
  public void addToBack(TaskInProgress tip, String[] locations);
  // get the first task in the given location and remove it from all of the lists
  public TaskInProgress getFront(String location);
}

All of the operations involve doing a lookup to find the list and an O(1) operation to modify
each of the lists. *Doing deletes out of a doubly linked list is very fast.* If we use a hash
table from the location name to the front of the list for that location, then the lookup is
also O(1).
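
To make the structure concrete, here is a minimal sketch of one way the table could be laid out. It is
an illustration under assumptions, not the patch for this issue: the type parameter T stands in for
TaskInProgress, and the Cell class, its field names, and the per-location sentinel are hypothetical.
Each cell sits in one row (doubly linked, circular) and one column (singly linked, circular), so
removing a task walks its column and unlinks each cell from its row in constant time.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the proposed table; T stands in for TaskInProgress. */
class LocationTable<T> {
  /** One cell of the sparse matrix: a (location, task) pair. */
  private static final class Cell<T> {
    T task;              // null for the per-location sentinel
    Cell<T> prev, next;  // row links: doubly linked, circular
    Cell<T> sameTask;    // column link: singly linked, circular
  }

  // location name -> sentinel cell heading that location's circular list
  private final Map<String, Cell<T>> rows = new HashMap<>();

  /** Finds the sentinel for a location, creating an empty list if needed. */
  private Cell<T> row(String location) {
    Cell<T> head = rows.get(location);
    if (head == null) {
      head = new Cell<>();
      head.prev = head.next = head;
      rows.put(location, head);
    }
    return head;
  }

  private void add(T task, String[] locations, boolean front) {
    Cell<T> first = null, last = null;
    for (String location : locations) {
      Cell<T> head = row(location);
      Cell<T> cell = new Cell<>();
      cell.task = task;
      // Insert right after the sentinel (front) or right after the tail (back).
      Cell<T> before = front ? head : head.prev;
      cell.prev = before;
      cell.next = before.next;
      before.next.prev = cell;
      before.next = cell;
      // Chain this cell into the task's column.
      if (first == null) first = cell; else last.sameTask = cell;
      last = cell;
    }
    if (last != null) last.sameTask = first; // close the column ring
  }

  public void addToFront(T task, String[] locations) { add(task, locations, true); }

  public void addToBack(T task, String[] locations) { add(task, locations, false); }

  /** Returns the first task at the location, or null, and removes it everywhere. */
  public T getFront(String location) {
    Cell<T> head = rows.get(location);
    if (head == null || head.next == head) return null;
    Cell<T> start = head.next;
    Cell<T> cell = start;
    do { // walk the column, unlinking the task from every row it appears in
      cell.prev.next = cell.next;
      cell.next.prev = cell.prev;
      cell = cell.sameTask;
    } while (cell != start);
    return start.task;
  }
}
```

In this sketch the cost of an add or a remove is proportional to the number of locations the task is
listed under, with constant work per list.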

I think we should solve HADOOP-2014 at the same time http://issues.apache.org/jira/browse/HADOOP-2014?focusedCommentId=12566814#action_12566814

So the order would be:
  1. Look at the node local list O(1)
  2. Look at the rack local list O(1)
  3. Look at the most overloaded rack from HADOOP-2014 O(# racks)

Between the 3 of them, you'll always find a task if there are any to run. Updating all of
the lists is O(1), regardless of how you found the task.
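
The lookup order above can be sketched as follows. This is only an illustration under assumptions: the
TaskPicker class, the Table interface, and the rackLoad map are hypothetical names, and "most
overloaded rack" is approximated here by a per-rack count of pending tasks, which may not match the
measure HADOOP-2014 ends up using.

```java
import java.util.Map;

/** Hypothetical sketch of the three-step lookup order. */
class TaskPicker {
  /** Minimal view of the location table: only the lookup used here. */
  interface Table<T> {
    T getFront(String location); // returns null when the list is empty
  }

  /**
   * Picks a task for a tracker on the given node and rack.
   * rackLoad is an assumed stand-in for the HADOOP-2014 load measure.
   */
  static <T> T pick(Table<T> table, String node, String rack,
                    Map<String, Integer> rackLoad) {
    T task = table.getFront(node);   // 1. node-local list, O(1)
    if (task != null) return task;
    task = table.getFront(rack);     // 2. rack-local list, O(1)
    if (task != null) return task;
    String busiest = null;           // 3. scan the racks, O(# racks)
    int max = 0;
    for (Map.Entry<String, Integer> e : rackLoad.entrySet()) {
      if (e.getValue() > max) {
        max = e.getValue();
        busiest = e.getKey();
      }
    }
    return busiest == null ? null : table.getFront(busiest);
  }
}
```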

When tasks fail, you put them back at the front of all of the relevant lists.

Which leaves the question of speculative execution... I suspect a LocationTable with the currently
running tasks would work pretty well.

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.17.0
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt
> I ran a job with 0 reducers on a cluster with 390 nodes.
> The mappers ran very fast.
> The jobtracker lagged behind on committing completed mapper tasks.
> The number of running mappers displayed on the web UI kept getting bigger and bigger.
> The job tracker eventually stopped responding to the web UI.
> No progress is reported afterwards.
> Job tracker is running on a separate node.
> The job tracker process consumed 100% cpu, with vm size 1.01g (reach the heap space limit).

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
