hadoop-common-dev mailing list archives

From "Vivek Ratan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast
Date Wed, 13 Feb 2008 16:49:08 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568621#action_12568621 ]

Vivek Ratan commented on HADOOP-2119:
-------------------------------------

The only differences between 1a (we need to give that a name; let's call it the 'cache map',
as it's mostly based on how the cache is implemented) and a sparse matrix are that in the
latter, each TIP is linked to the TIP below it in the same column, and that the linked list
for a row in a sparse matrix is doubly linked (you need that to delete tasks efficiently from
the running list), while in 1a the runnable list is a singly linked list. Given that, I would
vote for the cache map, for the following reasons: 
- My big concern with implementing a sparse matrix now is that you're implementing a brand
new data structure. Given how core this functionality is, and time constraints, it's riskier
to introduce such newness in the code. 
- You already have most of the code in place for cache map, it's been there for a while and
tested in production. That gives me a lot more comfort than putting in brand new code. 
- In terms of performance, the only difference between a linked list for running tasks (2a)
and a sparse matrix is for speculative tasks, where the latter performs better. However, it's
not clear to me how much this will reflect in the overall performance. It seems like the effect
of lower performance of a linked list may be extremely minimal in the overall scheme of things,
so why throw in new code? It's better, IMO, to see whether this performance is indeed significant
before making big changes. 
- As I mentioned earlier, the cache map and sparse matrix are almost identical. I don't see
a sparse matrix being any more simple or elegant than a cache map, i.e., I see both as fairly
simple and elegant structures. 
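
To illustrate the deletion-cost point made above, here is a minimal Java sketch. It is not the
JobTracker code or the attached patch: every class and method name in it is hypothetical, and
it assumes only what the comment states, namely that a doubly linked row lets a known entry be
unlinked directly, while a singly linked runnable list has to be walked to find the predecessor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: class and method names are illustrative only and are
// not taken from the Hadoop source or the attached patch.
class RunningListSketch {

    /** Node of a doubly linked list; stands in for one TIP entry in a row. */
    static final class DNode {
        final String tipId;
        DNode prev, next;
        DNode(String tipId) { this.tipId = tipId; }
    }

    /** Doubly linked list: a node we already hold can be unlinked in O(1). */
    static final class DoublyLinked {
        DNode head, tail;

        /** Appends an entry and returns its node so the caller can remove it later. */
        DNode add(String tipId) {
            DNode n = new DNode(tipId);
            if (tail == null) { head = n; } else { tail.next = n; n.prev = tail; }
            tail = n;
            return n;
        }

        /** Unlinks the given node without traversing the list. */
        void remove(DNode n) {
            if (n.prev != null) { n.prev.next = n.next; } else { head = n.next; }
            if (n.next != null) { n.next.prev = n.prev; } else { tail = n.prev; }
            n.prev = null;
            n.next = null;
        }

        /** Returns the remaining ids in list order. */
        List<String> ids() {
            List<String> out = new ArrayList<>();
            for (DNode n = head; n != null; n = n.next) { out.add(n.tipId); }
            return out;
        }
    }

    /** Node of a singly linked list; it has no back pointer. */
    static final class SNode {
        final String tipId;
        SNode next;
        SNode(String tipId) { this.tipId = tipId; }
    }

    /** Singly linked list: removal walks from the head to find the predecessor. */
    static final class SinglyLinked {
        SNode head, tail;

        /** Appends an entry at the tail. */
        void add(String tipId) {
            SNode n = new SNode(tipId);
            if (tail == null) { head = n; } else { tail.next = n; }
            tail = n;
        }

        /** Removes the entry and returns how many nodes were visited, or -1 if absent. */
        int remove(String tipId) {
            SNode prev = null;
            int visited = 0;
            for (SNode cur = head; cur != null; prev = cur, cur = cur.next) {
                visited++;
                if (cur.tipId.equals(tipId)) {
                    if (prev == null) { head = cur.next; } else { prev.next = cur.next; }
                    if (cur == tail) { tail = prev; }
                    return visited;
                }
            }
            return -1;
        }
    }
}
```

In this sketch the singly linked removal reports how many nodes it had to visit, which grows with
the position of the entry; whether that cost matters in practice is exactly the open question
raised in the performance bullet above.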

I agree that a sparse matrix is the better option for speculative tasks, and that it may be
also useful in the future for more complex scheduling decisions, as Arun points out. However,
because it requires new code in such central, core functionality, I'd recommend a more
cautious approach of using the tried-and-tested code you already have to solve most, if not
all, of the problems you're facing today, and looking at a sparse matrix implementation if
the need is great. New code always brings its own problems with testing and implementation and
potential side effects. 

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.17.0
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt
>
>
> I ran a job with 0 reducer on a cluster with 390 nodes.
> The mappers ran very fast.
> The jobtracker lags behind on committing completed mapper tasks.
> The number of running mappers displayed on web UI keeps getting bigger and bigger.
> The job tracker eventually stopped responding to web UI.
> No progress is reported afterwards.
> Job tracker is running on a separate node.
> The job tracker process consumed 100% cpu, with vm size 1.01g (reach the heap space limit).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

