hadoop-common-dev mailing list archives

From "Amar Kamat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast
Date Fri, 21 Mar 2008 22:41:24 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581206#action_12581206 ]

Amar Kamat commented on HADOOP-2119:
------------------------------------

I have taken Owen's comments into consideration. Here is what was done:
bq. I really wish that the synchronization changes could be done in another patch ...
+1. Removed all the synchronization changes. Will open another issue regarding the same.
bq. siblingSetAtLevel seems really arcane. I would propose that instead you add getChildren to the ...
Maintaining this information at the Node level might involve more complexity and would require
more testing. A concept of children already exists in NodeBase, but looking at the code it
is not very clear what they are for or how to use them. Now there is just a single set of
nodes at {{maxlevel}} maintained at the JobTracker. For now this seems to be the simpler solution.
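The idea above can be sketched roughly as follows. This is a hypothetical illustration, not the actual Hadoop 0.17 code; the class and method names are made up for the example.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: rather than tracking children on every Node, the JobTracker
// keeps one flat set of the nodes that sit at the maximum cached
// topology level. Names are illustrative only.
class Node {
    final String name;
    final int level; // depth in the network topology

    Node(String name, int level) {
        this.name = name;
        this.level = level;
    }
}

class TopologyCache {
    private final int maxLevel;
    private final Set<Node> nodesAtMaxLevel = new HashSet<>();

    TopologyCache(int maxLevel) {
        this.maxLevel = maxLevel;
    }

    // Called as trackers and datanodes resolve; only nodes at maxLevel
    // are retained in the flat set.
    void register(Node n) {
        if (n.level == maxLevel) {
            nodesAtMaxLevel.add(n);
        }
    }

    Set<Node> siblingSet() {
        return nodesAtMaxLevel;
    }
}
```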
bq. why is there yet another map from hostname to Node? This is already done in the node mapping.
This is done to incur less penalty during job execution. While the job is running, the
only penalty incurred is for resolving datanodes and newly joining trackers, while
resolution of trackers (before the job is submitted) is done as part of the heartbeat (in a
separate thread). Without this mapping there is no way to find the Node for a given hostname.
Also, I have renamed the variable _trackerNameToNodeMap_ that is in trunk, and I am now
using it to store the mapping for datanodes too.
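The payoff of the map can be illustrated with a small sketch (hypothetical names, not the JobTracker's actual fields): resolution runs once, off the job-execution path, and later lookups by hostname are cheap map hits.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a hostname -> node-path cache. Only the first lookup for a
// hostname pays for (expensive) topology resolution.
class HostToNodeMap {
    private final Map<String, String> map = new HashMap<>();
    int resolutions = 0; // counts how often the resolver actually runs

    // Stand-in for topology resolution (e.g. a rack-awareness script).
    private String resolve(String hostname) {
        resolutions++;
        return "/default-rack/" + hostname;
    }

    // Cached lookup: later calls for the same hostname are O(1).
    String nodeFor(String hostname) {
        return map.computeIfAbsent(hostname, this::resolve);
    }
}
```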
bq.  I'm really concerned that we are adding 5 new fields holding collections to the JobInProgress
As I said, this is required to do away with the array, and the total space is roughly
bounded by the total number of TIPs: each TIP is either local or non-local, and either
running or non-running, so mostly they just move from one list to another. Hence
_local-maps-non-running + local-maps-running + non-local-maps-non-running + non-local-maps-running
~ total-map-tips_
and
_non-running-reduces + running-reduces ~ total-reduce-tips_.
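The space argument can be shown with a toy sketch (illustrative names, not the actual JobInProgress fields): every map TIP sits in exactly one of the four lists and moves between them rather than being duplicated, so the combined size stays about the number of map TIPs.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the partitioning: four disjoint lists whose sizes sum to
// the number of map TIPs held; a TIP moves, it is never duplicated.
class MapTipLists {
    List<String> localNonRunning = new ArrayList<>();
    List<String> localRunning = new ArrayList<>();
    List<String> nonLocalNonRunning = new ArrayList<>();
    List<String> nonLocalRunning = new ArrayList<>();

    int totalHeld() {
        return localNonRunning.size() + localRunning.size()
                + nonLocalNonRunning.size() + nonLocalRunning.size();
    }

    // Starting a local TIP moves it from one list to the other.
    void startLocal(String tip) {
        if (localNonRunning.remove(tip)) {
            localRunning.add(tip);
        }
    }
}
```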
bq. reducers is a really bad name.
Fixed.
bq. nodesToMaps should be runnableMaps
Runnable means !failed && !completed. Both running and non-running TIPs belong to the
runnable category, so I have used a different name for this variable.
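The distinction drawn above is simply that "runnable" is a superset of "running"; a minimal illustration (hypothetical class, not the actual TIP code):

```java
// A TIP stays runnable as long as it has neither failed nor completed,
// whether or not an attempt is currently executing.
class Tip {
    boolean failed;
    boolean completed;
    boolean running;

    boolean isRunnable() {
        return !failed && !completed;
    }
}
```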
bq. Don't use assignment in a parameter to a method in initTasks
Fixed.
bq. I'm bothered by all of the checks for null Nodes that just skip the location.
Fixed. Now there are no null checks.
bq. Shouldn't we remove the node from the nodesToMaps regardless of the level?
Consider a case where _tip1_ fails on _host1_, and _host1_ belongs to _rack1_. Now _host1_ runs
out of cached tips and queries _rack1_'s cache. In such a case it should not remove the tip,
since some other tracker in the same rack can still schedule it.
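That lookup can be sketched as follows (a hypothetical illustration, not the scheduler's actual code): when the host's own cache is exhausted it falls back to the rack cache, but a TIP that only failed on *this* host is skipped and left in the rack cache so another tracker in the rack can still take it.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Sketch: per-host TIP lookup with rack-level fallback.
class TipCaches {
    Deque<String> hostCache = new ArrayDeque<>();
    Deque<String> rackCache = new ArrayDeque<>();
    Set<String> failedOnThisHost = new HashSet<>();

    String nextTip() {
        // Host-level entries are always safe to remove for this host.
        while (!hostCache.isEmpty()) {
            String tip = hostCache.poll();
            if (!failedOnThisHost.contains(tip)) {
                return tip;
            }
        }
        // Rack-level: remove only the TIP we actually schedule; TIPs
        // that failed here stay cached for the rest of the rack.
        Iterator<String> it = rackCache.iterator();
        while (it.hasNext()) {
            String tip = it.next();
            if (!failedOnThisHost.contains(tip)) {
                it.remove();
                return tip;
            }
        }
        return null;
    }
}
```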
bq. nodesToMaps being null should be a fatal error
Fixed. In case of misconfiguration (i.e. nodesToMaps = null) the JobTracker will log a fatal
error and shut down.
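The fail-fast behavior might look roughly like this (illustrative names only, not the actual JobTracker startup code):

```java
// Sketch: on a misconfigured topology (the node map being null), refuse
// to start rather than limp along with broken scheduling.
class TopologyStartupCheck {
    static void check(Object nodesToMaps) {
        if (nodesToMaps == null) {
            throw new IllegalStateException(
                "Network topology misconfigured: node map is null; "
                + "shutting down JobTracker");
        }
    }
}
```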

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.17.0
>
>         Attachments: HADOOP-2119-v4.1.patch, HADOOP-2119-v5.1.patch, HADOOP-2119-v5.1.patch,
HADOOP-2119-v5.2.patch, hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt
>
>
> I ran a job with 0 reducers on a cluster with 390 nodes.
> The mappers ran very fast.
> The jobtracker lagged behind on committing completed mapper tasks.
> The number of running mappers displayed on the web UI kept getting bigger and bigger.
> The job tracker eventually stopped responding to the web UI.
> No progress was reported afterwards.
> The job tracker was running on a separate node.
> The job tracker process consumed 100% CPU, with VM size 1.01g (reaching the heap space limit).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

