hadoop-common-dev mailing list archives

From "Amar Kamat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast
Date Thu, 20 Mar 2008 18:01:25 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580874#action_12580874 ]

Amar Kamat commented on HADOOP-2119:
------------------------------------

Some comments about the synchronization changes:
1) The synchronization changes are meant to avoid locking the JobTracker wherever possible.
2) The following JobTracker APIs can be unsynchronized w.r.t. the JobTracker:
{noformat}
  a) getMapTaskReports
  b) getReduceTaskReports
  c) getTaskDiagnostics
  d) getTaskCompletionEvents
{noformat}
3) *a*, *b* and *c* are APIs for the JobClient, while *d* is for the reduce tasks.
4) *a* and *b* lock the JobTracker (then the JobInProgress and then the TaskInProgress)
so that they can get the correct value of {{completes}} (via {{isComplete()}}), while *c* locks
for the diagnostic information ({{taskDiagnosticData}}) (via {{taskDiagnosticData()}}, {{generateSingleReport()}}
and {{addDiagnosticInfo()}}).
5) I made {{completes}} an AtomicInteger. Updates to {{taskDiagnosticData}} are done only after
synchronizing on {{taskDiagnosticData}}, i.e. the object itself.
6) Also, the patch makes sure that the data is always correct, although it might be stale. For example,
when a task Task1 completes the TaskInProgress (via {{TaskInProgress.setSuccessfulTaskid(Task1)}}),
there will never be a case where {{isComplete()}} is true and {{completes(Task1)}} is false.
7) *d* actually need not lock the JobTracker; JobInProgress locking seems sufficient. Removing
the JobTracker synchronization does not affect it in any way.
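
To illustrate points 5 and 6, here is a minimal sketch of the locking pattern, not the code from the patch. The class {{TaskStateSketch}}, the {{String}} task ids, the {{getDiagnosticInfo()}} reader, and the ordering inside {{setSuccessfulTaskId()}} are all assumptions made for the example; only the ideas of an AtomicInteger {{completes}} and of synchronizing on {{taskDiagnosticData}} itself come from the comment above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for TaskInProgress; NOT the Hadoop class. */
class TaskStateSketch {
    // Completion counter; readers need no lock on any enclosing object.
    private final AtomicInteger completes = new AtomicInteger(0);

    // Assumed per-attempt state: the id of the attempt that succeeded.
    private volatile String successfulTaskId = null;

    // Diagnostics per attempt, guarded only by this map's own monitor.
    private final Map<String, List<String>> taskDiagnosticData = new TreeMap<>();

    void addDiagnosticInfo(String taskId, String info) {
        synchronized (taskDiagnosticData) {
            taskDiagnosticData.computeIfAbsent(taskId, k -> new ArrayList<>()).add(info);
        }
    }

    List<String> getDiagnosticInfo(String taskId) {
        synchronized (taskDiagnosticData) {
            List<String> data = taskDiagnosticData.get(taskId);
            // Return a copy so callers never iterate the shared list unlocked.
            return data == null ? new ArrayList<>() : new ArrayList<>(data);
        }
    }

    void setSuccessfulTaskId(String taskId) {
        // Assumed ordering: publish the per-attempt state first, then bump
        // the counter, so a reader that sees isComplete() == true also sees
        // isComplete(taskId) == true. A reader may see stale "not complete".
        successfulTaskId = taskId;
        completes.incrementAndGet();
    }

    boolean isComplete() {
        return completes.get() > 0;
    }

    boolean isComplete(String taskId) {
        return taskId.equals(successfulTaskId);
    }

    public static void main(String[] args) {
        TaskStateSketch tip = new TaskStateSketch();
        tip.addDiagnosticInfo("Task1", "attempt started");
        tip.setSuccessfulTaskId("Task1");
        System.out.println(tip.isComplete() + " " + tip.getDiagnosticInfo("Task1"));
    }
}
```

Because the writer updates the volatile field before the atomic counter, a reader that checks the counter first can see a stale "not complete" answer but never a complete task without its successful attempt.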

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.17.0
>
>         Attachments: HADOOP-2119-v4.1.patch, HADOOP-2119-v5.1.patch, HADOOP-2119-v5.1.patch, hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt
>
>
> I ran a job with 0 reducers on a cluster with 390 nodes.
> The mappers ran very fast.
> The jobtracker lagged behind on committing completed mapper tasks.
> The number of running mappers displayed on the web UI kept getting bigger and bigger.
> The job tracker eventually stopped responding to the web UI.
> No progress was reported afterwards.
> The job tracker was running on a separate node.
> The job tracker process consumed 100% CPU, with a VM size of 1.01g (reaching the heap space limit).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

