hadoop-common-dev mailing list archives

From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-362) tasks can get lost when reporting task completion to the JobTracker has an error
Date Thu, 13 Jul 2006 12:17:30 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-362?page=comments#action_12420857 ] 

Devaraj Das commented on HADOOP-362:
------------------------------------

Thanks, Owen, for putting this up on JIRA. The code snippet I sent you was a quick-and-dirty
hack for the problem I was facing; yours is a much more elaborate solution.
However, with this patch the problem shows up somewhere else: the reduces don't make progress.
Even after all maps finish, the reduces remain stuck at 0% progress.
I haven't fully analyzed your patch yet; I will do that.

> tasks can get lost when reporting task completion to the JobTracker has an error
> --------------------------------------------------------------------------------
>
>          Key: HADOOP-362
>          URL: http://issues.apache.org/jira/browse/HADOOP-362
>      Project: Hadoop
>         Type: Bug
>   Components: mapred
>     Reporter: Devaraj Das
>     Assignee: Devaraj Das
>  Attachments: lost-status-updates.patch
>
> Basically, the JobTracker used to lose some updates about successful map tasks and
> would assume that the tasks were still running (the web page kept displaying the old
> progress report). This in turn caused the reduces to wait for map output that they
> would never receive, so the job appeared to hang.
>  
> The following piece of code sends the status of tasks to the JobTracker:
>  
>             synchronized (this) {
>                 for (Iterator it = runningTasks.values().iterator();
>                      it.hasNext(); ) {
>                     TaskInProgress tip = (TaskInProgress) it.next();
>                     TaskStatus status = tip.createStatus();
>                     taskReports.add(status);
>                     if (status.getRunState() != TaskStatus.RUNNING) {
>                         if (tip.getTask().isMapTask()) {
>                             mapTotal--;
>                         } else {
>                             reduceTotal--;
>                         }
>                         it.remove();
>                     }
>                 }
>             }
>  
>             //
>             // Xmit the heartbeat
>             //
>            
>             TaskTrackerStatus status =
>               new TaskTrackerStatus(taskTrackerName, localHostname,
>                                     httpPort, taskReports,
>                                     failures);
>             int resultCode = jobClient.emitHeartbeat(status, justStarted);
>  
>  
> Notice that the completed TIPs are removed from the runningTasks data structure before
> the heartbeat is sent. If emitHeartbeat throws an exception (for example, because it
> cannot reach the JobTracker before the IPC timeout expires), this update is lost. The
> next heartbeat no longer includes the completed task's status, so the JobTracker never
> learns that the task completed. One solution is to remove the completed TIPs from
> runningTasks only after emitHeartbeat returns. Here is what the new code would look like:
>  
>  
>             synchronized (this) {
>                 for (Iterator it = runningTasks.values().iterator();
>                      it.hasNext(); ) {
>                     TaskInProgress tip = (TaskInProgress) it.next();
>                     TaskStatus status = tip.createStatus();
>                     taskReports.add(status);
>                 }
>             }
>  
>             //
>             // Xmit the heartbeat
>             //
>  
>             TaskTrackerStatus status =
>               new TaskTrackerStatus(taskTrackerName, localHostname,
>                                     httpPort, taskReports,
>                                     failures);
>             int resultCode = jobClient.emitHeartbeat(status, justStarted);
>             synchronized (this) {
>                 for (Iterator it = runningTasks.values().iterator();
>                      it.hasNext(); ) {
>                     TaskInProgress tip = (TaskInProgress) it.next();
>                     if (tip.runstate != TaskStatus.RUNNING) {
>                         if (tip.getTask().isMapTask()) {
>                             mapTotal--;
>                         } else {
>                             reduceTotal--;
>                         }
>                         it.remove();
>                     }
>                 }
>             }
>  
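
The "remove only after the heartbeat returns" idea quoted above can be tried in isolation with the
sketch below. It is not the TaskTracker code: HeartbeatSketch, RunState, HeartbeatSink and the
task ids are hypothetical stand-ins, and the mapTotal/reduceTotal counters are left out. The
sketch also makes one assumption beyond the snippet above: it removes only the tasks whose
finished state was actually part of the delivered report, so a task that finishes while the
heartbeat is in flight stays tracked until a later heartbeat carries its final status.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HeartbeatSketch {
    enum RunState { RUNNING, SUCCEEDED, FAILED }

    /** Hypothetical stand-in for the RPC to the JobTracker; may throw on timeout. */
    interface HeartbeatSink {
        void emitHeartbeat(Map<String, RunState> reports) throws Exception;
    }

    // Task id -> last known state; plays the role of runningTasks.
    private final Map<String, RunState> runningTasks = new LinkedHashMap<>();

    synchronized void setState(String taskId, RunState state) {
        runningTasks.put(taskId, state);
    }

    synchronized int trackedTaskCount() {
        return runningTasks.size();
    }

    /** Sends one heartbeat; returns true only if the sink accepted it. */
    boolean heartbeat(HeartbeatSink sink) {
        Map<String, RunState> reports;
        synchronized (this) {
            // Snapshot the statuses, but do not remove anything yet.
            reports = new LinkedHashMap<>(runningTasks);
        }
        try {
            sink.emitHeartbeat(reports);
        } catch (Exception e) {
            // Delivery failed: keep every task so the next heartbeat resends it.
            return false;
        }
        synchronized (this) {
            for (Map.Entry<String, RunState> e : reports.entrySet()) {
                if (e.getValue() != RunState.RUNNING) {
                    // Remove only if the state is still the one that was reported.
                    runningTasks.remove(e.getKey(), e.getValue());
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        HeartbeatSketch tracker = new HeartbeatSketch();
        tracker.setState("map_1", RunState.SUCCEEDED);
        tracker.setState("map_2", RunState.RUNNING);

        // First attempt fails, so the completed task must still be tracked.
        boolean first = tracker.heartbeat(r -> {
            throw new IOException("simulated IPC timeout");
        });
        System.out.println(first + " " + tracker.trackedTaskCount());
        // prints: false 2

        // Second attempt succeeds and still carries the completed task's status.
        List<Map<String, RunState>> received = new ArrayList<>();
        boolean second = tracker.heartbeat(r -> received.add(r));
        System.out.println(second + " " + tracker.trackedTaskCount() + " " + received.get(0));
        // prints: true 1 {map_1=SUCCEEDED, map_2=RUNNING}
    }
}
```

In this sketch a failed heartbeat leaves runningTasks untouched, so the completed task is reported
again on the next attempt instead of being dropped.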

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

