hadoop-common-dev mailing list archives

From "Sanjay Dahiya (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-737) TaskTracker's job cleanup loop should check for finished job before deleting local directories
Date Mon, 20 Nov 2006 12:47:02 GMT
TaskTracker's job cleanup loop should check for finished job before deleting local directories


                 Key: HADOOP-737
                 URL: http://issues.apache.org/jira/browse/HADOOP-737
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
            Reporter: Sanjay Dahiya
         Assigned To: Sanjay Dahiya
            Priority: Critical

TaskTracker uses jobClient.pollForTaskWithClosedJob() to find tasks which should be closed.
This mechanism doesn't pass on whether the job is really finished or the task
is being killed for some other reason (for example, a speculative instance succeeded). Since the
TaskTracker doesn't know this state, it assumes the job is finished and deletes the local job dir, causing any
subsequent tasks for the same job on the same task tracker to fail with a "job.xml not found" exception, as
reported in HADOOP-546 and possibly in HADOOP-543. This causes my patch for HADOOP-76 to fail for a
large number of reduce tasks in some cases.
The same problem causes extra exceptions in the logs while a job is being killed: the first task that gets
closed deletes the local directories, and any other tasks which are about to be launched
throw this exception. This case is less significant, as the job is being killed anyway
and the only effect is extra exceptions in the logs.

Possible solutions:
1. Add an extra method to InterTrackerProtocol for checking the job status before deleting
the local directory.
2. Set TaskTracker.RunningJob.localized to false once the local directory is deleted, so that
new tasks don't look for it there.

There is clearly still a race condition in this, and the logs may still get the exception during
shutdown, but in normal cases it would work.
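
To make the two options concrete, here is a rough sketch of how they might fit together. It is not
taken from the Hadoop source: JobStatusSource, isJobFinished(), cleanupAfterTaskClose(),
prepareForLaunch() and the localDirPresent flag are hypothetical names invented for illustration,
and RunningJob here is a simplified stand-in for TaskTracker.RunningJob. The in-memory flag only
models the local job directory; real code would delete and recreate files.

```java
// Hypothetical sketch only; these are not Hadoop's actual classes or methods.
class JobCleanupSketch {

    /** Stand-in for the extra status call proposed in option 1 (assumed name). */
    interface JobStatusSource {
        boolean isJobFinished(String jobId);
    }

    /** Simplified stand-in for TaskTracker.RunningJob. */
    static class RunningJob {
        final String jobId;
        boolean localized = true;        // option 2: cleared when the local dir is deleted
        boolean localDirPresent = true;  // models the local job directory (job.xml etc.)

        RunningJob(String jobId) {
            this.jobId = jobId;
        }
    }

    /**
     * Called when a task of the job is closed.
     * Returns true only if the local job directory was actually deleted.
     */
    static boolean cleanupAfterTaskClose(RunningJob job, JobStatusSource status) {
        synchronized (job) {
            // Option 1: a closed task does not imply a finished job,
            // e.g. the task may have lost to a speculative instance.
            if (!status.isJobFinished(job.jobId)) {
                return false;
            }
            job.localDirPresent = false; // stands in for deleting the directory
            job.localized = false;       // option 2: tell later tasks it is gone
            return true;
        }
    }

    /**
     * Called before launching a task of the job.
     * Returns true if the job files had to be localized again.
     */
    static boolean prepareForLaunch(RunningJob job) {
        synchronized (job) {
            if (job.localized) {
                return false;            // job.xml is still in place
            }
            job.localDirPresent = true;  // stands in for re-fetching job.xml and the jar
            job.localized = true;
            return true;
        }
    }

    public static void main(String[] args) {
        RunningJob job = new RunningJob("job_0001");
        // Task killed while the job is still running: nothing is deleted.
        System.out.println(cleanupAfterTaskClose(job, id -> false));
        // Job really finished: the directory is removed and the flag cleared.
        System.out.println(cleanupAfterTaskClose(job, id -> true));
        // A late task launch notices the cleared flag and localizes again.
        System.out.println(prepareForLaunch(job));
    }
}
```

Holding the same lock in both methods only orders cleanup and launch within one tracker. The job
status can still change between the status check and the delete, so the race described above
remains.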

Comments?
