hadoop-common-dev mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3370) failed tasks may stay forever in TaskTracker.runningJobs
Date Fri, 09 May 2008 18:24:56 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595686#action_12595686 ]

Zheng Shao commented on HADOOP-3370:

Details about a potential solution:
1. When a task fails, remove the task from runningJobs, but do not delete the job's entry in
runningJobs, even if it was the job's only task (which means we should NOT call TaskTracker.removeTaskFromJob).

2. JobTracker should keep another data structure, jobsToTracker, recording all the TaskTrackers
on which a job has started a task.

3. When the job finishes, JobTracker sends a "KILL job" command to those TaskTrackers, based
on the jobsToTracker data structure.
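
To make the three steps concrete, here is a minimal sketch of the JobTracker-side bookkeeping. It is not Hadoop code: only the name jobsToTracker comes from the proposal above. The class name, the method names and the use of plain strings for job and tracker IDs are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the proposed bookkeeping, not the real JobTracker.
final class JobsToTrackerSketch {
    // jobId -> names of every TaskTracker on which the job has started a task.
    private final Map<String, Set<String>> jobsToTracker = new HashMap<>();

    // Step 2: remember the tracker whenever a task of the job is launched on it.
    void taskLaunched(String jobId, String trackerName) {
        jobsToTracker.computeIfAbsent(jobId, k -> new TreeSet<>()).add(trackerName);
    }

    // Step 1: a failed task leaves the job's entry alone, so the tracker
    // is still known when the job finishes.
    void taskFailed(String jobId, String trackerName) {
        // Intentionally no change to jobsToTracker.
    }

    // Step 3: when the job finishes, return the trackers that should receive
    // a "KILL job" command, and forget the job.
    List<String> jobFinished(String jobId) {
        Set<String> trackers = jobsToTracker.remove(jobId);
        return trackers == null ? new ArrayList<>() : new ArrayList<>(trackers);
    }
}
```

The point of the sketch is that the kill targets no longer depend on which individual tasks are still tracked, only on where the job has ever run.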

An alternative:
When a task fails, remove the task from runningJobs, AND if it was the job's only task, also remove
the job directory (which means we should call TaskTracker.removeTaskFromJob, PLUS delete the
job directory).
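
A rough sketch of this alternative on the TaskTracker side, under the same caveats: the class, its methods and the jobDirDeleter callback are hypothetical stand-ins, and runningJobs is modeled as a plain map from job ID to task IDs rather than the real TaskTracker structure.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch of the alternative, not the real TaskTracker.
final class RunningJobsCleanupSketch {
    // jobId -> IDs of the job's tasks known to this tracker.
    private final Map<String, Set<String>> runningJobs = new HashMap<>();
    // Assumed hook that deletes the local directory of the given job.
    private final Consumer<String> jobDirDeleter;

    RunningJobsCleanupSketch(Consumer<String> jobDirDeleter) {
        this.jobDirDeleter = jobDirDeleter;
    }

    void taskStarted(String jobId, String taskId) {
        runningJobs.computeIfAbsent(jobId, k -> new HashSet<>()).add(taskId);
    }

    // On failure: drop the task, and if it was the job's last task on this
    // tracker, drop the job entry and delete the job directory as well.
    void taskFailed(String jobId, String taskId) {
        Set<String> tasks = runningJobs.get(jobId);
        if (tasks == null) {
            return;
        }
        tasks.remove(taskId);
        if (tasks.isEmpty()) {
            runningJobs.remove(jobId);
            jobDirDeleter.accept(jobId);
        }
    }

    boolean tracksJob(String jobId) {
        return runningJobs.containsKey(jobId);
    }
}
```

With this variant the TaskTracker cleans up on its own and does not rely on a later KILL message from the JobTracker.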

> failed tasks may stay forever in TaskTracker.runningJobs
> --------------------------------------------------------
>                 Key: HADOOP-3370
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3370
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Zheng Shao
>            Priority: Critical
> The net effect of this is that, with a long-running TaskTracker, it takes a very long time
for ReduceTasks on that TaskTracker to fetch map outputs: the TaskTracker does that for all reduce
tasks in TaskTracker.runningJobs, including the stale ReduceTasks. There is a 5-second
delay between 2 requests, so a running ReduceTask waits a long time to get the map
output locations when there are tens of stale ReduceTasks. Of course this also uses up
memory, but at this rate that is not too big a problem.
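
As a back-of-the-envelope illustration of the delay: the sketch below assumes the TaskTracker issues one request per reduce task in turn, with the 5-second pause between consecutive requests. The report does not spell out that round-robin model, so it is an assumption, and the class and method names are made up.

```java
// Hypothetical arithmetic sketch; only the 5-second delay comes from the report.
final class StalePollDelaySketch {
    static final int DELAY_SECONDS = 5;

    // Approximate seconds between two requests made on behalf of the same
    // live reduce task, assuming strict round-robin over live and stale tasks.
    static int secondsBetweenRequests(int liveReduceTasks, int staleReduceTasks) {
        return DELAY_SECONDS * (liveReduceTasks + staleReduceTasks);
    }

    public static void main(String[] args) {
        System.out.println(secondsBetweenRequests(1, 0));   // no stale tasks
        System.out.println(secondsBetweenRequests(1, 30));  // 30 stale tasks
    }
}
```

Under that assumption, one live reducer alongside 30 stale ones would be served roughly every 155 seconds instead of every 5.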
> I've verified the bug by adding an HTML table for TaskTracker.runningJobs to the TaskTracker
HTTP interface, on a 2-node cluster, with a single-mapper, single-reducer job in which the mapper
succeeds and the reducer fails. I can still see the ReduceTask in TaskTracker.runningJobs, while
it's not in the first 2 tables (TaskTracker.tasks and TaskTracker.runningTasks).
> Details:
> TaskRunner.run() will call TaskTracker.reportTaskFinished() when the task fails,
> which calls TaskTracker.TaskInProgress.taskFinished,
> which calls TaskTracker.TaskInProgress.cleanup(),
> which calls TaskTracker.tasks.remove(taskId).
> In short, it removes a failed task from TaskTracker.tasks, but not from TaskTracker.runningJobs.
> Then the failure is reported to JobTracker.
> JobTracker.heartbeat will call processHeartbeat, 
> which calls updateTaskStatuses, 
> which calls tip.getJob().updateTaskStatus, 
> which calls JobInProgress.failedTask,
> which calls JobTracker.markCompletedTaskAttempt, 
> which puts the task into trackerToMarkedTasksMap,
> and then JobTracker.heartbeat will call removeMarkedTasks,
> which calls removeTaskEntry,
> which removes it from trackerToTaskMap.
> JobTracker.heartbeat will also call JobTracker.getTasksToKill,
> which reads from trackerToTaskMap for <tracker, task> pairs,
> and asks the tracker to KILL the task or the job of the task.
> In the case there is only one task for a specific job on a specific tracker 
> and that task failed (NOTE: and that task is not the last failed try of the
> job - otherwise JobTracker.getTasksToKill will pick it up before 
> removeMarkedTasks comes in and removes it from trackerToTaskMap), the task
> tracker will not receive the KILL task or KILL job message from the JobTracker.
> As a result, the task will remain in TaskTracker.runningJobs forever.
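
The sequence above can be reproduced with a much-simplified model. Only the map name trackerToTaskMap comes from the report. The class and its methods are hypothetical, and failedTaskReported collapses markCompletedTaskAttempt, removeMarkedTasks and removeTaskEntry into a single step.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified model of the JobTracker bookkeeping described above.
final class MissedKillSketch {
    private final Map<String, Set<String>> trackerToTaskMap = new HashMap<>();
    private final Map<String, String> taskToJob = new HashMap<>();
    private final Set<String> finishedJobs = new HashSet<>();

    void taskAssigned(String tracker, String jobId, String taskId) {
        trackerToTaskMap.computeIfAbsent(tracker, k -> new HashSet<>()).add(taskId);
        taskToJob.put(taskId, jobId);
    }

    // A reported failure removes the <tracker, task> pair right away.
    void failedTaskReported(String tracker, String taskId) {
        Set<String> tasks = trackerToTaskMap.get(tracker);
        if (tasks != null) {
            tasks.remove(taskId);
        }
    }

    void jobFinished(String jobId) {
        finishedJobs.add(jobId);
    }

    // Stand-in for getTasksToKill: only trackers that still have a
    // <tracker, task> pair for a finished job are told to kill it.
    List<String> jobsToKill(String tracker) {
        Set<String> jobs = new TreeSet<>();
        Set<String> tasks = trackerToTaskMap.getOrDefault(tracker, Collections.emptySet());
        for (String taskId : tasks) {
            String jobId = taskToJob.get(taskId);
            if (finishedJobs.contains(jobId)) {
                jobs.add(jobId);
            }
        }
        return new ArrayList<>(jobs);
    }
}
```

If the job's only task on a tracker fails and the job finishes later, jobsToKill returns nothing for that tracker, which mirrors the missed KILL message.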
> Solution:
> Remove the task from TaskTracker.runningJobs at the same time when we remove it from

