hadoop-common-dev mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5761) JobTracker and TaskTracker enter infinite loop when TaskTracker reports bad taskid
Date Thu, 30 Apr 2009 19:05:30 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704749#action_12704749 ]

Todd Lipcon commented on HADOOP-5761:

In addition to fixing HADOOP-3987 to avoid the infinite loop issue, I think there is a better
recovery option here. I would propose something like:

private synchronized List<TaskTrackerAction> getKillActionsForBadTasks(TaskTrackerStatus tts) {
  List<TaskStatus> taskStatuses = tts.getTaskReports();
  List<TaskTrackerAction> actions = new ArrayList<TaskTrackerAction>();
  if (taskStatuses != null) {
    for (TaskStatus taskStatus : taskStatuses) {
      TaskAttemptID taskId = taskStatus.getTaskID();
      // Kill any task the JobTracker has no record of.
      if (!taskidToTIPMap.containsKey(taskId)) {
        LOG.info("Task Tracker reported status on ID " + taskId + " unknown to JobTracker. Killing task.");
        actions.add(new KillTaskAction(taskId));
      }
    }
  }
  return actions;
}

Then fix getTasksToSave (and other instances of the issue) to add "tip != null" checks to
their if statements.
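
To make the "tip != null" guard concrete, here is a minimal, self-contained sketch. It is not the actual JobTracker code: NullGuardSketch, Tip, register, and tasksToSave are hypothetical stand-ins, task IDs are plain strings, and the real condition inside getTasksToSave is not shown in this message, so a simple shouldCommit flag takes its place. The only point is that a lookup miss is skipped rather than dereferenced.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the JobTracker; illustrates the null guard only.
class NullGuardSketch {

    // Simplified placeholder for TaskInProgress (assumption, not the real class).
    static final class Tip {
        final boolean shouldCommit;

        Tip(boolean shouldCommit) {
            this.shouldCommit = shouldCommit;
        }
    }

    // Plays the role of taskidToTIPMap, keyed by plain String ids here.
    private final Map<String, Tip> taskidToTIPMap = new HashMap<String, Tip>();

    // Records a task the tracker knows about.
    void register(String taskId, boolean shouldCommit) {
        taskidToTIPMap.put(taskId, new Tip(shouldCommit));
    }

    // Returns the reported ids that should be saved, ignoring unknown ids.
    List<String> tasksToSave(List<String> commitPendingIds) {
        List<String> toSave = new ArrayList<String>();
        for (String id : commitPendingIds) {
            Tip tip = taskidToTIPMap.get(id);
            // Without "tip != null", an unknown id would throw a
            // NullPointerException on the next dereference.
            if (tip != null && tip.shouldCommit) {
                toSave.add(id);
            }
        }
        return toSave;
    }
}
```

With this guard, an unknown ID in a heartbeat is ignored by the save path, which leaves it to be handled by the kill action proposed above.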

Does this sound like a reasonable recovery strategy?

> JobTracker and TaskTracker enter infinite loop when TaskTracker reports bad taskid
> ----------------------------------------------------------------------------------
>                 Key: HADOOP-5761
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5761
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Todd Lipcon
> If the TaskTracker somehow gets into a state where it has a task in COMMIT_PENDING state
> that the JobTracker does not know about, the JobTracker will throw NPEs while processing heartbeats.
> Due to HADOOP-3987, this causes the JT and TT to enter an infinite heartbeat loop with no
> delays, and the TT fails to make progress.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
