hadoop-common-dev mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5761) JobTracker and TaskTracker enter infinite loop when TaskTracker reports bad taskid
Date Tue, 19 May 2009 06:44:46 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710640#action_12710640 ]

Todd Lipcon commented on HADOOP-5761:

I haven't been able to reproduce this bug (I was responding to an email from Lance on core-user),
but if you assume that a task can enter such a state, it seems reasonable that
the JT should send a Kill action for that task.

To me, this looks similar to HADOOP-5374 but not quite the same. The task probably gets into
COMMIT_PENDING state in the same way, but the manifestation was a slightly different stack trace:

2009-04-30 02:34:40,215 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 54311,
call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1a93388, false, true, 5341) from error: java.io.IOException: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
       at org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
       at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
(from the original core-user email from Lance)

In the HADOOP-5374 stack trace, the tip variable in getTasksToSave was non-null, and the NPE
occurred inside shouldCommit. In this case, tip itself is null.

Lance also reported this as occurring on Hadoop 0.19.1.

Unfortunately I don't know of any way to reproduce this, but it seems logical that if a task
somehow enters a bad state (by means of some unknown race condition), the JT's interaction with
the TT should restore a proper state by killing that task.
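
To make the proposal concrete, here is a minimal sketch of the idea, not the actual JobTracker code. Everything in it is a hypothetical stand-in: the class HeartbeatSketch, its Action type, the actionsFor method, and the task ids are all invented for illustration. It only shows the shape of the guard: when a TT reports a COMMIT_PENDING task the JT has no record of, answer with a kill action instead of dereferencing a null entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the JT side of a heartbeat; NOT the real
// org.apache.hadoop.mapred.JobTracker and not its API.
class HeartbeatSketch {
    enum ActionType { COMMIT, KILL }

    static final class Action {
        final ActionType type;
        final String taskId;
        Action(ActionType type, String taskId) {
            this.type = type;
            this.taskId = taskId;
        }
    }

    // Tasks this sketch "knows about", mapped to whether they may commit.
    private final Map<String, Boolean> knownTasks = new HashMap<>();

    void register(String taskId, boolean shouldCommit) {
        knownTasks.put(taskId, shouldCommit);
    }

    // Build the actions to return for the COMMIT_PENDING tasks a TT reported.
    List<Action> actionsFor(List<String> commitPendingTaskIds) {
        List<Action> actions = new ArrayList<>();
        for (String id : commitPendingTaskIds) {
            Boolean shouldCommit = knownTasks.get(id);
            if (shouldCommit == null) {
                // Unknown task: ask the TT to kill it rather than throwing an NPE.
                actions.add(new Action(ActionType.KILL, id));
            } else if (shouldCommit) {
                actions.add(new Action(ActionType.COMMIT, id));
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        HeartbeatSketch jt = new HeartbeatSketch();
        jt.register("task_known", true); // hypothetical task id
        List<Action> actions =
            jt.actionsFor(Arrays.asList("task_known", "task_unknown"));
        for (Action a : actions) {
            System.out.println(a.type + " " + a.taskId);
        }
    }
}
```

The point of the guard is that the heartbeat still completes normally, so the TT receives an instruction that clears the bad state instead of an exception that leaves it unchanged.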

> JobTracker and TaskTracker enter infinite loop when TaskTracker reports bad taskid
> ----------------------------------------------------------------------------------
>                 Key: HADOOP-5761
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5761
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Todd Lipcon
> If the TaskTracker somehow gets into a state where it has a task in COMMIT_PENDING state
> that the JobTracker does not know about, the JobTracker will throw NPEs while processing heartbeats.
> Due to HADOOP-3987, this causes the JT and TT to enter an infinite heartbeat loop with no
> delays, and the TT fails to make progress.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.