hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "stack@archive.org (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-82) JobTracker loses it: NoSuchElementException
Date Wed, 15 Mar 2006 17:15:58 GMT
JobTracker loses it: NoSuchElementException
-------------------------------------------

         Key: HADOOP-82
         URL: http://issues.apache.org/jira/browse/HADOOP-82
     Project: Hadoop
        Type: Bug
    Reporter: stack@archive.org
    Priority: Minor


On a number of occasions, JobTracker goes into a loop that it never recovers from.  Over and
over it prints the below to the jobtracker log.

060304 124522 Server handler 5 on 8010 call error: java.io.IOException: java.util.NoSuchElementException
java.io.IOException: java.util.NoSuchElementException
   at java.util.TreeMap.key(TreeMap.java:433)
   at java.util.TreeMap.firstKey(TreeMap.java:287)
   at java.util.TreeSet.first(TreeSet.java:407)
   at org.apache.hadoop.mapred.TaskInProgress.getTaskToRun(TaskInProgress.java:428Timed out.org.apache.hadoop.fs.ChecksumException:
Checksum error:/2/hadoop/nara/data/tmp/task_r_m80hob/all.1 at 1554810368 at
org.apache.hadoop.fs.FSDataInputStream$Checker.verifySum(FSDataInputStream.java:122)
at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:98)
at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:158)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at
java.io.BufferedInputStream.read(BufferedInputStream.java:235) at
org.apache.hadoop.fs.FSDataInputStream$Buffer.read(FSDataInputStream.java:210)
at java.io.DataInputStream.readInt(DataInputStream.java:353) at
org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:367) at
org.apache.hadoop.io.SequenceFile$Sorter$SortPass.run(SequenceFile.java:557)
at org.apache.hadoop.io.SequenceFile$Sorter.sortPass(SequenceFile.java:523)
at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:511)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:254) at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:666)

I added debug to TIP#getTaskToRun so i could tell which TIP had empted its allotment of tasks.
 Below is extract from jobtracker log that shows sequence of events for TIP tip_fizr7m that
lead up to JT losing it:

060314 203637 Adding task 'task_m_4d6ht0' to tip tip_fizr7m, for tracker 'tracker_41791' on
ia109314.archive.org 
060314 204758 Task 'task_m_4d6ht0' has been lost.
060314 204811 Adding task 'task_m_fb0wf0' to tip tip_fizr7m, for tracker 'tracker_70065' on
ia109314.archive.org
060314 210118 Task 'task_m_fb0wf0' has been lost.
060314 210119 Adding task 'task_m_irar47' to tip tip_fizr7m, for tracker 'tracker_82285' on
ia109324.archive.org
060314 211541 Taskid 'task_m_irar47' has finished successfully.
060314 211541 Task 'task_m_irar47' has completed.
060314 211543 Adding task 'task_m_qo1g69' to tip tip_fizr7m, for tracker 'tracker_97839' on
ia109306.archive.org
060314 213004 Taskid 'task_m_qo1g69' has finished successfully.
060314 213004 Task 'task_m_qo1g69' has completed.
060314 213005 Adding task 'task_m_t0lnzk' to tip tip_fizr7m, for tracker 'tracker_57273' on
ia109314.archive.org
060314 214118 Task 'task_m_t0lnzk' has been lost.

So, we lose two, complete two, then lose a third.

TIP should have been done on first completion.

TIP accounting is off.




-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


Mime
View raw message