hadoop-common-dev mailing list archives

From "Julian Neil (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-573) Checksum error during sorting in reducer
Date Tue, 29 May 2007 04:14:15 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499677 ]

Julian Neil commented on HADOOP-573:
------------------------------------

I've been seeing the same problem, along with other checksum problems. I am somewhat sceptical of
the suggestion that it is a memory hardware issue, but to be thorough I tried replacing my memory.
The errors continued. If there is any additional information I can provide to help track
the problem down, please let me know.
I'm running a single Windows Server 2003 machine (with Cygwin) as both namenode and datanode.
Strangely, some large map/reduce jobs never get checksum errors in their maps or reduces, but
one particular job always does.
 
In addition, I have been getting many lost map outputs due to checksum errors. The error usually
disappears when the task is retried (a sketch of the checksum scheme behind these errors follows the trace):

Map output lost, rescheduling: getMapOutput(task_0008_m_000007_0,0) failed :
org.apache.hadoop.fs.ChecksumException: Checksum error: /tmp/hadoop-sshd_server/mapred/local/task_0008_m_000007_0/file.out at 60215808
	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.verifySum(ChecksumFileSystem.java:258)
	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:211)
	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:167)
	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
	at java.io.DataInputStream.read(DataInputStream.java:132)
	at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:1674)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
	at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
	at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
	at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
	at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
	at org.mortbay.http.HttpServer.service(HttpServer.java:954)
	at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
	at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
	at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
	at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
	at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
	at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
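
For context on where these errors surface: ChecksumFileSystem keeps a CRC-32 for every
fixed-size chunk of the data file (io.bytes.per.checksum, 512 bytes by default) in a hidden
.crc sidecar file, and FSInputChecker recomputes each chunk's CRC as it reads. That is
consistent with the failing offsets in this report (60215808, 13781504, 5342920704) all
being exact multiples of 512. Below is a minimal sketch of that per-chunk verification;
it is illustrative only, not Hadoop's actual code, and the expectedCrcs array stands in
for the parsed sidecar, whose on-disk header format I am not reproducing here.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.CRC32;

// Illustrative only: per-chunk CRC-32 verification in the style of
// FSInputChecker.verifySum(). Failure offsets are always multiples of
// BYTES_PER_SUM, matching the "at 60215808" above.
public class ChunkChecksumSketch {
    static final int BYTES_PER_SUM = 512; // io.bytes.per.checksum default

    // Returns the byte offset of the first chunk whose CRC-32 does not
    // match, or -1 if every chunk verifies. expectedCrcs is assumed to
    // hold one CRC per chunk (in Hadoop these live in the .crc sidecar).
    static long firstBadChunkOffset(String dataFile, long[] expectedCrcs)
            throws IOException {
        byte[] chunk = new byte[BYTES_PER_SUM];
        CRC32 crc = new CRC32();
        try (FileInputStream in = new FileInputStream(dataFile)) {
            long offset = 0;
            for (long expected : expectedCrcs) {
                int n = in.readNBytes(chunk, 0, BYTES_PER_SUM);
                if (n <= 0) {
                    break; // EOF: fewer data chunks than stored checksums
                }
                crc.reset();
                crc.update(chunk, 0, n);
                if (crc.getValue() != expected) {
                    return offset; // the offset a ChecksumException reports
                }
                offset += n;
            }
        }
        return -1;
    }
}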

I'm also getting checksum errors in the final output of the previous map/reduce job, which is fed in as
input to the next job. Unlike the lost map outputs above, these errors do not disappear when the map
task retries (a note on that distinction follows the trace):

org.apache.hadoop.fs.ChecksumException: Checksum error: hdfs://xxx.xxx.xxx:9900/aa/datamining/deviations_part-00002_step-00001/part-00000 at 13781504
	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.verifySum(ChecksumFileSystem.java:258)
	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:211)
	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:167)
	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
	at org.apache.hadoop.fs.FSDataInputStream$Buffer.read(FSDataInputStream.java:93)
	at java.io.DataInputStream.readInt(DataInputStream.java:372)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1523)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1436)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1482)
	at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:73)
	at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
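
The fact that this second error survives retries suggests the corruption is in the stored
bytes (or their stored checksum) rather than in a transient read path. A cheap way to test
that distinction on a suspect file is to read it twice and compare digests: a stable digest
across reads points at bad data at rest, while digests that differ between reads point at
flaky memory or the I/O path. A minimal sketch of that check follows; the class name and
the idea of running it against a local copy of the failing part-00000 are mine, not
anything from Hadoop.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RereadCheck {
    // Hash the whole file with SHA-256 in 8 KiB reads.
    static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            for (int n; (n = in.read(buf)) > 0; ) {
                md.update(buf, 0, n);
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]); // e.g. a local copy of the bad part-00000
        boolean stable = MessageDigest.isEqual(sha256(file), sha256(file));
        System.out.println(stable
            ? "bytes stable across reads -> suspect stored data/checksum"
            : "bytes differ across reads -> suspect memory or I/O path");
    }
}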


> Checksum error during sorting in reducer
> ----------------------------------------
>
>                 Key: HADOOP-573
>                 URL: https://issues.apache.org/jira/browse/HADOOP-573
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Owen O'Malley
>
> Many reduce tasks got killed due to checksum errors. The strange thing is that the file
> was generated by the sort function, and was on a local disk. Here is the stack:
> Checksum error:  ../task_0011_r_000140_0/all.2.1 at 5342920704
> 	at org.apache.hadoop.fs.FSDataInputStream$Checker.verifySum(FSDataInputStream.java:134)
> 	at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:110)
> 	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:170)
> 	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> 	at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> 	at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
> 	at java.io.DataInputStream.readFully(DataInputStream.java:176)
> 	at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
> 	at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
> 	at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:1061)
> 	at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:1126)
> 	at org.apache.hadoop.io.SequenceFile$Reader.nextRaw(SequenceFile.java:1354)
> 	at org.apache.hadoop.io.SequenceFile$Sorter$MergeStream.next(SequenceFile.java:1880)
> 	at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:1938)
> 	at org.apache.hadoop.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:1802)
> 	at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:1749)
> 	at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:1494)
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:240)
> 	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1066)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

