hadoop-common-dev mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: [jira] Commented: (HADOOP-573) Checksum error during sorting in reducer
Date Tue, 29 May 2007 13:53:41 GMT
Trust me when I tell you that this is a hardware problem, most likely a 
memory problem.  If you are not using ECC memory, I recommend switching 
to it.  We had these same problems over a cluster of 50 machines and all 
problems disappeared when ECC memory was used.  We spent over a month 
tracking down and fixing these problems, including over 2 weeks combing 
through the hadoop source code, convinced that it was a software bug.  We 
ran hardware tests on hard disk, CPU, and memory (which all passed), we 
tried upgrading to the latest version of hadoop, and we tried turning off 
checksum validation in the configuration.  Errors persisted.  Hadoop can 
be especially hard on memory, most often during sorts, and I don't know 
why, but "good" non-ECC memory can sometimes fail intermittently during 
these jobs.  The memory can test out fine under other load but fail 
during hadoop jobs.
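
As a rough illustration of why a single flipped bit in memory shows up as 
a checksum error, here is a minimal sketch.  It uses java.util.zip.CRC32 
from the JDK, not Hadoop's own ChecksumFileSystem code, and the class 
name, the 512-byte chunk, and the position of the flipped bit are all 
assumptions made up for the example.

```java
import java.util.zip.CRC32;

// Hypothetical demo class; not part of Hadoop.
class BitFlipDemo {
    // Compute a CRC32 checksum over the whole buffer.
    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    // Return a copy of data with one bit inverted, simulating a memory error.
    static byte[] flipBit(byte[] data, int byteIndex, int bit) {
        byte[] copy = data.clone();
        copy[byteIndex] ^= (byte) (1 << bit);
        return copy;
    }

    public static void main(String[] args) {
        // Made-up chunk of data standing in for part of a map output file.
        byte[] chunk = new byte[512];
        for (int i = 0; i < chunk.length; i++) {
            chunk[i] = (byte) i;
        }
        long stored = crc(chunk);                  // checksum recorded at write time
        byte[] corrupted = flipBit(chunk, 100, 3); // one bit goes bad before the read
        // A CRC detects any single-bit error, so the comparison fails.
        System.out.println("checksum matches after bit flip: "
                + (crc(corrupted) == stored));
    }
}
```

The point of the sketch is only that the checksum is doing its job: it 
cannot tell you whether the bit went bad in memory, on disk, or on the 
bus, just that the bytes read are not the bytes written.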

What you should be seeing, depending on the number of machines that 
you are running, is that the errors occur on many machines but happen 
more often on some of them.  They should also be happening most often 
during major sorting jobs, such as mergesegs and merge crawldb (if you 
are using Nutch).  If you take one of the machines where the errors 
occur frequently and change out its memory for ECC memory you will see 
your errors decrease by an order of magnitude.  Do it for all of your 
machines and the problems will disappear completely.  If changing out 
for ECC on a single machine doesn't decrease your error rate then I 
still think it is a hardware problem and would start looking at the hard 
disk or motherboard for that machine.
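
To find the machines where the errors occur most often, a tally per host 
is usually enough.  The following is a minimal sketch, assuming you have 
already gathered your task logs into lines of the form 
"hostname<TAB>log line"; that input format, the class name, and the 
sample host names are all hypothetical and not something Hadoop produces 
for you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of Hadoop.
class ChecksumErrorTally {
    // Count lines mentioning ChecksumException, grouped by the host name
    // that precedes the first tab. Lines without a tab are ignored.
    static Map<String, Integer> countByHost(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0 || !line.contains("ChecksumException")) {
                continue;
            }
            counts.merge(line.substring(0, tab), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample input, for illustration only.
        List<String> sample = Arrays.asList(
                "node07\torg.apache.hadoop.fs.ChecksumException: Checksum error: ...",
                "node07\torg.apache.hadoop.fs.ChecksumException: Checksum error: ...",
                "node12\torg.apache.hadoop.fs.ChecksumException: Checksum error: ...",
                "node12\tsome unrelated log line");
        List<Map.Entry<String, Integer>> rows =
                new ArrayList<>(countByHost(sample).entrySet());
        // Worst offenders first: these are the candidates for a memory swap.
        rows.sort((a, b) -> b.getValue() - a.getValue());
        for (Map.Entry<String, Integer> e : rows) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Running the same tally before and after swapping the memory on one of 
the top hosts gives you a direct comparison of its error rate.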

We would have 20-30 failures for checksums during major jobs.  Now we 
haven't had a single checksum error in 2-3 weeks of processing (and we 
have been running a major job continuously for the last 5 days).  The 
bad news is this.  If you don't change out for ECC memory and you do 
have a major hardware issue somewhere, you can end up with corruption on 
all replicas of a given block.  This occurred with our system 
shortly before changing out all memory to ECC.  With that block we lost 
approximately 20-30K urls.  If hadoop hadn't implemented the ability to 
continue a job even if a single map task fails (only recently 
implemented, and thank you for that), over 5M urls would have been lost. 
If this had happened in one of our master databases we could have lost 
the master.  My point is this: hardware problems are hard to track down 
and, if they get bad enough, can cause a complete failure of the cluster.

If you haven't already done so I suggest searching the hadoop users list 
for checksum errors.  You will find a thread detailing our trials.  Hope 
this helps.

Dennis Kubes

Julian Neil (JIRA) wrote:
>     [ https://issues.apache.org/jira/browse/HADOOP-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499677 ]
> Julian Neil commented on HADOOP-573:
> ------------------------------------
> I've been seeing the same problem, and other checksum problems. I am somewhat sceptical
> of the suggestion that it is a memory hardware issue, but to be thorough I tried replacing my
> memory.  The errors continued. If there is any additional information I can provide to help
> track the problem down, please let me know.
> Running on a single Windows Server 2003 (with cygwin) as both namenode and datanode.
> Strangely, some large map/reduce jobs never get checksum errors in the maps or reduces,
> but one particular job always does.
> In addition I have been getting many lost map outputs due to checksum errors.  The error
> usually disappears when the task is retried:
> Map output lost, rescheduling: getMapOutput(task_0008_m_000007_0,0) failed :
> org.apache.hadoop.fs.ChecksumException: Checksum error: /tmp/hadoop-sshd_server/mapred/local/task_0008_m_000007_0/file.out at 60215808
> 	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.verifySum(ChecksumFileSystem.java:258)
> 	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:211)
> 	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:167)
> 	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
> 	at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> 	at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> 	at java.io.DataInputStream.read(DataInputStream.java:132)
> 	at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:1674)
> 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
> 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> 	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
> 	at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
> 	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
> 	at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
> 	at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
> 	at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
> 	at org.mortbay.http.HttpServer.service(HttpServer.java:954)
> 	at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
> 	at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
> 	at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
> 	at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
> 	at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
> 	at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
> I'm also getting errors in the final output of the previous map/reduce job, which is fed in
> as input to the next job.  These errors do not disappear when the map task retries:
> org.apache.hadoop.fs.ChecksumException: Checksum error: hdfs://xxx.xxx.xxx:9900/aa/datamining/deviations_part-00002_step-00001/part-00000 at 13781504
> 	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.verifySum(ChecksumFileSystem.java:258)
> 	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:211)
> 	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:167)
> 	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
> 	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> 	at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> 	at org.apache.hadoop.fs.FSDataInputStream$Buffer.read(FSDataInputStream.java:93)
> 	at java.io.DataInputStream.readInt(DataInputStream.java:372)
> 	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1523)
> 	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1436)
> 	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1482)
> 	at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:73)
> 	at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
> 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
> 	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
>> Checksum error during sorting in reducer
>> ----------------------------------------
>>                 Key: HADOOP-573
>>                 URL: https://issues.apache.org/jira/browse/HADOOP-573
>>             Project: Hadoop
>>          Issue Type: Bug
>>          Components: mapred
>>            Reporter: Runping Qi
>>            Assignee: Owen O'Malley
>> Many reduce tasks got killed due to checksum error. The strange thing is that the
>> file was generated by the sort function, and was on a local disk. Here is the stack:
>> Checksum error:  ../task_0011_r_000140_0/all.2.1 at 5342920704
>> 	at org.apache.hadoop.fs.FSDataInputStream$Checker.verifySum(FSDataInputStream.java:134)
>> 	at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:110)
>> 	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:170)
>> 	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>> 	at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>> 	at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
>> 	at java.io.DataInputStream.readFully(DataInputStream.java:176)
>> 	at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
>> 	at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
>> 	at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:1061)
>> 	at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:1126)
>> 	at org.apache.hadoop.io.SequenceFile$Reader.nextRaw(SequenceFile.java:1354)
>> 	at org.apache.hadoop.io.SequenceFile$Sorter$MergeStream.next(SequenceFile.java:1880)
>> 	at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:1938)
>> 	at org.apache.hadoop.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:1802)
>> 	at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:1749)
>> 	at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:1494)
>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:240)
>> 	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1066)
