hadoop-common-user mailing list archives

From Jim Twensky <jim.twen...@gmail.com>
Subject Tasktracker failing and getting blacklisted
Date Thu, 23 Dec 2010 22:37:05 GMT
Hi,

I have a 16+1 node Hadoop cluster where all tasktrackers (and
datanodes) are connected to the same switch and share the exact same
hardware and software configuration. When I run a Hadoop job, one of
the tasktrackers always produces one of the two errors below, ONLY
during the reduce tasks, and eventually gets blacklisted.
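
(For reference, the blacklisting I am seeing looks like the per-job kind;
as far as I understand it is controlled by mapred.max.tracker.failures,
which I believe defaults to 4 failed task attempts per tracker per job on
0.20/1.x. A minimal sketch of setting it on the job configuration, in case
that matters; the job name is made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BlacklistConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Per-job blacklisting: after this many failed task attempts on a single
    // tracker, the JobTracker stops scheduling this job's tasks there.
    // (Property name and default of 4 are my understanding of 0.20/1.x.)
    conf.setInt("mapred.max.tracker.failures", 4);
    Job job = new Job(conf, "phrase-gen");   // "phrase-gen" is a made-up name
  }
}

I have left the value at its default, so this should not change the
behaviour I am describing.)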

---------------------------------------------------------------------------------
org.apache.hadoop.fs.ChecksumException: Checksum Error
	at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:164)
	at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
	at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:328)
	at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:358)
	at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
	at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:374)
	at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
	at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
	at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
	at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:111)
	at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:86)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:173)
	at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1214)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1500)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1116)
	at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:512)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:585)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)
---------------------------------------------------------------------------------

or

---------------------------------------------------------------------------------
java.lang.RuntimeException: next value iterator failed
	at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:160)
	at src.expinions.PhraseGen.ReduceClass.reduce(ReduceClass.java:17)
	at src.expinions.PhraseGen.ReduceClass.reduce(ReduceClass.java:10)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
	at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1214)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1500)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1116)
	at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:512)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:585)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum Error
	at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:164)
	at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
	at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:328)
	at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:358)
	at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
	at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:404)
	at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
	at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
	at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
	at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:111)
	at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:157)
---------------------------------------------------------------------------------

It is always the same node, and that node runs the map tasks
successfully without any problems. I double-checked the available disk
space and the other settings and couldn't find anything different from
the rest of the cluster. I also tried different jobs and different
input data, but the result is always the same.
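
One further check I can think of is stress-writing a large file on that
node's local disks through Hadoop's checksummed local file system and
reading it back, since silent corruption should then surface as the same
ChecksumException. A rough sketch of what I have in mind (the path and
size are placeholders; the file probably needs to be larger than RAM so
the read actually hits the disk):

import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalDiskCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder path; point it at the drive that holds mapred.local.dir.
    Path testFile = new Path(args.length > 0 ? args[0] : "/tmp/disk-check.dat");
    int sizeMb = args.length > 1 ? Integer.parseInt(args[1]) : 512;

    // FileSystem.getLocal() returns the CRC-checksummed local file system,
    // so corrupted data should show up as o.a.h.fs.ChecksumException on read.
    FileSystem fs = FileSystem.getLocal(new Configuration());

    byte[] buf = new byte[1 << 20];          // 1 MB of pseudo-random data
    new Random(42).nextBytes(buf);

    FSDataOutputStream out = fs.create(testFile, true);
    for (int i = 0; i < sizeMb; i++) {
      out.write(buf);
    }
    out.close();

    FSDataInputStream in = fs.open(testFile);
    byte[] back = new byte[1 << 20];
    for (int i = 0; i < sizeMb; i++) {
      in.readFully(back);                    // throws ChecksumException on mismatch
    }
    in.close();
    fs.delete(testFile, false);
    System.out.println("read-back OK, no checksum errors");
  }
}

I have not run this on the bad node yet, so I don't know whether it will
reproduce the error.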

Any ideas?
