hadoop-mapreduce-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: I/O errors reading task output on 20.205.0
Date Thu, 29 Dec 2011 18:18:59 GMT
We just reproduced it (somehow) by running jobs concurrently that read the same
data. Two out of three similar jobs died early in the map phase with "Could not
obtain block" errors; one finished completely.

java.io.IOException: Could not obtain block: blk_119146860335302651_13067 file=/user/systems/crawl/crawldb/current/part-00000/data
	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2093)
	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1897)
	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2048)
	at java.io.DataInputStream.readFully(DataInputStream.java:178)
	at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
	at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2069)
	at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:68)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
	at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)
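
For what it's worth, whether a block is really missing or corrupt can be checked directly from the namenode with fsck, e.g.:

  hadoop fsck /user/systems/crawl/crawldb/current -files -blocks -locations

Since the same data reads fine when only one job touches it, the block itself seems to be intact; the error only shows up under concurrent access.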

Another job (different, but reading the same data) finished the map phase but
then died partially (half of the reducers failed) and froze completely.

2011-12-29 18:07:58,899 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2011-12-29 18:07:58,899 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_4748641522370871094_13532
2011-12-29 18:07:58,900 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 141.105.120.154:50010
2011-12-29 18:07:58,902 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2011-12-29 18:07:58,902 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1454920600140944030_13532
2011-12-29 18:07:58,903 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 141.105.120.152:50010
2011-12-29 18:07:58,907 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 141.105.120.153:50010
2011-12-29 18:07:58,907 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_3551418605384221738_13532
2011-12-29 18:07:58,908 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 141.105.120.153:50010
2011-12-29 18:07:58,910 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2011-12-29 18:07:58,910 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1826030182013954555_13532
2011-12-29 18:07:58,911 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 141.105.120.150:50010
2011-12-29 18:07:58,911 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3213)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2300(DFSClient.java:2406)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2646)

2011-12-29 18:07:58,912 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-1826030182013954555_13532 bad datanode[0] nodes == null
2011-12-29 18:07:58,912 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/user/systems/generate-temp-1325180944829/_temporary/_attempt_201112290956_0012_r_000004_0/fetchlist-13/part-00004" - Aborting...
2011-12-29 18:07:59,049 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2011-12-29 18:07:59,062 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.EOFException
	at java.io.DataInputStream.readShort(DataInputStream.java:298)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:3272)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3196)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2300(DFSClient.java:2406)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2646)
2011-12-29 18:07:59,064 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

It smells like the datanodes in 20.205.0 don't deal well with concurrent jobs,
especially when they read the same data.
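
One thing I still want to rule out (just a guess on my part, nothing confirmed) is the
datanode xceiver limit: several jobs streaming the same blocks at once could exhaust it
and make the datanodes drop new connections, which would look like the EOFExceptions
above. The datanode logs should tell, e.g. (log path depends on the install):

  grep -iE "xceiver|xciever" /var/log/hadoop/hadoop-*-datanode-*.log

If the limit is being hit, raising dfs.datanode.max.xcievers (note the historic spelling)
in hdfs-site.xml, say to 4096, and restarting the datanodes would be the usual remedy.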

Is there any advice for this? Again, this does not happen on 20.203.0.
Many thanks


> I should add that the failing tasks that ran concurrently all read the same
> map files from HDFS.
> 
> > Hi,
> > 
> > We just ran large-scale Apache Nutch jobs in our evaluation of
> > 20.205.0 and they all failed. Some of these jobs ran concurrently with
> > the fair scheduler enabled. These were simple jobs consuming little RAM.
> > I double checked and there were certainly no RAM issues.
> > 
> > All jobs failed and most tasks had a less than descriptive message. A few
> > reported that they dealt with I/O errors reading task output. However, the
> > data they read is fine. When we ran the same jobs manually (and some
> > concurrently), some did fine and others died with I/O errors reading
> > task output again!
> > 
> > The heap allocation for the reducers is not high, but no OOMs were
> > reported. Besides the occasional I/O error, which I think is strange
> > enough, most tasks did not write anything to the logs that I can link to
> > this problem.
> > 
> > We do not see this happening on our 20.203.0 cluster, although resources
> > and settings differ. The 205 cluster is a new high-end cluster with similarly
> > conservative settings, only with more mappers/reducers per node. Resource
> > settings are almost identical. The 203 cluster has three times as many
> > machines, so also more open file descriptors and threads.
> > 
> > Any thoughts to share?
> > Thanks,
