hadoop-mapreduce-user mailing list archives

From ed <edor...@gmail.com>
Subject DataXceiver WRITE_BLOCK: Premature EOF from inputStream: Using Avro Multiple Outputs
Date Sat, 04 Jul 2015 20:47:24 GMT
Hello,

We are running a job that uses Avro Multiple Outputs (Avro 1.7.5).
When there are lots of output files, the job was failing with the following
error, which I believed was the cause of the failure:


hc1hdfs2p.thecarlylegroup.local:50010:DataXceiverServer:
java.io.IOException: Xceiver count 4097 exceeds the limit of concurrent xcievers: 4096
         at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:137)
         at java.lang.Thread.run(Thread.java:744)

This error starts to appear when we have lots of output directories due to
our use of AvroMultipleOutputs (all maps complete without issue; the
multiple-output writing is done in the reducers, and those are what fail).
I increased dfs.datanode.max.xcievers to 8192 and reran the job. In
Cloudera Manager I saw that the transceiver count across nodes peaked at 5376,
so raising the limit to 8192 resolved that first error. Unfortunately, the job
still failed. The HDFS logs no longer showed the xceiver-limit error, but
I now saw lots of the following:

hc1hdfs3p.thecarlylegroup.local:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.14.5.83:53280 dest: /10.14.5.81:50010
java.io.IOException: Premature EOF from inputStream
         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:711)
         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
         at java.lang.Thread.run(Thread.java:744)

These errors occur frequently, and I'm not sure whether they are
related to the job failure. They match exactly the error pattern
seen in the thread below (which unfortunately had no responses):

http://mail-archives.apache.org/mod_mbox/hadoop-user/201408.mbox/%3CCAJOOh6E1D1bx_9NrAUPPzAb6x1=Fxd52RGqWXfzWY5TPjiWCxg@mail.gmail.com%3E

The only other warnings I see occurring around the same time as the job
failure are:

WARN Failed to place enough replicas, still in need of 1 to reach 3. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor Exit code from container container_1417712817932_31879_01_002464 is : 143



Does anyone have any ideas about what could be causing the job to fail? I did
not see anything obvious in the Cloudera Manager charts or logs.
For example, the number of open files was below the limit and memory use was
well within what the nodes have (6 nodes with 90GB each). There were no errors
in YARN either.
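
For context on the xceiver numbers above, here is a rough back-of-the-envelope sketch (in Python) of how many simultaneously open output files translate into DataNode transfer threads. It assumes that each file being written holds one block pipeline, that every DataNode in that pipeline spends roughly one xceiver thread on it, and that pipelines spread evenly over the nodes. The replication factor of 3 and the 6 nodes come from the details above; the reducer and file counts are hypothetical placeholders, not measurements from this job.

```python
import math


def estimate_xceivers_per_node(open_files, replication, datanodes):
    """Rough per-DataNode xceiver thread count caused by in-flight writes.

    Assumption: one write pipeline per open file, one xceiver thread per
    DataNode in the pipeline, pipelines spread evenly across DataNodes.
    """
    pipeline_threads = open_files * replication
    return math.ceil(pipeline_threads / datanodes)


# Hypothetical workload: these two numbers are illustrative only.
reducers_running = 60      # reducers writing at the same time
files_per_reducer = 150    # output files each reducer keeps open

per_node = estimate_xceivers_per_node(
    open_files=reducers_running * files_per_reducer,
    replication=3,   # from the "reach 3" replica warning
    datanodes=6,     # cluster size stated above
)
print(per_node)
```

Under these assumptions the estimate grows linearly with the number of files each reducer holds open, so comparing it against the configured limit gives a quick sanity check before a run.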

Thank you!

Best,

Ed
