hadoop-common-user mailing list archives

From Azuryy <azury...@gmail.com>
Subject Re: Exception in createBlockOutputStream trying to start application
Date Fri, 04 Apr 2014 22:44:36 GMT
Hi Murillo,

Generally an EOF error like this is caused by the network.

You can ignore this exception; it just tells you that one DN in the pipeline is bad, and the DFSClient will avoid that node next time.
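
For reference, if I remember right the 105000 ms SocketTimeoutException in your DataNode log is the base HDFS socket timeout plus a per-pipeline-node extension, so these are the knobs involved if the network really is the problem. A minimal hdfs-site.xml sketch, assuming Hadoop 2.x property names; the values shown are just the stock defaults, not a recommendation:

<!-- hdfs-site.xml (sketch only; defaults shown, verify against hdfs-default.xml) -->
<property>
  <name>dfs.client.socket-timeout</name>
  <value>60000</value>   <!-- read timeout in ms for client and pipeline sockets; raise on a slow or flaky network -->
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>480000</value>  <!-- write timeout in ms for DataNode sockets -->
</property>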


Sent from my iPhone5s

> On Apr 5, 2014, at 1:25, Murillo Flores <murillo@chaordic.com.br> wrote:
> 
> Hello everybody,
> 
> I'm running a Hadoop 2.2.0 cluster with 120 "slaves" running a datanode and a nodemanager each, and 1 "master" running the namenode, the secondary namenode, the resource manager (YARN) and the JobHistoryServer. In this cluster, my attempts to start an application that downloads all its input data from S3 occasionally fail with messages like the ones below:
> 
> 14/04/02 21:43:12 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as 10.91.140.216:50010
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1166)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> 14/04/02 21:43:12 INFO hdfs.DFSClient: Abandoning BP-1137777807-10.225.13.33-1396473453940:blk_1073741844_1020
> 14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /10.225.13.33:8020 from root sending #113
> 14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /10.225.13.33:8020 from root got value #113
> 14/04/02 21:43:12 DEBUG ipc.ProtobufRpcEngine: Call: abandonBlock took 2ms
> 14/04/02 21:43:12 INFO hdfs.DFSClient: Excluding datanode 10.91.140.216:50010
> 14/04/02 21:43:12 WARN hdfs.DFSClient: DataStreamer Exception
> java.io.IOException: Unable to create new block.
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1100)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> 14/04/02 21:43:12 WARN hdfs.DFSClient: Could not get block locations. Source file "/tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001/job.split" - Aborting...
> 14/04/02 21:43:12 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001
> 14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /10.225.13.33:8020 from root sending #114
> 14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /10.225.13.33:8020 from root got value #114
> 14/04/02 21:43:12 DEBUG ipc.ProtobufRpcEngine: Call: delete took 35ms
> 14/04/02 21:43:12 ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:java.io.IOException: Bad connect ack with firstBadLink as 10.91.140.216:50010
> 14/04/02 21:43:12 ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:java.io.IOException: Bad connect ack with firstBadLink as 10.91.140.216:50010
> 14/04/02 21:43:12 ERROR streaming.StreamJob: Error Launching job : Bad connect ack with firstBadLink as 10.91.140.216:50010
> Streaming Command Failed!
> 14/04/02 21:43:12 ERROR hdfs.DFSClient: Failed to close file /tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001/job.split
> java.io.IOException: Bad connect ack with firstBadLink as 10.91.140.216:50010
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1166)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> 
> In the datanodes' logs I'm seeing both "Connection timeout" and "Premature EOF" exceptions. It looks like the first datanode that receives the data from the namenode fails with an exception like this one below:
> 
> 14/04/02 21:32:39 ERROR datanode.DataNode: DataNode{data=FSDataset{dirpath='[/mnt/hadoop/current, /mnt/data1/hadoop/current, /mnt/data2/hadoop/current, /mnt/data3/hadoop/current]'}, localName='ec2-54-80-176-6.compute-1.amazonaws.com:50010', storageID='DS-1182379849-10.91.166.67-50010-1396473477988', xmitsInProgress=0}:Exception transfering block BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 to mirror 10.65.0.211:50010: java.net.SocketTimeoutException: 105000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776 remote=/10.65.0.211:50010]
> 14/04/02 21:32:39 INFO datanode.DataNode: opWriteBlock BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 received exception java.net.SocketTimeoutException: 105000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776 remote=/10.65.0.211:50010]
> 14/04/02 21:32:39 ERROR datanode.DataNode: ec2-54-80-176-6.compute-1.amazonaws.com:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.225.13.33:36287 dest: /10.91.166.67:50010
> java.net.SocketTimeoutException: 105000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776 remote=/10.65.0.211:50010]
> 	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
> 	at java.io.FilterInputStream.read(FilterInputStream.java:83)
> 	at java.io.FilterInputStream.read(FilterInputStream.java:83)
> 	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1490)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:511)
> 
> And the other datanodes dealing with the same block (receiving it from another datanode instead of the namenode?) fail with this:
> 
> 14/04/02 21:30:54 DEBUG datanode.DataNode: opWriteBlock: stage=PIPELINE_SETUP_CREATE, clientname=DFSClient_NONMAPREDUCE_1796898153_1
>   block  =BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012, newGs=0, bytesRcvd=[0, 0]
>   targets=[10.91.165.2:50010, 10.31.188.219:50010, 10.91.137.130:50010, 10.91.133.28:50010, 10.91.140.216:50010, 10.65.2.239:50010, 10.91.134.151:50010, 10.91.132.32:50010]; pipelineSize=10, srcDataNode=:0
> 14/04/02 21:30:54 DEBUG datanode.DataNode: isDatanode=false, isClient=true, isTransfer=false
> 14/04/02 21:30:54 DEBUG datanode.DataNode: writeBlock receive buf size 131071 tcp no delay true
> 14/04/02 21:30:54 INFO datanode.DataNode: Receiving BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 src: /10.91.166.67:37776 dest: /10.65.0.211:50010
> 14/04/02 21:30:54 DEBUG datanode.DataNode: BlockReceiver: BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012
>   isClient  =true, clientname=DFSClient_NONMAPREDUCE_1796898153_1
>   isDatanode=false, srcDataNode=:0
>   inAddr=/10.91.166.67:37776, myAddr=/10.65.0.211:50010
>   cachingStrategy = CachingStrategy(dropBehind=null, readahead=null)
> 14/04/02 21:30:54 DEBUG datanode.DataNode: writeTo blockfile is /mnt/hadoop/current/BP-1137777807-10.225.13.33-1396473453940/current/rbw/blk_1073741836 of size 0
> 14/04/02 21:30:54 DEBUG datanode.DataNode: writeTo metafile is /mnt/hadoop/current/BP-1137777807-10.225.13.33-1396473453940/current/rbw/blk_1073741836_1012.meta of size 0
> 14/04/02 21:30:54 DEBUG datanode.DataNode: Connecting to datanode 10.91.165.2:50010
> 14/04/02 21:30:55 DEBUG ipc.Client: IPC Client (1818694875) connection to ec2-54-81-119-149.compute-1.amazonaws.com/10.225.13.33:8020 from hadoop sending #234
> 14/04/02 21:30:55 DEBUG ipc.Client: IPC Client (1818694875) connection to ec2-54-81-119-149.compute-1.amazonaws.com/10.225.13.33:8020 from hadoop got value #234
> 14/04/02 21:30:55 DEBUG ipc.ProtobufRpcEngine: Call: blockReceivedAndDeleted took 3ms
> --
> --
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.Sink_gangliaQsize, type int32, value 0, slope both from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.SnapshotAvgTime, type double, value 0.14285714285714288, slope both from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.Sink_gangliaNumOps, type float, value 87, slope positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.NumActiveSinks, type int32, value 1, slope both from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.DroppedPubAll, type float, value 0, slope positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.SnapshotNumOps, type float, value 615, slope positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.NumAllSinks, type int32, value 0, slope both from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG impl.MetricsSinkAdapter: Done
> 14/04/02 21:32:41 INFO datanode.DataNode: Datanode 8 got response for connect ack  from downstream datanode with firstbadlink as 10.91.133.28:50010
> 14/04/02 21:32:41 INFO datanode.DataNode: Datanode 8 forwarding connect ack to upstream firstbadlink is 10.91.133.28:50010
> 14/04/02 21:32:41 INFO datanode.DataNode: Exception for BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012
> java.io.IOException: Premature EOF from inputStream
> 	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
> 	at java.lang.Thread.run(Thread.java:744)
> 14/04/02 21:32:41 INFO datanode.DataNode: PacketResponder: BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012, type=HAS_DOWNSTREAM_IN_PIPELINE: Thread is interrupted.
> 14/04/02 21:32:41 INFO datanode.DataNode: PacketResponder: BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
> 14/04/02 21:32:41 INFO datanode.DataNode: opWriteBlock BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 received exception java.io.IOException: Premature EOF from inputStream
> 14/04/02 21:32:41 ERROR datanode.DataNode: ec2-54-80-198-241.compute-1.amazonaws.com:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.91.166.67:37776 dest: /10.65.0.211:50010
> java.io.IOException: Premature EOF from inputStream
> 	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
> 
> 
> I've already changed some configurations trying to prevent these exceptions from happening, like the ones below, and I've increased the maximum number of open file descriptors and processes (ulimit). It's worth noting that none of this has helped so far :/
> 
> dfs.datanode.max.xcievers -> 4096
> dfs.client.block.write.retries -> 12
> dfs.namenode.handler.count -> 64
> dfs.replication -> 3
> ulimit:
> * soft nofile 65536
> * hard nofile 65536
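> 
> For reference, in hdfs-site.xml these settings look like the sketch below (dfs.datanode.max.xcievers is the legacy spelling; if I recall correctly, Hadoop 2.x maps it to dfs.datanode.max.transfer.threads):
> 
> <property>
>   <name>dfs.datanode.max.transfer.threads</name>  <!-- legacy key: dfs.datanode.max.xcievers -->
>   <value>4096</value>
> </property>
> <property>
>   <name>dfs.client.block.write.retries</name>
>   <value>12</value>
> </property>
> <property>
>   <name>dfs.namenode.handler.count</name>
>   <value>64</value>
> </property>
> <property>
>   <name>dfs.replication</name>
>   <value>3</value>
> </property>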
> 
> Does anyone have any clue about what could be going on in my cluster, or where else I could look to get some helpful information?
> Thanks.
> 
> -- 
> Murillo
> 
> Chaordic
> www.chaordic.com.br
> 55 48 3232.3200
> 
