hadoop-mapreduce-user mailing list archives

From Murillo Flores <muri...@chaordic.com.br>
Subject Re: Exception in createBlockOutputStream trying to start application
Date Mon, 07 Apr 2014 20:14:48 GMT
Thanks for your answer, Azuryy.
Do you know of any configuration we could tune to be more permissive with
these errors? I'm not sure that 'dfs.client.block.write.retries' serves
this purpose.
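
For reference, this is the kind of hdfs-site.xml change I had in mind. It's
only a sketch: the values are guesses, and I'm assuming these are the keys
that actually govern the pipeline setup timeouts we're hitting:

<property>
  <name>dfs.client.block.write.retries</name>
  <value>12</value> <!-- already raised from the default of 3, apparently without effect -->
</property>
<property>
  <name>dfs.client.socket-timeout</name>
  <value>120000</value> <!-- client read timeout in ms; default 60000 -->
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>960000</value> <!-- datanode write timeout in ms; default 480000 -->
</property>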




2014-04-04 19:44 GMT-03:00 Azuryy <azuryyyu@gmail.com>:

> Hi Murillo,
>
> Generally, EOF errors are caused by the network.
>
> You can ignore this exception; it just tells you that one DN in the pipeline
> is bad, and the DFSClient will avoid it next time.
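>
> Roughly (a sketch from memory, so double-check the key name and default for
> your version), the number of pipeline retries before the client gives up is
> bounded by:
>
> <property>
>   <name>dfs.client.block.write.retries</name>
>   <value>3</value>
>   <!-- default; each retry abandons the block, excludes the bad DN and asks
>        the NN for a new pipeline, which matches the Abandoning/Excluding
>        lines in your log -->
> </property>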
>
>
> Sent from my iPhone5s
>
> On April 5, 2014, at 1:25, Murillo Flores <murillo@chaordic.com.br> wrote:
>
> Hello everybody,
>
> I'm running a Hadoop 2.2.0 cluster with 120 "slaves", each running a datanode
> and a nodemanager, and 1 "master" running the namenode, the secondary
> namenode, the resource manager (YARN) and the JobHistoryServer. In this
> cluster, my attempts to start an application that downloads all of its
> input data from S3 occasionally fail with messages like the one below:
>
> 14/04/02 21:43:12 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as
> 10.91.140.216:50010
>  at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1166)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088)
>  at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> 14/04/02 21:43:12 INFO hdfs.DFSClient: Abandoning
> BP-1137777807-10.225.13.33-1396473453940:blk_1073741844_1020
> 14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to
> /10.225.13.33:8020 from root sending #113
> 14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to
> /10.225.13.33:8020 from root got value #113
> 14/04/02 21:43:12 DEBUG ipc.ProtobufRpcEngine: Call: abandonBlock took 2ms
> 14/04/02 21:43:12 INFO hdfs.DFSClient: Excluding datanode
> 10.91.140.216:50010
> 14/04/02 21:43:12 WARN hdfs.DFSClient: DataStreamer Exception
> java.io.IOException: Unable to create new block.
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1100)
>  at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> 14/04/02 21:43:12 WARN hdfs.DFSClient: Could not get block locations.
> Source file
> "/tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001/job.split" -
> Aborting...
> 14/04/02 21:43:12 INFO mapreduce.JobSubmitter: Cleaning up the staging
> area /tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001
> 14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to
> /10.225.13.33:8020 from root sending #114
> 14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to
> /10.225.13.33:8020 from root got value #114
> 14/04/02 21:43:12 DEBUG ipc.ProtobufRpcEngine: Call: delete took 35ms
> 14/04/02 21:43:12 ERROR security.UserGroupInformation:
> PriviledgedActionException as:root (auth:SIMPLE) cause:java.io.IOException:
> Bad connect ack with firstBadLink as 10.91.140.216:50010
> 14/04/02 21:43:12 ERROR security.UserGroupInformation:
> PriviledgedActionException as:root (auth:SIMPLE) cause:java.io.IOException:
> Bad connect ack with firstBadLink as 10.91.140.216:50010
> 14/04/02 21:43:12 ERROR streaming.StreamJob: Error Launching job : Bad
> connect ack with firstBadLink as 10.91.140.216:50010
>  Streaming Command Failed!
> 14/04/02 21:43:12 ERROR hdfs.DFSClient: Failed to close file
> /tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001/job.split
> java.io.IOException: Bad connect ack with firstBadLink as
> 10.91.140.216:50010
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1166)
>  at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
>
>
> In the datanodes' logs, I'm seeing both "Connection timeout" and "Premature
> EOF" exceptions. It looks like the first datanode that receives the data from
> the namenode fails with an exception like the one below:
>
> 14/04/02 21:32:39 ERROR datanode.DataNode:
> DataNode{data=FSDataset{dirpath='[/mnt/hadoop/current,
> /mnt/data1/hadoop/current, /mnt/data2/hadoop/current,
> /mnt/data3/hadoop/current]'}, localName='
> ec2-54-80-176-6.compute-1.amazonaws.com:50010',
> storageID='DS-1182379849-10.91.166.67-50010-1396473477988',
> xmitsInProgress=0}:Exception transfering block
> BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 to mirror
> 10.65.0.211:50010: java.net.SocketTimeoutException: 105000 millis timeout
> while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776remote=/
> 10.65.0.211:50010]
> 14/04/02 21:32:39 INFO datanode.DataNode: opWriteBlock
> BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 received
> exception java.net.SocketTimeoutException: 105000 millis timeout while
> waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776remote=/
> 10.65.0.211:50010]
> 14/04/02 21:32:39 ERROR datanode.DataNode:
> ec2-54-80-176-6.compute-1.amazonaws.com:50010:DataXceiver error
> processing WRITE_BLOCK operation  src: /10.225.13.33:36287 dest: /
> 10.91.166.67:50010
> java.net.SocketTimeoutException: 105000 millis timeout while waiting for
> channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776remote=/
> 10.65.0.211:50010]
>  at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>  at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>  at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
>  at
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1490)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:511)
>
>
> And the other datanodes dealing with the same block (receiving it from
> another datanode instead of the namenode?) fail with this:
>
> 14/04/02 21:30:54 DEBUG datanode.DataNode: opWriteBlock:
> stage=PIPELINE_SETUP_CREATE, clientname=DFSClient_NONMAPREDUCE_1796898153_1
>   block  =BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012,
> newGs=0, bytesRcvd=[0, 0]
>   targets=[10.91.165.2:50010, 10.31.188.219:50010, 10.91.137.130:50010,
> 10.91.133.28:50010, 10.91.140.216:50010, 10.65.2.239:50010,
> 10.91.134.151:50010, 10.91.132.32:50010]; pipelineSize=10, srcDataNode=:0
> 14/04/02 21:30:54 DEBUG datanode.DataNode: isDatanode=false,
> isClient=true, isTransfer=false
> 14/04/02 21:30:54 DEBUG datanode.DataNode: writeBlock receive buf size
> 131071 tcp no delay true
> 14/04/02 21:30:54 INFO datanode.DataNode: Receiving
> BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 src: /
> 10.91.166.67:37776 dest: /10.65.0.211:50010
> 14/04/02 21:30:54 DEBUG datanode.DataNode: BlockReceiver:
> BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012
>   isClient  =true, clientname=DFSClient_NONMAPREDUCE_1796898153_1
>   isDatanode=false, srcDataNode=:0
>   inAddr=/10.91.166.67:37776, myAddr=/10.65.0.211:50010
>   cachingStrategy = CachingStrategy(dropBehind=null, readahead=null)
> 14/04/02 21:30:54 DEBUG datanode.DataNode: writeTo blockfile is
> /mnt/hadoop/current/BP-1137777807-10.225.13.33-1396473453940/current/rbw/blk_1073741836
> of size 0
> 14/04/02 21:30:54 DEBUG datanode.DataNode: writeTo metafile is
> /mnt/hadoop/current/BP-1137777807-10.225.13.33-1396473453940/current/rbw/blk_1073741836_1012.meta
> of size 0
>  14/04/02 21:30:54 DEBUG datanode.DataNode: Connecting to datanode
> 10.91.165.2:50010
> 14/04/02 21:30:55 DEBUG ipc.Client: IPC Client (1818694875) connection to
> ec2-54-81-119-149.compute-1.amazonaws.com/10.225.13.33:8020 from hadoop
> sending #234
> 14/04/02 21:30:55 DEBUG ipc.Client: IPC Client (1818694875) connection to
> ec2-54-81-119-149.compute-1.amazonaws.com/10.225.13.33:8020 from hadoop
> got value #234
> 14/04/02 21:30:55 DEBUG ipc.ProtobufRpcEngine: Call:
> blockReceivedAndDeleted took 3ms
> --
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
> metricssystem.MetricsSystem.Sink_gangliaQsize, type int32, value 0, slope
> both from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
> metricssystem.MetricsSystem.SnapshotAvgTime, type double, value
> 0.14285714285714288, slope both from hostname
> ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
> metricssystem.MetricsSystem.Sink_gangliaNumOps, type float, value 87, slope
> positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
> metricssystem.MetricsSystem.NumActiveSinks, type int32, value 1, slope both
> from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
> metricssystem.MetricsSystem.DroppedPubAll, type float, value 0, slope
> positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
> metricssystem.MetricsSystem.SnapshotNumOps, type float, value 615, slope
> positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
> metricssystem.MetricsSystem.NumAllSinks, type int32, value 0, slope both
> from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG impl.MetricsSinkAdapter: Done
> 14/04/02 21:32:41 INFO datanode.DataNode: Datanode 8 got response for
> connect ack  from downstream datanode with firstbadlink as
> 10.91.133.28:50010
> 14/04/02 21:32:41 INFO datanode.DataNode: Datanode 8 forwarding connect
> ack to upstream firstbadlink is 10.91.133.28:50010
>  14/04/02 21:32:41 INFO datanode.DataNode: Exception for
> BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012
> java.io.IOException: Premature EOF from inputStream
>  at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>  at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>  at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
>  at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
>  at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
>  at java.lang.Thread.run(Thread.java:744)
> 14/04/02 21:32:41 INFO datanode.DataNode: PacketResponder:
> BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012,
> type=HAS_DOWNSTREAM_IN_PIPELINE: Thread is interrupted.
> 14/04/02 21:32:41 INFO datanode.DataNode: PacketResponder:
> BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012,
> type=HAS_DOWNSTREAM_IN_PIPELINE terminating
> 14/04/02 21:32:41 INFO datanode.DataNode: opWriteBlock
> BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 received
> exception java.io.IOException: Premature EOF from inputStream
> 14/04/02 21:32:41 ERROR datanode.DataNode:
> ec2-54-80-198-241.compute-1.amazonaws.com:50010:DataXceiver error
> processing WRITE_BLOCK operation  src: /10.91.166.67:37776 dest: /
> 10.65.0.211:50010
> java.io.IOException: Premature EOF from inputStream
> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
>  at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>  at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
>  at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
>  at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
>
>
>
> I've already changed some configurations to try to prevent these exceptions
> from happening, like the ones below, and increased the maximum number of open
> file descriptors and processes (ulimit). It's worth noting that none of them
> has helped so far :/
>
> dfs.datanode.max.xcievers -> 4096
> dfs.client.block.write.retries -> 12
> dfs.namenode.handler.count -> 64
> dfs.replication -> 3
> ulimit:
> * soft nofile 65536
> * hard nofile 65536
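>
> As an aside, I believe that on 2.x dfs.datanode.max.xcievers has been
> superseded by dfs.datanode.max.transfer.threads (the old name still seems to
> be accepted as a deprecated alias), so in hdfs-site.xml the same setting
> would look roughly like:
>
> <property>
>   <name>dfs.datanode.max.transfer.threads</name>
>   <value>4096</value>
> </property>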
>
> Does anyone have any clue about what could be going on in my cluster, or
> where else I could look to get some helpful information?
> Thanks.
>
> --
> *Murillo*
>
> Chaordic
> *www.chaordic.com.br <http://www.chaordic.com.br/>*
> 55 48 3232.3200
>


-- 
*Murillo*

 Chaordic
*www.chaordic.com.br <http://www.chaordic.com.br/>*
55 48 3232.3200
