hadoop-mapreduce-user mailing list archives

From Murillo Flores <muri...@chaordic.com.br>
Subject Exception in createBlockOutputStream trying to start application
Date Fri, 04 Apr 2014 17:25:51 GMT
Hello everybody,

I'm running a Hadoop 2.2.0 cluster with 120 "slaves", each running a datanode and
a nodemanager, and 1 "master" running the namenode, the secondary
namenode, the resource manager (YARN) and the JobHistoryServer. In this
cluster, my attempts to start an application that downloads all
its input data from S3 occasionally fail. The job is submitted as a
Hadoop Streaming job, roughly like the sketch below.
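(The command below is only an illustration of how the job is launched; the
bucket, paths and mapper/reducer script names are placeholders, not my real job.)

  hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -input s3n://my-bucket/input/ \
    -output /user/root/output \
    -mapper my_mapper.py \
    -reducer my_reducer.py \
    -file my_mapper.py \
    -file my_reducer.py

When the submission fails, the client output looks like this: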

14/04/02 21:43:12 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as
10.91.140.216:50010
 at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1166)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088)
 at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
14/04/02 21:43:12 INFO hdfs.DFSClient: Abandoning
BP-1137777807-10.225.13.33-1396473453940:blk_1073741844_1020
14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /
10.225.13.33:8020 from root sending #113
14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /
10.225.13.33:8020 from root got value #113
14/04/02 21:43:12 DEBUG ipc.ProtobufRpcEngine: Call: abandonBlock took 2ms
14/04/02 21:43:12 INFO hdfs.DFSClient: Excluding datanode
10.91.140.216:50010
14/04/02 21:43:12 WARN hdfs.DFSClient: DataStreamer Exception
java.io.IOException: Unable to create new block.
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1100)
 at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
14/04/02 21:43:12 WARN hdfs.DFSClient: Could not get block locations.
Source file
"/tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001/job.split" -
Aborting...
14/04/02 21:43:12 INFO mapreduce.JobSubmitter: Cleaning up the staging area
/tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001
14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /
10.225.13.33:8020 from root sending #114
14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /
10.225.13.33:8020 from root got value #114
14/04/02 21:43:12 DEBUG ipc.ProtobufRpcEngine: Call: delete took 35ms
14/04/02 21:43:12 ERROR security.UserGroupInformation:
PriviledgedActionException as:root (auth:SIMPLE) cause:java.io.IOException:
Bad connect ack with firstBadLink as 10.91.140.216:50010
14/04/02 21:43:12 ERROR security.UserGroupInformation:
PriviledgedActionException as:root (auth:SIMPLE) cause:java.io.IOException:
Bad connect ack with firstBadLink as 10.91.140.216:50010
14/04/02 21:43:12 ERROR streaming.StreamJob: Error Launching job : Bad
connect ack with firstBadLink as 10.91.140.216:50010
 Streaming Command Failed!
14/04/02 21:43:12 ERROR hdfs.DFSClient: Failed to close file
/tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001/job.split
java.io.IOException: Bad connect ack with firstBadLink as
10.91.140.216:50010
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1166)
 at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)


In the datanodes' logs, I'm seeing both "Connection timeout" and "Premature
EOF" exceptions. It looks like the first datanode in the write pipeline, the one
that receives the data directly from the client (running on the master), fails
with an exception like this one below:

14/04/02 21:32:39 ERROR datanode.DataNode:
DataNode{data=FSDataset{dirpath='[/mnt/hadoop/current,
/mnt/data1/hadoop/current, /mnt/data2/hadoop/current,
/mnt/data3/hadoop/current]'}, localName='
ec2-54-80-176-6.compute-1.amazonaws.com:50010',
storageID='DS-1182379849-10.91.166.67-50010-1396473477988',
xmitsInProgress=0}:Exception transfering block
BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 to mirror
10.65.0.211:50010: java.net.SocketTimeoutException: 105000 millis timeout
while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776 remote=/
10.65.0.211:50010]
14/04/02 21:32:39 INFO datanode.DataNode: opWriteBlock
BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 received
exception java.net.SocketTimeoutException: 105000 millis timeout while
waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776 remote=/
10.65.0.211:50010]
14/04/02 21:32:39 ERROR datanode.DataNode:
ec2-54-80-176-6.compute-1.amazonaws.com:50010:DataXceiver error processing
WRITE_BLOCK operation  src: /10.225.13.33:36287 dest: /10.91.166.67:50010
java.net.SocketTimeoutException: 105000 millis timeout while waiting for
channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776 remote=/
10.65.0.211:50010]
 at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
 at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
 at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
 at
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1490)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:511)


And the other datanodes handling the same block further down the pipeline
(receiving it from the previous datanode rather than directly from the client)
fail with this:

14/04/02 21:30:54 DEBUG datanode.DataNode: opWriteBlock:
stage=PIPELINE_SETUP_CREATE, clientname=DFSClient_NONMAPREDUCE_1796898153_1
  block  =BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012,
newGs=0, bytesRcvd=[0, 0]
  targets=[10.91.165.2:50010, 10.31.188.219:50010, 10.91.137.130:50010,
10.91.133.28:50010, 10.91.140.216:50010, 10.65.2.239:50010,
10.91.134.151:50010, 10.91.132.32:50010]; pipelineSize=10, srcDataNode=:0
14/04/02 21:30:54 DEBUG datanode.DataNode: isDatanode=false, isClient=true,
isTransfer=false
14/04/02 21:30:54 DEBUG datanode.DataNode: writeBlock receive buf size
131071 tcp no delay true
14/04/02 21:30:54 INFO datanode.DataNode: Receiving
BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 src: /
10.91.166.67:37776 dest: /10.65.0.211:50010
14/04/02 21:30:54 DEBUG datanode.DataNode: BlockReceiver:
BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012
  isClient  =true, clientname=DFSClient_NONMAPREDUCE_1796898153_1
  isDatanode=false, srcDataNode=:0
  inAddr=/10.91.166.67:37776, myAddr=/10.65.0.211:50010
  cachingStrategy = CachingStrategy(dropBehind=null, readahead=null)
14/04/02 21:30:54 DEBUG datanode.DataNode: writeTo blockfile is
/mnt/hadoop/current/BP-1137777807-10.225.13.33-1396473453940/current/rbw/blk_1073741836
of size 0
14/04/02 21:30:54 DEBUG datanode.DataNode: writeTo metafile is
/mnt/hadoop/current/BP-1137777807-10.225.13.33-1396473453940/current/rbw/blk_1073741836_1012.meta
of size 0
 14/04/02 21:30:54 DEBUG datanode.DataNode: Connecting to datanode
10.91.165.2:50010
14/04/02 21:30:55 DEBUG ipc.Client: IPC Client (1818694875) connection to
ec2-54-81-119-149.compute-1.amazonaws.com/10.225.13.33:8020 from hadoop
sending #234
14/04/02 21:30:55 DEBUG ipc.Client: IPC Client (1818694875) connection to
ec2-54-81-119-149.compute-1.amazonaws.com/10.225.13.33:8020 from hadoop got
value #234
14/04/02 21:30:55 DEBUG ipc.ProtobufRpcEngine: Call:
blockReceivedAndDeleted took 3ms
--
14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
metricssystem.MetricsSystem.Sink_gangliaQsize, type int32, value 0, slope
both from hostname ec2-54-80-198-241.compute-1.amazonaws.com
14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
metricssystem.MetricsSystem.SnapshotAvgTime, type double, value
0.14285714285714288, slope both from hostname
ec2-54-80-198-241.compute-1.amazonaws.com
14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
metricssystem.MetricsSystem.Sink_gangliaNumOps, type float, value 87, slope
positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
metricssystem.MetricsSystem.NumActiveSinks, type int32, value 1, slope both
from hostname ec2-54-80-198-241.compute-1.amazonaws.com
14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
metricssystem.MetricsSystem.DroppedPubAll, type float, value 0, slope
positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
metricssystem.MetricsSystem.SnapshotNumOps, type float, value 615, slope
positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
metricssystem.MetricsSystem.NumAllSinks, type int32, value 0, slope both
from hostname ec2-54-80-198-241.compute-1.amazonaws.com
14/04/02 21:32:39 DEBUG impl.MetricsSinkAdapter: Done
14/04/02 21:32:41 INFO datanode.DataNode: Datanode 8 got response for
connect ack  from downstream datanode with firstbadlink as
10.91.133.28:50010
14/04/02 21:32:41 INFO datanode.DataNode: Datanode 8 forwarding connect ack
to upstream firstbadlink is 10.91.133.28:50010
 14/04/02 21:32:41 INFO datanode.DataNode: Exception for
BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012
java.io.IOException: Premature EOF from inputStream
 at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
 at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
 at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
 at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
 at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
 at java.lang.Thread.run(Thread.java:744)
14/04/02 21:32:41 INFO datanode.DataNode: PacketResponder:
BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012,
type=HAS_DOWNSTREAM_IN_PIPELINE: Thread is interrupted.
14/04/02 21:32:41 INFO datanode.DataNode: PacketResponder:
BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012,
type=HAS_DOWNSTREAM_IN_PIPELINE terminating
14/04/02 21:32:41 INFO datanode.DataNode: opWriteBlock
BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 received
exception java.io.IOException: Premature EOF from inputStream
14/04/02 21:32:41 ERROR datanode.DataNode:
ec2-54-80-198-241.compute-1.amazonaws.com:50010:DataXceiver error
processing WRITE_BLOCK operation  src: /10.91.166.67:37776 dest: /
10.65.0.211:50010
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
 at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
 at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
 at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
 at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)



I've already changed some configurations trying to prevent these exceptions
from happening, like the ones below, and I've increased the maximum number of
open file descriptors and processes (ulimit). It's worth noting that none of
them has helped so far :/

dfs.datanode.max.xcievers -> 4096
dfs.client.block.write.retries -> 12
dfs.namenode.handler.count -> 64
dfs.replication -> 3
ulimit:
* soft nofile 65536
* hard nofile 65536
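
For completeness, the HDFS properties above are set in hdfs-site.xml on all
nodes, roughly like the snippet below (a simplified sketch, not a verbatim copy
of my config files):

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
  <property>
    <name>dfs.client.block.write.retries</name>
    <value>12</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>64</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>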

Does anyone have any clue about what could be going on in my cluster, or
where else I could look for some helpful information?
Thanks.

-- 
*Murillo*

Chaordic
*www.chaordic.com.br <http://www.chaordic.com.br/>*
55 48 3232.3200


