hadoop-user mailing list archives

From Yuzhang Han <yuzhanghan1...@gmail.com>
Subject Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution
Date Tue, 25 Jun 2013 04:15:45 GMT
Omkar, by the way, I see the following exception in my nodemanager log. Do you
think it is related?

2013-06-24 16:15:48,039 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
2013-06-24 16:17:35,413 ERROR
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught
exception in status-updater
java.lang.reflect.UndeclaredThrowableException
    at
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:135)
    at
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:72)
    at
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:343)
Caused by: com.google.protobuf.ServiceException: java.io.IOException:
Failed on local exception: java.io.IOException: Response is null.; Host
Details : local host is: "ip-10-224-2-190/10.224.2.190"; destination host
is: "ec2-54-224-142-46.compute-1.amazonaws.com":8031;
    at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:212)
    at $Proxy23.nodeHeartbeat(Unknown Source)
    at
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:70)
    ... 1 more
Caused by: java.io.IOException: Failed on local exception:
java.io.IOException: Response is null.; Host Details : local host is:
"ip-10-224-2-190/10.224.2.190"; destination host is: "
ec2-54-224-142-46.compute-1.amazonaws.com":8031;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:761)
    at org.apache.hadoop.ipc.Client.call(Client.java:1239)
    at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    ... 3 more
Caused by: java.io.IOException: Response is null.
    at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:949)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)
2013-06-24 16:17:37,420 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:38,421 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:39,422 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 2 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:40,304 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is
stopped.
2013-06-24 16:17:40,305 INFO
org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is
stopped.
2013-06-24 16:17:40,322 INFO org.mortbay.log: Stopped
SelectChannelConnector@0.0.0.0:8042
2013-06-24 16:17:40,422 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer is
stopped.
2013-06-24 16:17:40,422 INFO org.apache.hadoop.ipc.Server: Stopping server
on 51111
2013-06-24 16:17:40,423 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 3 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:40,424 INFO org.apache.hadoop.ipc.Server: Stopping IPC
Server listener on 51111
2013-06-24 16:17:40,425 INFO org.apache.hadoop.ipc.Server: Stopping IPC
Server Responder
2013-06-24 16:17:40,426 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler
is stopped.
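
In case it helps narrow this down: the retries above are against the
ResourceManager resource-tracker address (port 8031). Below is a small
standalone check I can run from the NodeManager host to see what that
hostname resolves to there and whether the port is reachable. It is only a
rough sketch I put together, with the hostname and port copied from the log
lines above:

import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

public class RmReachabilityCheck {
    public static void main(String[] args) throws Exception {
        // Hostname and port copied from the retry messages in the log above.
        String host = "ec2-54-224-142-46.compute-1.amazonaws.com";
        int port = 8031;

        // The log shows this name resolving to 10.10.79.145; print what it
        // resolves to from this node to spot any DNS mismatch.
        System.out.println(host + " -> " + InetAddress.getByName(host).getHostAddress());

        // Plain TCP connect with a 5-second timeout.
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 5000);
            System.out.println("TCP connect to " + host + ":" + port + " succeeded");
        }
    }
}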


On Tue, Jun 25, 2013 at 12:04 AM, Yuzhang Han <yuzhanghan1982@gmail.com> wrote:

> Thank you, Omkar.
>
> I didn't see other errors in the datanode and namenode logs. My namenode 50070
> interface shows:
>
> Configured Capacity : 393.72 GB
> DFS Used            : 60.86 GB
> Non DFS Used        : 137.51 GB
> DFS Remaining       : 195.35 GB
> DFS Used%           : 15.46%
> DFS Remaining%      : 49.62%
> Block Pool Used     : 60.86 GB
> Block Pool Used%    : 15.46%
> DataNodes usages    : Min 14.55%, Median 16.37%, Max 16.37%, stdev 0.91%
>
>
> It doesn't imply insufficient disk space, does it? Can you think of any
> other possible cause of the exceptions?
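>
> As a quick sanity check on the space question, I can also watch the same
> numbers programmatically while the job runs. This is just a rough sketch of
> mine (it assumes the cluster's Hadoop configuration is on the classpath so
> FileSystem.get() picks up the right namenode), not anything from the job
> itself:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.FsStatus;
>
> public class DfsSpaceCheck {
>     public static void main(String[] args) throws Exception {
>         // Connect to the default filesystem named in the loaded configuration.
>         FileSystem fs = FileSystem.get(new Configuration());
>         // Same capacity/used/remaining figures as the 50070 web page, in bytes.
>         FsStatus status = fs.getStatus();
>         System.out.printf("capacity=%d used=%d remaining=%d%n",
>                 status.getCapacity(), status.getUsed(), status.getRemaining());
>     }
> }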
>
>
> On Mon, Jun 24, 2013 at 6:17 PM, Omkar Joshi <ojoshi@hortonworks.com> wrote:
>
>> Hi,
>>
>> I see there are 2 datanodes, and for some reason the namenode is not able to
>> create even a single replica for the requested blocks. Are you sure the systems
>> on which these datanodes are running have sufficient disk space? Do you see
>> any other errors in the datanode/namenode logs?
>>
>> What must be happening is that, because file creation in HDFS is failing, the
>> reduce attempt is being marked as failed and restarted. Keep checking the
>> namenode state when the job reaches 67%.
>>
>> Thanks,
>> Omkar Joshi
>> *Hortonworks Inc.* <http://www.hortonworks.com>
>>
>>
>> On Mon, Jun 24, 2013 at 3:01 PM, Yuzhang Han <yuzhanghan1982@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I am using YARN. I get some exceptions at my namenode and datanode. They
>>> are thrown when my Reduce progress reaches 67%. Then the reduce phase is
>>> restarted from 0% several times, but it always fails at the same point. Can
>>> someone tell me what I should do? Many thanks!
>>>
>>>
>>> Namenode log:
>>>
>>> 2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap
updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>>> 2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy:
Not able to place enough replicas, still in need of 1 to reach 1
>>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>>> 2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation:
PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002
could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s)
running and no node(s) are excluded in this operation.
>>> 2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler
1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375:
error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002
could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s)
running and no node(s) are excluded in this operation.
>>> java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002
could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s)
running and no node(s) are excluded in this operation.
>>> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)
>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>> 	at javax.security.auth.Subject.doAs(Subject.java:416)
>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
>>> 2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap
updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>>> 2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy:
Not able to place enough replicas, still in need of 1 to reach 1
>>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>>>
>>>
>>>
>>> Datanode log:
>>>
>>> 2013-06-24 19:25:54,695 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>>> BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022,
>>> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
>>> 2013-06-24 19:25:54,699 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
>>> BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /
>>> 10.35.99.108:59710 dest: /10.35.99.108:50010
>>> 2013-06-24 19:25:56,473 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
>>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
>>> java.io.IOException: Connection reset by peer
>>>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>>>     at
>>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>>     at
>>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>>     at
>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
>>>     at
>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>>>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>>     at java.io.DataInputStream.read(DataInputStream.java:149)
>>>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
>>>     at java.lang.Thread.run(Thread.java:679)
>>> 2013-06-24 19:25:56,476 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026,
>>> type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
>>>
>>
>>
>
