hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart
Date Tue, 31 Oct 2017 19:53:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16227402#comment-16227402
] 

Kihwal Lee commented on HDFS-12749:
-----------------------------------

The RPC calls are timing out because the NN cannot serve them fast enough. An influx
of full block reports can slow things down. First of all, the calls can time out, but they
will be retried, i.e. datanodes will retransmit. Also, make sure your NN is configured
correctly, e.g. that the TCP listen queue size, the number of handlers, etc. are large enough
to absorb surges in requests. If there are too many large block reports, it may not be
possible to completely avoid timeouts and retransmissions. This also increases the number
and size of objects sitting in the heap and potentially promoted to the old gen prematurely,
increasing the chance of a full GC. To lessen the memory pressure during block report
surges, configure your cluster to break full block reports down to the individual storage
level. That way, each RPC will be smaller.
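
As a rough illustration of the last point, here is a toy model (not Hadoop code) of how a
split threshold changes the size of each block report RPC. The class, method, and storage
IDs are hypothetical. The rule it models (one combined report when the DN's total block
count is below the threshold, otherwise one report per storage) is my understanding of the
dfs.blockreport.split.threshold setting; please verify it against hdfs-default.xml for your
release.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BlockReportSplitSketch {

    /**
     * Returns the number of blocks carried by each block report RPC.
     * Hypothetical helper: models the split rule only, not the real DataNode code.
     */
    static List<Long> planReportRpcs(Map<String, Long> blocksPerStorage, long splitThreshold) {
        long total = 0;
        for (long n : blocksPerStorage.values()) {
            total += n;
        }
        List<Long> rpcSizes = new ArrayList<>();
        if (total < splitThreshold) {
            // Below the threshold: all storages go out in one RPC.
            rpcSizes.add(total);
        } else {
            // At or above the threshold: one smaller RPC per storage.
            rpcSizes.addAll(blocksPerStorage.values());
        }
        return rpcSizes;
    }

    public static void main(String[] args) {
        Map<String, Long> storages = new LinkedHashMap<>();
        storages.put("DS-1", 400_000L);   // made-up storage IDs and block counts
        storages.put("DS-2", 350_000L);
        storages.put("DS-3", 450_000L);

        System.out.println(planReportRpcs(storages, 2_000_000L)); // [1200000]
        System.out.println(planReportRpcs(storages, 1_000_000L)); // [400000, 350000, 450000]
    }
}
```

Lowering the threshold below the typical per-DN block count means the NN deserializes
several smaller messages instead of one large one, which is the effect described above.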

> DN may not send block report to NN after NN restart
> ---------------------------------------------------
>
>                 Key: HDFS-12749
>                 URL: https://issues.apache.org/jira/browse/HDFS-12749
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: TanYuxin
>
> Our cluster has thousands of DNs and millions of files and blocks. When the NN restarts,
its load is very high.
> After the NN restarts, each DN calls the BPServiceActor#reRegister method to register. But
the register RPC gets an IOException because the NN is busy processing block reports. The
exception is caught at BPServiceActor#processCommand.
> The caught IOException is shown below:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local host is: "DataNode_Host/Datanode_IP"; destination host is: "NameNode_Host":Port;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1474)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
>         at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> The IOException is not caught inside BPServiceActor#register, so it breaks out of the
method and the block report cannot be sent immediately.
> {code}
>   /**
>    * Register one bp with the corresponding NameNode
>    * <p>
>    * The bpDatanode needs to register with the namenode on startup in order
>    * 1) to report which storage it is serving now and 
>    * 2) to receive a registrationID
>    *  
>    * issued by the namenode to recognize registered datanodes.
>    * 
>    * @param nsInfo current NamespaceInfo
>    * @see FSNamesystem#registerDatanode(DatanodeRegistration)
>    * @throws IOException
>    */
>   void register(NamespaceInfo nsInfo) throws IOException {
>     // The handshake() phase loaded the block pool storage
>     // off disk - so update the bpRegistration object from that info
>     DatanodeRegistration newBpRegistration = bpos.createRegistration();
>     LOG.info(this + " beginning handshake with NN");
>     while (shouldRun()) {
>       try {
>         // Use returned registration from namenode with updated fields
>         newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
>         newBpRegistration.setNamespaceInfo(nsInfo);
>         bpRegistration = newBpRegistration;
>         break;
>       } catch(EOFException e) {  // namenode might have just restarted
>         LOG.info("Problem connecting to server: " + nnAddr + " :"
>             + e.getLocalizedMessage());
>         sleepAndLogInterrupts(1000, "connecting to server");
>       } catch(SocketTimeoutException e) {  // namenode is busy
>         LOG.info("Problem connecting to server: " + nnAddr);
>         sleepAndLogInterrupts(1000, "connecting to server");
>       }
>     }
>     
>     LOG.info("Block pool " + this + " successfully registered with NN");
>     bpos.registrationSucceeded(this, bpRegistration);
>     // random short delay - helps scatter the BR from all DNs
>     scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }
> {code}
> But the NameNode has already processed registerDatanode successfully, so it won't ask the
DN to re-register.
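
To make the failure mode in the description concrete: in the stack trace, the
SocketTimeoutException is wrapped inside a plain IOException, so a catch clause for
SocketTimeoutException does not match it. The following is a minimal, self-contained
sketch of a retry loop that inspects the cause chain instead. It is not the actual fix
for this issue, and RegisterCall, causedByTimeout, and registerWithRetry are hypothetical
names standing in for the real RPC and retry logic.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

public class RegisterRetrySketch {

    /** Hypothetical stand-in for the registerDatanode RPC. */
    interface RegisterCall {
        String call() throws IOException;
    }

    /** Returns true if the exception or any of its causes is a SocketTimeoutException. */
    static boolean causedByTimeout(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof SocketTimeoutException) {
                return true;
            }
        }
        return false;
    }

    /** Retries on direct or wrapped timeouts; rethrows any other IOException at once. */
    static String registerWithRetry(RegisterCall rpc, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return rpc.call();
            } catch (IOException e) {
                if (!causedByTimeout(e)) {
                    throw e;              // not a timeout: do not retry
                }
                last = e;                 // timeout, possibly wrapped: back off and retry
                Thread.sleep(sleepMillis);
            }
        }
        throw new IOException("registration failed after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice with the same nesting as the stack trace, then succeeds.
        RegisterCall flaky = () -> {
            if (++calls[0] < 3) {
                throw new IOException("Failed on local exception",
                        new IOException(new SocketTimeoutException("60000 millis timeout")));
            }
            return "registered";
        };
        // prints "registered after 3 attempts"
        System.out.println(registerWithRetry(flaky, 5, 10) + " after " + calls[0] + " attempts");
    }
}
```

Bounding the attempts is a choice made for this sketch only; the register() loop quoted
above instead keeps retrying for as long as shouldRun() returns true.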



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


