hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "TanYuxin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart
Date Tue, 31 Oct 2017 07:38:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226405#comment-16226405
] 

TanYuxin commented on HDFS-12749:
---------------------------------


{code}
// DN and NN log and timeline about DN re-register
// DN start re-register:
2017-10-26 12:59:35,134 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand
action : DNA_REGISTER from Namenode_Host/IP:Port with standby state

// DN SocketTimeoutException and retry to register
2017-10-26 13:02:08,497 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting
to server: Namenode_Host/IP:Port
2017-10-26 13:06:34,204 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting
to server: Namenode_Host/IP:Port

// NN Successfully register the DN
2017-10-26 13:07:24,325 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* registerDatanode:
from DatanodeRegistration(DataNode_IP:Port, datanodeUuid=DataNodeUuid, infoPort=DataNode_port,
infoSecurePort=0, ipcPort=IPCPort, storageInfo=lv=-57;cid=CID-;nsid=NSID;c=0) storage Storage_ID

// DN get IOExecption:
2017-10-26 13:07:35,265 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing
datanode Command
java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException:
60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/DataNode_IP:Port remote=Namenode_Host/IP:Port]; Host Details : local host is: "DataNode_Host/IP";
destination host is: "NameNode_Host":Port;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
        at org.apache.hadoop.ipc.Client.call(Client.java:1474)
        at org.apache.hadoop.ipc.Client.call(Client.java:1407)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(
        ...
{code}


> DN may not send block report to NN after NN restart
> ---------------------------------------------------
>
>                 Key: HDFS-12749
>                 URL: https://issues.apache.org/jira/browse/HDFS-12749
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: TanYuxin
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN restart,
NN's load is very high.
> After NN restart´╝îDN will call BPServiceActor#reRegister method to register. But register
RPC will get a IOException since NN is busy dealing with Block Report.  The exception is caught
at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException:
60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local host is: "DataNode_Host/Datanode_IP";
destination host is: "NameNode_Host":Port;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1474)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
>         at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block Report can not
be sent immediately. 
> {code}
>   /**
>    * Register one bp with the corresponding NameNode
>    * <p>
>    * The bpDatanode needs to register with the namenode on startup in order
>    * 1) to report which storage it is serving now and 
>    * 2) to receive a registrationID
>    *  
>    * issued by the namenode to recognize registered datanodes.
>    * 
>    * @param nsInfo current NamespaceInfo
>    * @see FSNamesystem#registerDatanode(DatanodeRegistration)
>    * @throws IOException
>    */
>   void register(NamespaceInfo nsInfo) throws IOException {
>     // The handshake() phase loaded the block pool storage
>     // off disk - so update the bpRegistration object from that info
>     DatanodeRegistration newBpRegistration = bpos.createRegistration();
>     LOG.info(this + " beginning handshake with NN");
>     while (shouldRun()) {
>       try {
>         // Use returned registration from namenode with updated fields
>         newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
>         newBpRegistration.setNamespaceInfo(nsInfo);
>         bpRegistration = newBpRegistration;
>         break;
>       } catch(EOFException e) {  // namenode might have just restarted
>         LOG.info("Problem connecting to server: " + nnAddr + " :"
>             + e.getLocalizedMessage());
>         sleepAndLogInterrupts(1000, "connecting to server");
>       } catch(SocketTimeoutException e) {  // namenode is busy
>         LOG.info("Problem connecting to server: " + nnAddr);
>         sleepAndLogInterrupts(1000, "connecting to server");
>       }
>     }
>     
>     LOG.info("Block pool " + this + " successfully registered with NN");
>     bpos.registrationSucceeded(this, bpRegistration);
>     // random short delay - helps scatter the BR from all DNs
>     scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }
> {code}
> But NameNode has processed registerDatanode successfully, so it won't ask DN to re-register
again



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org


Mime
View raw message