hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "TanYuxin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart
Date Tue, 31 Oct 2017 06:21:00 GMT

     [ https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

TanYuxin updated HDFS-12749:
----------------------------
    Description: 
Now our cluster have 7000+ DN, files num 180+ million, block num 180+ million. When NN restart,
NN's load is very high.
After SNN restart,DN will call BPServiceActor#reRegister method to register. But register
RPC will get a IOException since NN is busy dealing with Block Report.  The exception is caught
at BPServiceActor#processCommand.
Next is the caught IOException:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode Command
java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException:
60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/10.14.110.33:24562 remote=namenode.host.03/10.14.27.17:8040]; Host Details : local
host is: "datanode-2220/10.14.110.33"; destination host is: "namenode.host.03":8040;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
        at org.apache.hadoop.ipc.Client.call(Client.java:1474)
        at org.apache.hadoop.ipc.Client.call(Client.java:1407)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
        at java.lang.Thread.run(Thread.java:745)
{code}

If encountering a IOException in BPServiceActor#register,  scheduler.scheduleBlockReport method
can't be run, and the Block Report will not be sent immediately. 
But NN has get the register RPC, and successfully register the DN. So NN will not make DN
register again, which makes Block Report  is not sent

  was:
Now our cluster have 7000+ DN, files num 180+ million, block num 180+ million.
After SNN restart,DN will call BPServiceActor#reRegister method to register. But register
RPC will get a IOException since NN is busy dealing with Block Report.  The exception is caught
at BPServiceActor#processCommand.
Next is the caught IOException:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode Command
java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException:
60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/10.14.110.33:24562 remote=namenode.host.03/10.14.27.17:8040]; Host Details : local
host is: "datanode-2220/10.14.110.33"; destination host is: "namenode.host.03":8040;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
        at org.apache.hadoop.ipc.Client.call(Client.java:1474)
        at org.apache.hadoop.ipc.Client.call(Client.java:1407)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
        at java.lang.Thread.run(Thread.java:745)
{code}




> DN may not send block report to NN after NN restart
> ---------------------------------------------------
>
>                 Key: HDFS-12749
>                 URL: https://issues.apache.org/jira/browse/HDFS-12749
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: TanYuxin
>
> Now our cluster have 7000+ DN, files num 180+ million, block num 180+ million. When NN
restart, NN's load is very high.
> After SNN restart,DN will call BPServiceActor#reRegister method to register. But register
RPC will get a IOException since NN is busy dealing with Block Report.  The exception is caught
at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException:
60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/10.14.110.33:24562 remote=namenode.host.03/10.14.27.17:8040]; Host Details : local
host is: "datanode-2220/10.14.110.33"; destination host is: "namenode.host.03":8040;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1474)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
>         at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> If encountering a IOException in BPServiceActor#register,  scheduler.scheduleBlockReport
method can't be run, and the Block Report will not be sent immediately. 
> But NN has get the register RPC, and successfully register the DN. So NN will not make
DN register again, which makes Block Report  is not sent



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org


Mime
View raw message