hadoop-hdfs-issues mailing list archives

From "Jagane Sundar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout
Date Mon, 12 Aug 2013 19:30:49 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737264#comment-13737264 ]

Jagane Sundar commented on HDFS-4858:
-------------------------------------

Manually, this is quite easy to simulate. I have been using ifdown to recreate this problem
consistently. My question is about simulating this in a unit test. If the TCP stack on the Secondary
Name Node machine is alive, it will send a RST when the NN process dies. The only way
to cut off all communication on the socket is to take down the network interface, and
I don't believe that this can be accomplished from Java (I would be happy to be corrected
in this regard).
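
One possible approximation for a unit test, offered as a sketch and not as the approach taken in the attached patch: a peer that completes the TCP handshake and then stays silent looks, from the reading side, much like a peer whose interface has gone down, because no FIN or RST ever arrives. The class and method names below (SilentPeerSketch, readTimesOut) are hypothetical, and the sketch uses only plain java.net sockets, not the Hadoop IPC client. It does not reproduce every effect of ifdown; it only exercises a read against a peer that never answers.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SilentPeerSketch {

    /**
     * Connects to a local listener that never accepts, reads, writes, or
     * closes the connection, then attempts one read with the given timeout.
     * Returns true if the read timed out, false if it returned normally.
     */
    public static boolean readTimesOut(int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks for an ephemeral port. The kernel completes the TCP
        // handshake from the listen backlog, so accept() is never called and
        // the peer stays silent without sending FIN or RST.
        try (ServerSocket silentPeer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket()) {
            client.connect(
                new InetSocketAddress(loopback, silentPeer.getLocalPort()),
                timeoutMillis);
            // Bound every blocking read on this socket.
            client.setSoTimeout(timeoutMillis);
            try {
                client.getInputStream().read();
                return false; // data or EOF arrived, which is not expected here
            } catch (SocketTimeoutException expected) {
                return true;  // the silent peer triggered the read timeout
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```

Without the setSoTimeout call the read would block until the test is killed, which is the client-side symptom described in this issue.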
                
> HDFS DataNode to NameNode RPC should timeout
> --------------------------------------------
>
>                 Key: HDFS-4858
>                 URL: https://issues.apache.org/jira/browse/HDFS-4858
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0, 2.1.0-beta, 2.0.4-alpha, 2.0.5-alpha
>         Environment: Redhat/CentOS 6.4 64 bit Linux
>            Reporter: Jagane Sundar
>            Priority: Minor
>         Attachments: HDFS-4858.patch
>
>
> The DataNode is configured with ipc.client.ping false and ipc.ping.interval 14000. This
configuration means that the IPC Client (DataNode, in this case) should time out in 14000 milliseconds
(14 seconds) if the Standby NameNode does not respond to a sendHeartbeat.
> What we observe is this: If the Standby NameNode happens to reboot for any reason, the
DataNodes that are heartbeating to this Standby get stuck forever while trying to sendHeartbeat.
See the stack trace included below. When the Standby NameNode comes back up, we find that the
DataNode never re-registers with the Standby NameNode. Thereafter, failover fails completely.
> The desired behavior is that the DataNode's sendHeartbeat should time out in 14 seconds,
and keep retrying until the Standby NameNode comes back up. When it does, the DataNode should
reconnect, re-register, and offer service.
> Specifically, in the class DatanodeProtocolClientSideTranslatorPB.java, the method createNamenode
should use RPC.getProtocolProxy and not RPC.getProxy to create the DatanodeProtocolPB object.
> Stack trace of thread stuck in the DataNode after the Standby NN has rebooted:
> Thread 25 (DataNode: [file:///opt/hadoop/data]  heartbeating to vmhost6-vm1/10.10.10.151:8020):
>   State: WAITING
>   Blocked count: 23843
>   Waited count: 45676
>   Waiting on org.apache.hadoop.ipc.Client$Call@305ab6c5
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.ipc.Client.call(Client.java:1220)
>     org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>     sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
>     sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>     sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     java.lang.reflect.Method.invoke(Method.java:597)
>     org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>     org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>     sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
>     org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)
>     org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:445)
>     org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525)
>     org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
>     java.lang.Thread.run(Thread.java:662)
> DataNode RPC to Standby NameNode never times out. 
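
The desired behavior in the description above (give up on a call after a fixed timeout, then retry) can be illustrated with a generic sketch. This is not the Hadoop IPC mechanism and not the attached patch: it swaps in java.util.concurrent's Future.get with a timeout to bound each attempt. The names BoundedHeartbeatSketch and callWithTimeoutAndRetry are hypothetical, and the attempt cap is an assumption added so the example terminates, whereas the issue asks for retrying until the Standby NameNode returns.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedHeartbeatSketch {

    /**
     * Runs the call until one attempt completes within timeoutMillis,
     * trying at most maxAttempts times. Returns the 1-based number of the
     * attempt that succeeded, or throws TimeoutException if none did.
     */
    public static int callWithTimeoutAndRetry(Callable<Void> call,
                                              long timeoutMillis,
                                              int maxAttempts) throws Exception {
        ExecutorService executor = Executors.newCachedThreadPool();
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<Void> pending = executor.submit(call);
                try {
                    // Bounded wait: the caller is never parked indefinitely.
                    pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    return attempt;
                } catch (TimeoutException e) {
                    pending.cancel(true); // interrupt the stuck attempt, then retry
                } catch (ExecutionException e) {
                    // The attempt failed outright; fall through and retry.
                }
            }
            throw new TimeoutException("no response after " + maxAttempts + " attempts");
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a peer that is unresponsive for the first two calls.
        AtomicInteger calls = new AtomicInteger();
        int attempts = callWithTimeoutAndRetry(() -> {
            if (calls.incrementAndGet() < 3) {
                Thread.sleep(60_000); // simulates a call that never returns in time
            }
            return null;
        }, 200, 5);
        System.out.println("succeeded on attempt " + attempts);
    }
}
```

The contrast with the stack trace is the bounded wait: the thread there is parked in Object.wait with no response arriving, while each attempt here is abandoned once its timeout elapses.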

