hadoop-hdfs-dev mailing list archives

From "Jagane Sundar (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout
Date Tue, 28 May 2013 13:47:20 GMT
Jagane Sundar created HDFS-4858:
-----------------------------------

             Summary: HDFS DataNode to NameNode RPC should timeout
                 Key: HDFS-4858
                 URL: https://issues.apache.org/jira/browse/HDFS-4858
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.0.4-alpha, 3.0.0, 2.0.5-beta, 2.0.4.1-alpha
         Environment: Redhat/CentOS 6.4 64 bit Linux
            Reporter: Jagane Sundar
            Priority: Minor
             Fix For: 3.0.0, 2.0.5-beta


The DataNode is configured with ipc.client.ping set to false and ipc.ping.interval set to 14000.
Since ipc.ping.interval is given in milliseconds, this configuration means that the IPC Client
(the DataNode, in this case) should time out after 14000 milliseconds (14 seconds) if the
Standby NameNode does not respond to a sendHeartbeat.
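
A minimal sketch of how the two settings are expected to combine, assuming that a disabled
ping turns the ping interval into the call timeout, as described above. The class and method
names are hypothetical, a plain Map stands in for the Hadoop Configuration, and the default
values used here are assumptions for illustration only.

```java
import java.util.Map;

public class IpcTimeoutSketch {
    /**
     * Returns the expected RPC timeout in milliseconds, or -1 for "no timeout".
     * Hypothetical helper: it models the behavior described in this report,
     * it is not the Hadoop implementation.
     */
    static int rpcTimeoutMillis(Map<String, String> conf) {
        // Assumed defaults for illustration: ping enabled, 60000 ms interval.
        boolean pingEnabled =
            Boolean.parseBoolean(conf.getOrDefault("ipc.client.ping", "true"));
        int pingIntervalMs =
            Integer.parseInt(conf.getOrDefault("ipc.ping.interval", "60000"));
        // With ping disabled, the interval is expected to act as the call timeout.
        return pingEnabled ? -1 : pingIntervalMs;
    }

    public static void main(String[] args) {
        Map<String, String> conf =
            Map.of("ipc.client.ping", "false", "ipc.ping.interval", "14000");
        System.out.println(rpcTimeoutMillis(conf) / 1000 + " seconds"); // 14 seconds
    }
}
```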

What we observe is this: if the Standby NameNode reboots for any reason, the DataNodes
that are heartbeating to this Standby get stuck forever in sendHeartbeat. See the
stack trace included below. When the Standby NameNode comes back up, the DataNode
never re-registers with it. After that, failover to the Standby fails completely.

The desired behavior is that the DataNode's sendHeartbeat should time out after 14 seconds and
keep retrying until the Standby NameNode comes back up. When it does, the DataNode should reconnect,
re-register, and offer service.
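
The desired behavior can be sketched as a bounded call inside a retry loop. This is a
self-contained illustration using java.util.concurrent, not DataNode code: the class name,
the heartbeatUntilAnswered helper, and the simulated NameNode are all hypothetical, and the
timeouts are shortened so the example finishes quickly.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public class HeartbeatRetrySketch {
    /**
     * Repeats the heartbeat until one attempt completes within timeoutMs.
     * Returns the number of attempts made. Hypothetical helper, not Hadoop code.
     */
    static int heartbeatUntilAnswered(Callable<Void> heartbeat,
                                      long timeoutMs, long retryDelayMs) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            int attempts = 0;
            while (true) {
                attempts++;
                Future<Void> pending = pool.submit(heartbeat);
                try {
                    // Bounded wait: give up on this attempt after timeoutMs.
                    pending.get(timeoutMs, TimeUnit.MILLISECONDS);
                    return attempts; // answered; the caller would re-register here
                } catch (TimeoutException e) {
                    pending.cancel(true);       // interrupt the stuck attempt
                    Thread.sleep(retryDelayMs); // back off, then retry
                }
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Simulated Standby NameNode: the first two heartbeats hang, the third answers.
        Callable<Void> heartbeat = () -> {
            if (calls.incrementAndGet() <= 2) {
                Thread.sleep(10_000);
            }
            return null;
        };
        int attempts = heartbeatUntilAnswered(heartbeat, 100, 10);
        System.out.println("answered after " + attempts + " attempts");
    }
}
```

In the scenario reported here, the timeout would be the 14 seconds derived from
ipc.ping.interval rather than the 100 ms used in the sketch.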

Specifically, in the class DatanodeProtocolClientSideTranslatorPB.java, the method createNamenode
should use RPC.getProtocolProxy and not RPC.getProxy to create the DatanodeProtocolPB object.

Stack trace of thread stuck in the DataNode after the Standby NN has rebooted:

Thread 25 (DataNode: [file:///opt/hadoop/data]  heartbeating to vmhost6-vm1/10.10.10.151:8020):
  State: WAITING
  Blocked count: 23843
  Waited count: 45676
  Waiting on org.apache.hadoop.ipc.Client$Call@305ab6c5
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.ipc.Client.call(Client.java:1220)
    org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
    sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    java.lang.reflect.Method.invoke(Method.java:597)
    org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
    org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
    sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
    org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)
    org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:445)
    org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525)
    org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
    java.lang.Thread.run(Thread.java:662)

DataNode RPC to Standby NameNode never times out. 
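
The trace above shows the heartbeat thread parked in Object.wait inside Client.call. As a
generic illustration of the difference a bounded wait makes (this is not the Hadoop Client
code, and the class and method names are hypothetical), a caller can wait on a monitor with
a deadline and report a timeout instead of blocking forever:

```java
public class BoundedWaitSketch {
    private boolean done; // set when the reply arrives

    /** Marks the call as answered and wakes any waiting caller. */
    synchronized void complete() {
        done = true;
        notifyAll();
    }

    /** Waits up to timeoutMs for complete(); returns false if the deadline passes first. */
    synchronized boolean await(long timeoutMs) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!done) {
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return false; // timed out; note that wait(0) would block indefinitely
            }
            wait(remainingMs);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        BoundedWaitSketch call = new BoundedWaitSketch();
        System.out.println(call.await(50)); // no reply arrives: false
        call.complete();
        System.out.println(call.await(50)); // reply already present: true
    }
}
```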

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
