Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Reply-To: hdfs-issues@hadoop.apache.org
Date: Fri, 31 Jan 2014 00:06:19 +0000 (UTC)
From: "Konstantin Boudnik (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout


     [ https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Boudnik updated HDFS-4858:
-------------------------------------
    Status: Open  (was: Patch Available)

> HDFS DataNode to NameNode RPC should timeout
> --------------------------------------------
>
>                 Key: HDFS-4858
>                 URL: https://issues.apache.org/jira/browse/HDFS-4858
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.0.5-alpha, 2.0.4-alpha, 2.1.0-beta, 3.0.0
>         Environment: Redhat/CentOS 6.4 64 bit Linux
>            Reporter: Jagane Sundar
>            Assignee: Jagane Sundar
>            Priority: Minor
>             Fix For: 3.0.0, 2.3.0
>
>         Attachments: HDFS-4858.patch, HDFS-4858.patch
>
>
> The DataNode is configured with ipc.client.ping set to false and ipc.ping.interval set to 14000. This configuration means that the IPC Client (the DataNode, in this case) should time out after 14000 milliseconds (14 seconds) if the Standby NameNode does not respond to a sendHeartbeat.
> What we observe is this: if the Standby NameNode happens to reboot for any reason, the DataNodes that are heartbeating to this Standby get stuck forever while trying to sendHeartbeat. See the stack trace included below. When the Standby NameNode comes back up, we find that the DataNode never re-registers with the Standby NameNode. Thereafter, failover fails completely.
> The desired behavior is that the DataNode's sendHeartbeat should time out after 14 seconds and keep retrying until the Standby NameNode comes back up. When it does, the DataNode should reconnect, re-register, and offer service.
> Specifically, in the class DatanodeProtocolClientSideTranslatorPB.java, the method createNamenode should use RPC.getProtocolProxy and not RPC.getProxy to create the DatanodeProtocolPB object.
> Stack trace of thread stuck in the DataNode after the Standby NN has rebooted:
> Thread 25 (DataNode: [file:///opt/hadoop/data] heartbeating to vmhost6-vm1/10.10.10.151:8020):
>   State: WAITING
>   Blocked count: 23843
>   Waited count: 45676
>   Waiting on org.apache.hadoop.ipc.Client$Call@305ab6c5
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.ipc.Client.call(Client.java:1220)
>     org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>     sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
>     sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>     sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     java.lang.reflect.Method.invoke(Method.java:597)
>     org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>     org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>     sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
>     org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)
>     org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:445)
>     org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525)
>     org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
>     java.lang.Thread.run(Thread.java:662)
> DataNode RPC to Standby NameNode never times out.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
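
For illustration only, the behavior the issue asks for (a heartbeat that gives up after a bounded wait and is retried until the peer answers) can be sketched in plain Java. This is not Hadoop code and not the attached patch: the class HeartbeatTimeoutSketch, its methods, and the maxAttempts cap are hypothetical names introduced here, and the "heartbeat" is any Callable standing in for sendHeartbeat. The point is the contrast with the unbounded Object.wait() seen in the stack trace above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of "time out, then retry"; not taken from Hadoop. */
class HeartbeatTimeoutSketch {

    /** Runs one attempt and waits at most timeoutMs for it to complete. */
    public static boolean tryHeartbeat(ExecutorService pool, Callable<Void> heartbeat,
                                       long timeoutMs) throws InterruptedException {
        Future<Void> pending = pool.submit(heartbeat);
        try {
            // Bounded wait, in contrast to the unbounded wait in the stack trace.
            pending.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            pending.cancel(true); // interrupt the stuck or failed attempt
            return false;
        }
    }

    /**
     * Retries until an attempt succeeds. Returns the 1-based attempt number,
     * or -1 if maxAttempts attempts all timed out or failed. A real DataNode
     * would retry indefinitely; the cap only keeps this sketch finite.
     */
    public static int heartbeatUntilAnswered(Callable<Void> heartbeat, long timeoutMs,
                                             long retryDelayMs, int maxAttempts)
            throws InterruptedException {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                if (tryHeartbeat(pool, heartbeat, timeoutMs)) {
                    return attempt;
                }
                Thread.sleep(retryDelayMs);
            }
            return -1;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated peer: the first two calls hang (peer "rebooting"), the third answers.
        AtomicInteger calls = new AtomicInteger();
        Callable<Void> flaky = () -> {
            if (calls.incrementAndGet() < 3) {
                Thread.sleep(60_000); // pretend the peer never responds
            }
            return null;
        };
        // Short timeout for the demo; the issue's configuration implies 14000 ms.
        int attempts = heartbeatUntilAnswered(flaky, 200, 50, 10);
        System.out.println("answered on attempt " + attempts);
    }
}
```

In this sketch the caller, not the RPC layer, enforces the deadline; the issue instead proposes that the IPC client itself honor the configured interval once the proxy is created via RPC.getProtocolProxy.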