Date: Fri, 30 Jan 2015 21:05:35 +0000 (UTC)
From: "Chris Nauroth (JIRA)"
To: hdfs-dev@hadoop.apache.org
Reply-To: hdfs-dev@hadoop.apache.org
Subject: [jira] [Created] (HDFS-7714) Simultaneous restart of HA NameNodes and DataNode can cause DataNode to register successfully with only one NameNode.

Chris Nauroth created HDFS-7714:
-----------------------------------

             Summary: Simultaneous restart of HA NameNodes and DataNode can cause DataNode to register successfully with only one NameNode.
                 Key: HDFS-7714
                 URL: https://issues.apache.org/jira/browse/HDFS-7714
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.6.0
            Reporter: Chris Nauroth

In an HA deployment, DataNodes must register with both NameNodes and send periodic heartbeats and block reports to both.
However, if NameNodes and DataNodes are restarted simultaneously, then this can trigger a race condition in registration. The end result is that the {{BPServiceActor}} for one NameNode terminates, but the {{BPServiceActor}} for the other NameNode remains alive. The DataNode process is then in a "half-alive" state where it only heartbeats and sends block reports to one of the NameNodes. This could cause a loss of storage capacity after an HA failover. The DataNode process would have to be restarted to resolve this.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
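
To make the "half-alive" state described in the report concrete, here is a minimal sketch of how it could be classified. This is not Hadoop code: the class {{ActorLivenessCheck}}, the {{State}} enum, and the NameNode labels "nn1" and "nn2" are hypothetical names chosen for illustration. The sketch assumes the caller already knows, for each NameNode, whether the corresponding actor thread is still running.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model, not part of Hadoop: each map entry says whether the
// actor thread serving one NameNode is still running.
final class ActorLivenessCheck {

    enum State { ALL_ALIVE, HALF_ALIVE, ALL_DEAD }

    // Classifies a DataNode by counting how many of its per-NameNode actors are alive.
    static State classify(Map<String, Boolean> actorAliveByNameNode) {
        if (actorAliveByNameNode.isEmpty()) {
            throw new IllegalArgumentException("at least one NameNode is required");
        }
        long alive = actorAliveByNameNode.values().stream()
                .filter(Boolean::booleanValue)
                .count();
        if (alive == actorAliveByNameNode.size()) {
            return State.ALL_ALIVE;
        }
        // Some but not all actors running is the "half-alive" case from the report.
        return alive == 0 ? State.ALL_DEAD : State.HALF_ALIVE;
    }

    public static void main(String[] args) {
        Map<String, Boolean> actors = new LinkedHashMap<>();
        actors.put("nn1", true);   // actor for the first NameNode is still running
        actors.put("nn2", false);  // actor for the second NameNode has terminated
        System.out.println(classify(actors)); // prints HALF_ALIVE
    }
}
```

A check along these lines would only surface the condition. Per the report, recovering from it requires restarting the DataNode process.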