Mailing-List: contact common-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-issues@hadoop.apache.org
Date: Fri, 18 Apr 2014 07:45:16 +0000 (UTC)
From: "Vinayakumar B (JIRA)" <jira@apache.org>
To: common-issues@hadoop.apache.org
Message-ID: <JIRA.12690252.1390392323089.137531.1397807116673@arcas>
In-Reply-To: <JIRA.12690252.1390392323089@arcas>
References: <JIRA.12690252.1390392323089@arcas>
Subject: [jira] [Updated] (HADOOP-10251) Both NameNodes could be in STANDBY
 State if SNN network is unstable
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinayakumar B updated HADOOP-10251:
-----------------------------------

    Attachment: HADOOP-10251.patch

Updated the Javadoc for the interface.
Removed the implementation-level details from the interface.

> Both NameNodes could be in STANDBY State if SNN network is unstable
> -------------------------------------------------------------------
>
>                 Key: HADOOP-10251
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10251
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.2.0
>            Reporter: Vinayakumar B
>            Assignee: Vinayakumar B
>            Priority: Critical
>         Attachments: HADOOP-10251.patch, HADOOP-10251.patch, HADOOP-10251.patch, HADOOP-10251.patch, HADOOP-10251.patch
>
>
> The following corner-case scenario occurred in one of our clusters.
> 1. NN1 was Active and NN2 was Standby.
> 2. The network on the NN2 machine was slow.
> 3. NN1 was shut down.
> 4. The NN2 ZKFC got the notification and tried to look up the old active for fencing. (This took a little longer, again due to the slow network.)
> 5. In the meantime, NN1 was restarted by our automatic monitoring, and its ZKFC made it Active.
> 6. The NN2 ZKFC then got NN1 as the old active and gracefully fenced NN1 to STANDBY.
> 7. Before writing the ActiveBreadCrumb to ZK, the NN2 ZKFC hit a session timeout and was shut down before making NN2 Active.
> *Now the cluster has both NameNodes in STANDBY.*
> The NN1 ZKFC still thinks that its NameNode is in the Active state.
> The NN2 ZKFC is waiting for election.
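
The sequence above can be replayed as a small state walk-through. This is only an illustrative sketch of the ordering of events in the report, not the ZKFC code or the patch: the class BothStandbyScenario, the HAState enum, the Outcome holder and the replay() method are all hypothetical names, and the ZooKeeper session timeout is reduced to a boolean flag.

```java
// Illustrative replay of the reported event order. None of these types are
// Hadoop classes; every name here is hypothetical.
class BothStandbyScenario {
    enum HAState { ACTIVE, STANDBY, DOWN }

    /** Final state of both NameNodes plus what the NN1 ZKFC believes. */
    static final class Outcome {
        final HAState nn1;
        final HAState nn2;
        final HAState nn1ZkfcView;

        Outcome(HAState nn1, HAState nn2, HAState nn1ZkfcView) {
            this.nn1 = nn1;
            this.nn2 = nn2;
            this.nn1ZkfcView = nn1ZkfcView;
        }
    }

    /**
     * Replays steps 1-7. The flag models step 7: whether the NN2 ZKFC loses
     * its ZooKeeper session after fencing NN1 but before activating NN2.
     */
    static Outcome replay(boolean nn2SessionExpiresBeforeActivation) {
        // Step 1: NN1 is Active, NN2 is Standby.
        HAState nn1 = HAState.ACTIVE;
        HAState nn2 = HAState.STANDBY;
        HAState nn1ZkfcView = HAState.ACTIVE;

        // Step 3: NN1 is shut down.
        nn1 = HAState.DOWN;
        nn1ZkfcView = HAState.DOWN;

        // Step 4: the NN2 ZKFC starts looking up the old active (slow network).

        // Step 5: monitoring restarts NN1 and its ZKFC makes it Active.
        nn1 = HAState.ACTIVE;
        nn1ZkfcView = HAState.ACTIVE;

        // Step 6: the NN2 ZKFC finishes the lookup, sees NN1 as the old active
        // and gracefully fences it. The NN1 ZKFC is not told, so its view of
        // NN1 stays ACTIVE.
        nn1 = HAState.STANDBY;

        // Step 7: NN2 only becomes Active if its ZKFC session survives.
        if (!nn2SessionExpiresBeforeActivation) {
            nn2 = HAState.ACTIVE;
        }
        return new Outcome(nn1, nn2, nn1ZkfcView);
    }

    public static void main(String[] args) {
        Outcome o = replay(true);
        System.out.println("NN1=" + o.nn1 + " NN2=" + o.nn2
                + " NN1 ZKFC view=" + o.nn1ZkfcView);
    }
}
```

With the flag set to true, the replay ends with both NameNodes in STANDBY while the NN1 ZKFC view is still ACTIVE, which matches the state described in the report. With the flag set to false, NN2 becomes Active and the failover completes.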

