Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hbase.apache.org
Date: Sun, 15 Nov 2015 23:03:11 +0000 (UTC)
From: "stack (JIRA)" <jira@apache.org>
To: dev@hbase.apache.org
Message-ID: <JIRA.12912628.1447366190000.75753.1447628591139@Atlassian.JIRA>
In-Reply-To: <JIRA.12912628.1447366190000@Atlassian.JIRA>
References: <JIRA.12912628.1447366190000@Atlassian.JIRA>
 <JIRA.12912628.1447366190534@arcas>
Subject: [jira] [Resolved] (HBASE-14802) Replaying server crash recovery
 procedure after a failover causes incorrect handling of deadservers
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HBASE-14802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-14802.
---------------------------
    Resolution: Fixed

Applied v4 to the master branch. Applied 14802.addendum.branch-1.txt to branch-1 and branch-1.2 so that they are the same as master.

Thanks for the patch, [~ashu210890]. Let's see how it does. Removing it made the master branch pass again, but it is odd that it was fine as-is on branch-1 and branch-1.2.

> Replaying server crash recovery procedure after a failover causes incorrect handling of deadservers
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-14802
>                 URL: https://issues.apache.org/jira/browse/HBASE-14802
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.0.0, 1.2.0, 1.2.1
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>             Fix For: 2.0.0, 1.2.0, 1.3.0
>
>         Attachments: 14802.addendum.branch-1.txt, HBASE-14802-1.patch, HBASE-14802-2.patch, HBASE-14802-3.patch, HBASE-14802-4.patch, HBASE-14802.patch
>
>
> Dead servers are processed by launching a ServerCrashProcedure for a server after it has been added to the dead servers list.
> Every time a server is added to the dead list, a counter, "numProcessing", is incremented; it is decremented when a crash recovery procedure finishes. Since adding a dead server and recovering it are two separate events, the counter can become inconsistent.
> If a master failover occurs in the middle of crash recovery, the numProcessing counter resets, but the ServerCrashProcedure is replayed by the new master. This causes the counter to go negative and makes the master think that dead servers are still being recovered.
> As a result, the balancer ceases to run after such a failover.
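
For readers following along, the failure described in the report can be reproduced in miniature. The sketch below is not the HBase code: the class DeadServerSketch and its methods add, finish, and stillProcessing are hypothetical stand-ins, and it assumes the "in progress" check simply treats any nonzero counter as pending work. It only shows how a decrement without a matching increment leaves a freshly started master with a negative count.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified stand-in for a master's dead-server bookkeeping. */
class DeadServerSketch {
    private final Set<String> deadServers = new HashSet<>();
    // In-memory only, so a newly started master begins again at zero.
    private int numProcessing = 0;

    /** A server is declared dead; a crash procedure is launched afterwards. */
    void add(String serverName) {
        deadServers.add(serverName);
        numProcessing++;
    }

    /** A crash recovery procedure for the server has finished. */
    void finish(String serverName) {
        numProcessing--;
    }

    /** Assumed check: any nonzero count is read as recovery still pending. */
    boolean stillProcessing() {
        return numProcessing != 0;
    }

    int getNumProcessing() {
        return numProcessing;
    }

    public static void main(String[] args) {
        // The old master records the dead server and starts recovery.
        DeadServerSketch oldMaster = new DeadServerSketch();
        oldMaster.add("rs1");

        // Failover: the new master has fresh in-memory state (counter is 0).
        DeadServerSketch newMaster = new DeadServerSketch();

        // The replayed procedure completes on the new master with no matching add.
        newMaster.finish("rs1");

        System.out.println(newMaster.getNumProcessing());  // prints -1
        System.out.println(newMaster.stillProcessing());   // prints true
    }
}
```

In this sketch, anything gated on stillProcessing() returning false would never run again on the new master, which matches the balancer symptom in the report.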

