Date: Sat, 14 Nov 2015 11:26:11 +0000 (UTC)
From: "Hudson (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-14802) Replaying server crash recovery procedure after a failover causes incorrect handling of deadservers

    [ https://issues.apache.org/jira/browse/HBASE-14802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005340#comment-15005340 ]

Hudson commented on HBASE-14802:
--------------------------------

FAILURE: Integrated in HBase-Trunk_matrix #463 (See [https://builds.apache.org/job/HBase-Trunk_matrix/463/])
HBASE-14802 Replaying server crash recovery procedure after a failover (stack: rev 7c3c9ac9c67cd03f9a915f528d22cb4ed81cb6e8)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/DeadServer.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestDeadServer.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java


> Replaying server crash recovery procedure after a failover causes incorrect handling of deadservers
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-14802
>                 URL: https://issues.apache.org/jira/browse/HBASE-14802
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.0.0, 1.2.0, 1.2.1
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>             Fix For: 2.0.0, 1.2.0, 1.3.0
>
>         Attachments: HBASE-14802-1.patch, HBASE-14802-2.patch, HBASE-14802-3.patch, HBASE-14802.patch
>
>
> Dead servers are processed by launching a ServerCrashProcedure for a server after it has been added to the dead servers list.
> Every time a server is added to the dead list, a counter "numProcessing" is incremented; it is decremented when a crash recovery procedure finishes. Because adding a dead server and recovering it are two separate events, the counter can become inconsistent.
> If a master failover occurs in the middle of crash recovery, the numProcessing counter is reset, but the ServerCrashProcedure is replayed by the new master. When the replayed procedure finishes, it decrements a counter that was never incremented on the new master, so the counter goes negative and the master thinks that dead servers are still being recovered.
> As a consequence, the balancer ceases to run after such a failover.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
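
The counter behaviour described in the issue above can be reproduced with a small stand-alone sketch. It is not the HBase code: `DeadServerTracker`, its methods and `simulateFailover` are hypothetical names made up for illustration, and the sketch assumes the "recovery in progress" check is simply "counter is not zero". It models only the sequence from the report: increment on the old master, reset on failover, decrement by the replayed procedure on the new master.

```java
public class DeadServerCounterSketch {

    /** Hypothetical stand-in for the master's in-memory dead-server bookkeeping. */
    static final class DeadServerTracker {
        // Lives only in memory, so a newly active master starts again at 0.
        private int numProcessing = 0;

        /** Called when a server is added to the dead servers list. */
        void add(String serverName) {
            numProcessing++;
        }

        /** Called when a crash recovery procedure for a server finishes. */
        void finish(String serverName) {
            numProcessing--;
        }

        /** Assumption: any non-zero count is read as "recovery still in progress". */
        boolean isProcessing() {
            return numProcessing != 0;
        }

        int getNumProcessing() {
            return numProcessing;
        }
    }

    /** Replays the scenario from the report and returns the new master's counter. */
    static int simulateFailover() {
        DeadServerTracker oldMaster = new DeadServerTracker();
        oldMaster.add("rs1"); // counter on the old master is now 1

        // Failover: the new master has fresh in-memory state (counter = 0),
        // but the persisted crash procedure for rs1 is replayed on it.
        DeadServerTracker newMaster = new DeadServerTracker();

        // The replayed procedure completes and decrements a counter
        // that was never incremented on this master.
        newMaster.finish("rs1");
        return newMaster.getNumProcessing();
    }

    public static void main(String[] args) {
        System.out.println("numProcessing after replay = " + simulateFailover());
    }
}
```

In this model the new master ends with a counter of -1, so a "not zero" check keeps reporting that dead servers are in recovery, which matches the symptom that the balancer stops running.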