From: "Lars Hofhansl (JIRA)"
To: issues@hbase.apache.org
Date: Sun, 10 May 2015 00:53:59 +0000 (UTC)
Subject: [jira] [Updated] (HBASE-13618) ReplicationSource is too eager to remove sinks

     [ https://issues.apache.org/jira/browse/HBASE-13618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-13618:
----------------------------------
    Attachment: 13618-v2.txt

I like -v2 better since it simplifies the success code a bit. Instead of decreasing the fail counter, a success resets the counter.
[~apurtell], can you have a quick look again?

> ReplicationSource is too eager to remove sinks
> ----------------------------------------------
>
>                 Key: HBASE-13618
>                 URL: https://issues.apache.org/jira/browse/HBASE-13618
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>            Priority: Minor
>         Attachments: 13618-v2.txt, 13618.txt
>
>
> Looking at the replication for some other reason I noticed that the replication source might be a bit too eager to remove sinks from the list of valid sinks.
> The current logic allows a sink to fail N times (default 3) and then it will be removed from the sinks. But note that this failure count is never reduced, so given enough runtime and some network glitches _every_ sink will eventually be removed. When all sinks are removed, the source picks new sinks and the counter is set to 0 for all of them.
> I think we should change to reset the counter each time we successfully replicate something to the sink (which proves the sink isn't dead). Or we could decrease the counter each time we successfully replicate, that might be better - if we consistently fail more attempts than we succeed the sink should be removed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
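
For illustration, the reset-on-success bookkeeping discussed above could be sketched roughly as follows. This is not the attached patch and not the actual HBase code: the class name SinkFailureTracker, its methods, and the use of plain strings to identify sinks are all hypothetical, and the sketch assumes a sink is dropped on its Nth consecutive failure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of per-sink failure tracking with reset-on-success.
 * Names and structure are assumptions for illustration, not HBase's API.
 */
class SinkFailureTracker {
    private final int badSinkThreshold;               // N in the description (default 3 there)
    private final List<String> sinks;                 // currently valid sinks
    private final Map<String, Integer> failures = new HashMap<>();

    SinkFailureTracker(List<String> initialSinks, int badSinkThreshold) {
        this.sinks = new ArrayList<>(initialSinks);
        this.badSinkThreshold = badSinkThreshold;
    }

    /**
     * Records one failed replication attempt against the sink.
     * Assumption: the sink is removed once its count reaches the threshold.
     * Returns true if this call removed the sink.
     */
    boolean reportFailure(String sink) {
        int count = failures.merge(sink, 1, Integer::sum);
        if (count >= badSinkThreshold) {
            sinks.remove(sink);
            failures.remove(sink);
            return true;
        }
        return false;
    }

    /** A successful replication shows the sink is alive, so its count is cleared. */
    void reportSuccess(String sink) {
        failures.remove(sink);
    }

    /** Returns a copy of the sinks that have not been removed. */
    List<String> activeSinks() {
        return new ArrayList<>(sinks);
    }
}
```

With a threshold of 3, two failures followed by a success and then two more failures leave the sink in place, because the success cleared the count. Only three failures in a row remove it. The decrement variant mentioned in the description would instead lower the count by one in reportSuccess.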