Return-Path:
X-Original-To: apmail-hbase-issues-archive@www.apache.org
Delivered-To: apmail-hbase-issues-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id F1A73112D3 for ; Tue, 9 Sep 2014 04:51:28 +0000 (UTC)
Received: (qmail 9848 invoked by uid 500); 9 Sep 2014 04:51:28 -0000
Delivered-To: apmail-hbase-issues-archive@hbase.apache.org
Received: (qmail 9810 invoked by uid 500); 9 Sep 2014 04:51:28 -0000
Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Delivered-To: mailing list issues@hbase.apache.org
Received: (qmail 9798 invoked by uid 99); 9 Sep 2014 04:51:28 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 09 Sep 2014 04:51:28 +0000
Date: Tue, 9 Sep 2014 04:51:28 +0000 (UTC)
From: "Enis Soztutar (JIRA)"
To: issues@hbase.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Updated] (HBASE-9591) [replication] getting "Current list of sinks is out of date" all the time when a source is recovered
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

     [ https://issues.apache.org/jira/browse/HBASE-9591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-9591:
---------------------------------
    Fix Version/s:     (was: 0.99.0)
                   0.99.1

> [replication] getting "Current list of sinks is out of date" all the time when a source is recovered
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-9591
>                 URL: https://issues.apache.org/jira/browse/HBASE-9591
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.96.0
>            Reporter: Jean-Daniel Cryans
>            Priority: Minor
>             Fix For: 0.99.1
>
>
> I tried killing a region server when the slave cluster was down, from that point on my log was filled with:
> {noformat}
> 2013-09-20 00:31:03,942 INFO  [regionserver60020.replicationSource,1] org.apache.hadoop.hbase.replication.regionserver.ReplicationSinkManager: Current list of sinks is out of date, updating
> 2013-09-20 00:31:04,226 INFO  [ReplicationExecutor-0.replicationSource,1-jdec2hbase0403-4,60020,1379636329634] org.apache.hadoop.hbase.replication.regionserver.ReplicationSinkManager: Current list of sinks is out of date, updating
> {noformat}
> The first log line is from the normal source, the second is the recovered one. When we try to replicate, we call replicationSinkMgr.getReplicationSink() and if the list of machines was refreshed since the last time then we call chooseSinks() which in turn refreshes the list of sinks and resets our lastUpdateToPeers. The next source will notice the change, and will call chooseSinks() too. The first source is coming for another round, sees the list was refreshed, calls chooseSinks() again. It happens forever until the recovered queue is gone.
> We could have all the sources going to the same cluster share a thread-safe ReplicationSinkManager. We could also manage the same cluster separately for each source. Or even easier, if the list we get in chooseSinks() is the same we had before, consider it a noop.
> What do you think [~gabriel.reid]?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
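
The refresh loop described in the report, and the last of the proposed fixes (treat an unchanged sink list as a noop), can be illustrated with a minimal, self-contained Java sketch. This is a toy model and not the HBase code: SharedPeer, SinkManager, simulate() and the server names are hypothetical. Only getReplicationSink(), chooseSinks() and lastUpdateToPeers are names taken from the report, and their bodies here are assumptions about the behaviour it describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SinkRefreshSketch {

    /** Hypothetical stand-in for the per-peer state that both sources observe. */
    static class SharedPeer {
        List<String> servers = new ArrayList<>();
        long lastChange = 0;               // logical timestamp of the last list refresh
        final boolean noopWhenUnchanged;   // the fix proposed in the report

        SharedPeer(boolean noopWhenUnchanged) {
            this.noopWhenUnchanged = noopWhenUnchanged;
        }

        void refresh(List<String> fetched) {
            // Proposed fix: an identical list does not count as a change.
            if (noopWhenUnchanged && fetched.equals(servers)) {
                return;
            }
            servers = new ArrayList<>(fetched);
            lastChange++;
        }
    }

    /** Hypothetical per-source manager; each source owns its own instance. */
    static class SinkManager {
        final SharedPeer peer;
        long lastUpdateToPeers = -1;
        int refreshes = 0;                 // how often "out of date" would be logged

        SinkManager(SharedPeer peer) {
            this.peer = peer;
        }

        void getReplicationSink(List<String> fetched) {
            // "Current list of sinks is out of date, updating"
            if (peer.lastChange > lastUpdateToPeers) {
                chooseSinks(fetched);
            }
        }

        void chooseSinks(List<String> fetched) {
            refreshes++;
            peer.refresh(fetched);
            lastUpdateToPeers = peer.lastChange;
        }
    }

    /** Runs a normal and a recovered source against the same peer; returns total refreshes. */
    static int simulate(boolean noopWhenUnchanged, int rounds) {
        SharedPeer peer = new SharedPeer(noopWhenUnchanged);
        SinkManager normal = new SinkManager(peer);
        SinkManager recovered = new SinkManager(peer);
        List<String> fetched = Arrays.asList("rs1:60020", "rs2:60020"); // made-up servers
        for (int i = 0; i < rounds; i++) {
            normal.getReplicationSink(fetched);
            recovered.getReplicationSink(fetched);
        }
        return normal.refreshes + recovered.refreshes;
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + simulate(false, 5)); // 10: both refresh every round
        System.out.println("with fix:    " + simulate(true, 5));  // 2: one initial refresh each
    }
}
```

In this model each source's refresh bumps the shared timestamp, so the other source always sees a newer change and refreshes in turn. Skipping the bump when the fetched list equals the current one breaks the cycle without sharing a manager between sources.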