Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Thu, 14 Mar 2013 04:08:13 +0000 (UTC)
From: "Lars Hofhansl (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.12636935.1363224233740.438911.1363234093298@arcas>
In-Reply-To: <JIRA.12636935.1363224233740@arcas>
References: <JIRA.12636935.1363224233740@arcas>
Subject: [jira] [Commented] (HBASE-8099)
 ReplicationZookeeper.copyQueuesFromRSUsingMulti should not return any
 queues if it failed to execute.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HBASE-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602012#comment-13602012 ] 

Lars Hofhansl commented on HBASE-8099:
--------------------------------------

That works. Personally I'd probably just return queues in the first case and do a clear() for the second like this:
{code}
-      if (peerIdsToProcess == null) return null; // node already processed
+      if (peerIdsToProcess == null) return queues; // node already processed
...
       LOG.warn("Got exception in copyQueuesFromRSUsingMulti: ", e);
+      queues.clear();
{code}
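
To make the two cases concrete, here is a minimal, self-contained sketch of the return contract suggested above. It is not the actual ReplicationZookeeper code: the class QueueCopySketch, the MultiOp interface, the copyQueues signature and the placeholder log names are all hypothetical stand-ins, and the ZooKeeper multi call is replaced by a callback so the example runs on its own.

```java
import java.util.List;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class QueueCopySketch {

    /** Hypothetical stand-in for the atomic ZooKeeper multi operation. */
    interface MultiOp {
        void execute() throws Exception;
    }

    /**
     * Returns the copied queues, or an empty map if there was nothing to
     * process or the multi operation failed. Never returns null.
     */
    static SortedMap<String, SortedSet<String>> copyQueues(
            List<String> peerIdsToProcess, MultiOp multi) {
        SortedMap<String, SortedSet<String>> queues = new TreeMap<>();
        if (peerIdsToProcess == null) {
            return queues; // node already processed: empty, not null
        }
        try {
            for (String peerId : peerIdsToProcess) {
                SortedSet<String> logs = new TreeSet<>();
                logs.add("log-of-" + peerId); // placeholder log name
                queues.put(peerId, logs);
            }
            multi.execute(); // may throw if the multi cannot be applied
        } catch (Exception e) {
            System.err.println("Got exception in copyQueues: " + e);
            queues.clear(); // do not hand back queues that were not moved
        }
        return queues;
    }

    public static void main(String[] args) {
        List<String> peers = List.of("1", "2");
        System.out.println(copyQueues(peers, () -> { }));
        System.out.println(copyQueues(peers, () -> {
            throw new Exception("simulated multi failure");
        }));
    }
}
```

With this shape the caller only ever has to handle an empty map, whether the node was already processed or the multi failed.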

Maybe while we're at it, we could add a random jitter to the failover.
Add a Random member to ReplicationSourceManager and then do this in NodeFailoverWorker:
{code}
-        Thread.sleep(sleepBeforeFailover);
+        Thread.sleep(sleepBeforeFailover + (long)(random.nextFloat()*sleepBeforeFailover));
{code}
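
As a rough illustration of what that expression does, the sketch below computes the jittered delay in isolation. The class JitterSketch, the jitteredSleep helper and the 2000 ms base value are assumptions made for the example, not taken from the patch. Since nextFloat() returns a value in [0, 1), the delay is at least sleepBeforeFailover and stays below twice that value, which presumably keeps the regionservers from all attempting the failover at the same moment.

```java
import java.util.Random;

class JitterSketch {

    /**
     * Base delay plus a random extra of up to (but not including) one more
     * base delay, mirroring the expression in the diff above.
     */
    static long jitteredSleep(long sleepBeforeFailover, Random random) {
        return sleepBeforeFailover
            + (long) (random.nextFloat() * sleepBeforeFailover);
    }

    public static void main(String[] args) {
        Random random = new Random();
        long sleepBeforeFailover = 2000L; // hypothetical base delay in ms
        for (int i = 0; i < 5; i++) {
            System.out.println(jitteredSleep(sleepBeforeFailover, random) + " ms");
        }
    }
}
```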

                
> ReplicationZookeeper.copyQueuesFromRSUsingMulti should not return any queues if it failed to execute.
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8099
>                 URL: https://issues.apache.org/jira/browse/HBASE-8099
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Himanshu Vashishtha
>            Priority: Blocker
>             Fix For: 0.94.7
>
>         Attachments: HBase-8099-94.patch, HBase-8099-94-v2.patch, HBase-8099-trunk-2.patch, HBase-8099-trunk.patch
>
>
> We just ran into an interesting scenario. We restarted a cluster that was set up as a replication source.
> The stop went cleanly.
> Upon restart *all* regionservers aborted within a few seconds with variations of these errors:
> http://pastebin.com/3iQVuBqS
