hbase-issues mailing list archives

From "Lars Hofhansl (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-8099) ReplicationZookeeper.copyQueuesFromRSUsingMulti should not return any queues if it failed to execute.
Date Thu, 14 Mar 2013 04:08:13 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602012#comment-13602012 ]

Lars Hofhansl commented on HBASE-8099:
--------------------------------------

That works. Personally I'd probably just return queues in the first case and do a clear()
for the second like this:
{code}
-      if (peerIdsToProcess == null) return null; // node already processed
+      if (peerIdsToProcess == null) return queues; // node already processed
...
       LOG.warn("Got exception in copyQueuesFromRSUsingMulti: ", e);
+      queues.clear();
{code}
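
To make the intended contract concrete, here is a minimal, self-contained sketch of the two changes above. It is not the actual ReplicationZookeeper code: the class name, the MultiOp interface, the queue naming, and the simplified signature are assumptions made for illustration. The point is only that the "already processed" case returns the (empty) map rather than null, and that a failure clears any partially filled map before returning.

```java
import java.util.List;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class CopyQueuesSketch {

    /** Hypothetical stand-in for the atomic ZooKeeper multi call. */
    interface MultiOp {
        void execute() throws Exception;
    }

    /**
     * Simplified illustration of the proposed return contract.
     * Callers always get a non-null map; it is empty unless the multi succeeded.
     */
    static SortedMap<String, SortedSet<String>> copyQueues(
            List<String> peerIdsToProcess, MultiOp multi) {
        SortedMap<String, SortedSet<String>> queues = new TreeMap<>();
        if (peerIdsToProcess == null) {
            return queues; // node already processed: empty map, not null
        }
        try {
            // Collect the queues we intend to claim (names are illustrative).
            for (String peerId : peerIdsToProcess) {
                queues.put(peerId + "-deadRS", new TreeSet<>());
            }
            multi.execute(); // all-or-nothing move of the queues
        } catch (Exception e) {
            // The multi failed, so nothing was actually claimed.
            queues.clear();
        }
        return queues;
    }
}
```

With this shape the caller can iterate over the result unconditionally, without a null check, and never sees queues that were not actually moved.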

Maybe while we're at it, we could add a random jitter to the failover.
Add a Random member to ReplicationSourceManager and then do this in NodeFailoverWorker:
{code}
-        Thread.sleep(sleepBeforeFailover);
+        Thread.sleep(sleepBeforeFailover + (long)(random.nextFloat()*sleepBeforeFailover));
{code}
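
As a standalone illustration of that expression (the class and method names here are hypothetical, not part of HBase), the delay computation can be sketched as below. Since nextFloat() returns a value in [0, 1), the resulting sleep lands roughly between one and two times sleepBeforeFailover, so regionservers that notice the same dead node are less likely to all attempt the failover at the same instant.

```java
import java.util.Random;

public class FailoverJitterSketch {

    /**
     * Returns the base failover delay plus a random extra of up to
     * (roughly) one more base delay, mirroring the expression in the diff.
     */
    static long jitteredSleep(long sleepBeforeFailover, Random random) {
        return sleepBeforeFailover
                + (long) (random.nextFloat() * sleepBeforeFailover);
    }

    public static void main(String[] args) {
        Random random = new Random();
        long base = 2000L; // illustrative base delay in milliseconds
        for (int i = 0; i < 5; i++) {
            System.out.println(jitteredSleep(base, random));
        }
    }
}
```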

                
> ReplicationZookeeper.copyQueuesFromRSUsingMulti should not return any queues if it failed to execute.
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8099
>                 URL: https://issues.apache.org/jira/browse/HBASE-8099
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Himanshu Vashishtha
>            Priority: Blocker
>             Fix For: 0.94.7
>
>         Attachments: HBase-8099-94.patch, HBase-8099-94-v2.patch, HBase-8099-trunk-2.patch, HBase-8099-trunk.patch
>
>
> We just ran into an interesting scenario. We restarted a cluster that was set up as a replication source.
> The stop went cleanly.
> Upon restart *all* regionservers aborted within a few seconds with variations of these errors:
> http://pastebin.com/3iQVuBqS
