hbase-issues mailing list archives

From "Jean-Marc Spaggiari (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-10100) Hbase replication cluster can have varying peers under certain conditions
Date Sat, 07 Dec 2013 11:52:39 GMT

    [ https://issues.apache.org/jira/browse/HBASE-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13842197#comment-13842197 ]

Jean-Marc Spaggiari commented on HBASE-10100:

[~jdcryans] probably something you want to look at...

> Hbase replication cluster can have varying peers under certain conditions
> -------------------------------------------------------------------------
>                 Key: HBASE-10100
>                 URL: https://issues.apache.org/jira/browse/HBASE-10100
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.5, 0.95.0, 0.96.0
>            Reporter: churro morales
> We were trying to replicate HBase data over to a new datacenter recently. After we turned
on replication, we ran our copy tables, and then noticed that VerifyReplication reported discrepancies.
> We ran list_peers and it returned both peers: the original datacenter we were
replicating to and the new datacenter (this was correct).
> While grepping through regionserver logs we noticed that a few regionservers
had the following entry:
> 2013-09-26 10:55:46,907 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
Error while adding a new peer java.net.UnknownHostException: xxx.xxx.flurry.com (this was
due to a transient DNS issue)
> Thus a very small subset of our regionservers was not replicating to this new cluster
while most were.
> We probably don't want to abort if this type of issue comes up; aborting could
be fatal if someone does an "add_peer" operation with a typo, potentially shutting
down the whole cluster.
> One solution I can think of is keeping a boolean flag in ReplicationSourceManager
that tracks whether there was an error adding a peer (errorAddingPeer). Then in logPositionAndCleanOldLogs
we can do something like:
> {code}
> if (errorAddingPeer) {
>   LOG.error("There was an error adding a peer, logs will not be marked for deletion");
>   return;
> }
> {code}
> thus we are not deleting these logs from the queue. You will notice the replication
queue growing on certain machines, and you can still replay the logs, thus avoiding a lengthy
copy table.
> I have a patch (with unit test) for the above proposal, if everyone thinks that is an
okay solution.
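As a rough self-contained sketch of the flag idea above (the class, the queue, and the method signatures here are simplified stand-ins for illustration, not the real ReplicationSourceManager API):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified stand-in: a per-regionserver manager that refuses to clean
// old WALs once any peer has failed to register, so they can be replayed.
class ReplicationLogCleanerSketch {
    // Set once if adding a replication peer ever failed on this server.
    private volatile boolean errorAddingPeer = false;
    // Queue of WAL names still needed for replication (simplified).
    private final Deque<String> logQueue = new ArrayDeque<>();

    void addLog(String log) { logQueue.add(log); }

    void recordPeerFailure() { errorAddingPeer = true; }

    // Mirrors the proposed guard in logPositionAndCleanOldLogs.
    void logPositionAndCleanOldLogs(String upToLog) {
        if (errorAddingPeer) {
            System.err.println("There was an error adding a peer, "
                + "logs will not be marked for deletion");
            return;
        }
        // Safe to delete everything before upToLog: all peers registered.
        while (!logQueue.isEmpty() && !logQueue.peek().equals(upToLog)) {
            logQueue.poll();
        }
    }

    int queueSize() { return logQueue.size(); }
}
```

The trade-off is disk growth on the affected regionservers, but that is visible in the queue metrics and recoverable, unlike silently dropped edits.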
> An additional idea would be to add some retry logic inside the PeersWatcher class for
the nodeChildrenChanged method, so that if a transient issue occurs we could recover
without having to bounce that particular regionserver.
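A generic bounded-retry helper along the lines of that idea might look like the following (a hedged sketch; the helper name, attempt counts, and backoff policy are illustrative, not part of the actual PeersWatcher code):

```java
import java.util.concurrent.Callable;

// Retries a task a bounded number of times with linear backoff, so a
// transient DNS or ZooKeeper hiccup does not permanently wedge the caller.
class RetryOnTransientFailure {
    static <T> T call(Callable<T> task, int maxAttempts, long sleepMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMs * attempt); // linear backoff
                }
            }
        }
        throw last; // all attempts failed; surface the final cause
    }
}
```

A permanent failure (e.g. the add_peer typo case) would still surface after the last attempt, so this only masks genuinely transient errors.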
> Would love to hear everyone's thoughts.

This message was sent by Atlassian JIRA
