hbase-issues mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9591) [replication] getting "Current list of sinks is out of date" all the time when a source is recovered
Date Mon, 23 Sep 2013 14:50:04 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13774597#comment-13774597
] 

Gabriel Reid commented on HBASE-9591:
-------------------------------------

Sorry for being so slow at looking at this.

I just finally took a closer look, and now I'm clear on what's going on. I'm leaning
towards managing each cluster fully separately (i.e. having a separate ReplicationPeers instance
per peer cluster), though I was wondering what kind of impact that would have on resource usage;
at first glance, it looks like it should be fine. Taking this approach seems better in terms
of avoiding other variations of this bug in the future, which could slip in if we go with the
"noop if chooseSinks returns the same thing" approach.

On the other hand, the "noop if chooseSinks returns the same thing" approach will probably
be quite a bit easier to implement.

Do you have a personal preference for the approach, or ideas on what would be "best"?
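As a rough illustration of the "noop if chooseSinks returns the same thing" idea, here is a much-simplified sketch; the class, field, and method names below are illustrative only, not the actual HBase ones:

```java
import java.util.List;

// Hypothetical sketch: chooseSinks() only counts as a refresh (and only
// resets the equivalent of lastUpdateToPeers) when the fetched server list
// actually differs from the one we already have.
final class NoopSinkManager {
    private List<String> currentServers = List.of();
    private int realRefreshes = 0;

    // Returns true only when a real refresh happened.
    boolean chooseSinks(List<String> fetchedServers) {
        if (fetchedServers.equals(currentServers)) {
            // Same membership: treat as a noop, so other sources don't see
            // a spurious "list changed" signal and refresh in response.
            return false;
        }
        currentServers = List.copyOf(fetchedServers);
        realRefreshes++;  // stands in for re-picking sinks and resetting the timestamp
        return true;
    }

    int realRefreshes() { return realRefreshes; }
}
```

With this, two sources polling the same unchanged server list would each trigger at most one real refresh instead of refreshing each other forever.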
                
> [replication] getting "Current list of sinks is out of date" all the time when a source
> is recovered
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-9591
>                 URL: https://issues.apache.org/jira/browse/HBASE-9591
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.96.0
>            Reporter: Jean-Daniel Cryans
>            Priority: Minor
>             Fix For: 0.96.1
>
>
> I tried killing a region server when the slave cluster was down, from that point on my
> log was filled with:
> {noformat}
> 2013-09-20 00:31:03,942 INFO  [regionserver60020.replicationSource,1] org.apache.hadoop.hbase.replication.regionserver.ReplicationSinkManager: Current list of sinks is out of date, updating
> 2013-09-20 00:31:04,226 INFO  [ReplicationExecutor-0.replicationSource,1-jdec2hbase0403-4,60020,1379636329634] org.apache.hadoop.hbase.replication.regionserver.ReplicationSinkManager: Current list of sinks is out of date, updating
> {noformat}
> The first log line is from the normal source, the second is from the recovered one. When we
> try to replicate, we call replicationSinkMgr.getReplicationSink(), and if the list of machines
> was refreshed since the last time then we call chooseSinks(), which in turn refreshes the list
> of sinks and resets our lastUpdateToPeers. The next source will notice the change, and will
> call chooseSinks() too. The first source then comes around for another round, sees the list
> was refreshed, and calls chooseSinks() again. This goes on forever until the recovered queue is gone.
> We could have all the sources going to the same cluster share a thread-safe ReplicationSinkManager.
> We could also manage the same cluster separately for each source. Or, even easier, if the list
> we get in chooseSinks() is the same we had before, consider it a noop.
> What do you think [~gabriel.reid]?
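The refresh loop described in the quoted report can be simulated in a few lines. This is a hypothetical, much-simplified model, not the real HBase code: two sources share one "list was refreshed" timestamp, and each source's refresh bumps that shared timestamp, which the other source then notices.

```java
// Illustrative simulation of the refresh ping-pong. The names
// (sharedListTimestamp, lastUpdateToPeers, getReplicationSink) mirror the
// description above but are not the actual HBase implementation.
final class RefreshPingPong {
    static long clock = 1;               // monotonically increasing fake time
    static long sharedListTimestamp = 1; // bumped by ANY source's chooseSinks()

    static final class Source {
        long lastUpdateToPeers = 0;      // this source's view of the last refresh
        int refreshCount = 0;

        void getReplicationSink() {
            if (sharedListTimestamp > lastUpdateToPeers) {
                // chooseSinks(): re-fetches the peer's server list, which
                // bumps the shared timestamp and resets our own marker...
                sharedListTimestamp = ++clock;
                lastUpdateToPeers = sharedListTimestamp;
                refreshCount++;
                // ...so the OTHER source now sees a newer shared timestamp.
            }
        }
    }
}
```

Running two sources in alternation, each one refreshes on every single round, which matches the endless "Current list of sinks is out of date, updating" log lines.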

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
