hbase-issues mailing list archives

From "Lars Hofhansl (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-11963) Synchronize peer cluster replication connection attempts
Date Sun, 14 Sep 2014 05:19:34 GMT

    [ https://issues.apache.org/jira/browse/HBASE-11963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133088#comment-14133088 ]

Lars Hofhansl commented on HBASE-11963:

Also lemme explain what happened:
* We have a ReplicationPeer per slave cluster
* We have a ReplicationSource for every "queue" to replicate. A queue is either the data this
region server wishes to replicate or data it took over from another region server (for example
when that region server went down)
* When we take over a queue from another region server we have *multiple* ReplicationSources
replicating to the *same* set of ReplicationPeers.
* When the slave cluster is down, the ReplicationSources attempt to reset their peers upon
each failed request.
* And hence we now have a race where multiple ReplicationSources attempt to reconnect a peer
simultaneously. That race is what leaked the ZK clients.
* Each of the leaked clients would attempt to reconnect to the slave once/sec until the ZK
timeout (defaulting to 180s).

So this only happens when (a) we have some queues failed over from another region server *and*
(b) a peer is not currently reachable (or there are some other ZK issues), causing the source
to reconnect to its peer.
But if we have this condition it gets nasty pretty quickly.
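
To make the race concrete, here is a minimal sketch of the general idea in the issue title
(serialize reconnect attempts per peer and rate limit them). It is not the attached patch.
PeerSketch, reconnectIfDue, simulate and the interval parameter are hypothetical names, and
the counter only stands in for closing the old ZK client and opening a new one.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a peer shared by several sources; not an HBase class. */
class PeerSketch {
    private final long minIntervalNanos;
    private final Object lock = new Object();
    private final AtomicInteger connectionsOpened = new AtomicInteger();
    private boolean everReconnected = false;
    private long lastReconnectNanos;

    PeerSketch(long minReconnectIntervalMs) {
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minReconnectIntervalMs);
    }

    /** Returns true if this call reconnected, false if another caller did so recently. */
    boolean reconnectIfDue() {
        synchronized (lock) { // one source at a time, so no two callers swap the client at once
            long now = System.nanoTime();
            if (everReconnected && now - lastReconnectNanos < minIntervalNanos) {
                return false; // rate limit: reuse the connection a sibling source just opened
            }
            // Assumption: a real implementation would close the old client here before
            // opening the new one, so that nothing is leaked.
            connectionsOpened.incrementAndGet();
            lastReconnectNanos = now;
            everReconnected = true;
            return true;
        }
    }

    int connectionsOpened() {
        return connectionsOpened.get();
    }

    /** Simulates several sources hitting a failed request against the same peer at once. */
    static int simulate(int sources, long minReconnectIntervalMs) {
        PeerSketch peer = new PeerSketch(minReconnectIntervalMs);
        ExecutorService pool = Executors.newFixedThreadPool(sources);
        CountDownLatch start = new CountDownLatch(1);
        for (int i = 0; i < sources; i++) {
            pool.execute(() -> {
                try {
                    start.await(); // release all sources together to provoke contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                peer.reconnectIfDue();
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for sources", e);
        }
        return peer.connectionsOpened();
    }

    public static void main(String[] args) {
        // Eight sources, one peer, at most one reconnect per minute.
        System.out.println("connections opened: " + simulate(8, 60_000));
    }
}
```

With the lock and the interval check in place, the eight simulated sources open a single
connection between them; without them, each source would replace the client on its own.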

> Synchronize peer cluster replication connection attempts
> --------------------------------------------------------
>                 Key: HBASE-11963
>                 URL: https://issues.apache.org/jira/browse/HBASE-11963
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Andrew Purtell
>            Assignee: Maddineni Sukumar
>             Fix For: 2.0.0, 0.98.7, 0.94.24, 0.99.1
>         Attachments: 11963-0.94.txt, HBASE-11963-0.98.patch, HBASE-11963.patch
> Synchronize peer cluster connection attempts to avoid races and rate limit connections
when multiple replication sources try to connect to the peer cluster. If the peer cluster
is down we can get out of control over time.
