hbase-issues mailing list archives

From "cuijianwei (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-12770) Don't transfer all the queued hlogs of a dead server to the same alive server
Date Fri, 09 Jan 2015 06:08:35 GMT

    [ https://issues.apache.org/jira/browse/HBASE-12770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270607#comment-14270607 ]

cuijianwei commented on HBASE-12770:

It seems the client side currently has no methods for manipulating replication queues. To implement
a queue-reassign command, we might need to add methods to ReplicationQueuesClient that move a
queue between region servers from the client side, and each rs might need to watch its rsZnode so
that queue remove/add events tell it when to terminate or start a ReplicationSource.
Alternatively, the client side need not move the queue directly:
1. The client creates a 'stop-queueId' node under the rsZnode of the source rs.
2. The source rs, which watches its rsZnode, sees the 'stop-queueId' create event. It then
terminates the ReplicationSource, creates a 'transfer-queueId' node under the rsZnode
of the target rs, and deletes the 'stop-queueId' node.
3. On seeing the 'transfer-queueId' create event, the target rs transfers the queue
to its own rsZnode and deletes 'transfer-queueId'.
I am not sure which approach fits the current replication implementation better. What
do you think? [~apurtell] [~jdcryans]
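
To make the second approach concrete, here is a minimal in-memory sketch of the three steps. It does not use ZooKeeper or any HBase API: `FakeZk` and `RegionServer` are hypothetical stand-ins, watches are modelled as synchronous callbacks, and carrying the target rs name in the 'stop-queueId' node (and the source rs name in the 'transfer-queueId' node) is an assumption, since the steps above do not say how each side learns the other's identity.

```python
from collections import defaultdict


class FakeZk:
    """Hypothetical in-memory stand-in for the znode tree: rs name -> {child: payload}."""

    def __init__(self):
        self.children = defaultdict(dict)
        self.watchers = {}

    def register(self, rs_name, callback):
        # Models an rs watching the children of its own rsZnode.
        self.watchers[rs_name] = callback

    def create(self, rs_name, child, payload=None):
        self.children[rs_name][child] = payload
        if rs_name in self.watchers:
            self.watchers[rs_name](child, payload)

    def delete(self, rs_name, child):
        return self.children[rs_name].pop(child)


class RegionServer:
    """Hypothetical region server that reacts to marker nodes under its rsZnode."""

    def __init__(self, name, zk):
        self.name = name
        self.zk = zk
        self.sources = set()  # queue ids that have a running ReplicationSource
        zk.register(name, self.on_child_created)

    def add_queue(self, queue_id, hlogs):
        self.zk.create(self.name, queue_id, list(hlogs))
        self.sources.add(queue_id)

    def on_child_created(self, child, payload):
        if child.startswith("stop-"):
            # Step 2: stop the source, notify the target, remove the marker.
            queue_id = child[len("stop-"):]
            target = payload  # assumption: the marker carries the target rs name
            self.sources.discard(queue_id)
            self.zk.create(target, "transfer-" + queue_id, self.name)
            self.zk.delete(self.name, child)
        elif child.startswith("transfer-"):
            # Step 3: move the queue under our own rsZnode, remove the marker.
            queue_id = child[len("transfer-"):]
            source = payload  # assumption: the marker carries the source rs name
            hlogs = self.zk.delete(source, queue_id)
            self.zk.create(self.name, queue_id, hlogs)
            self.sources.add(queue_id)
            self.zk.delete(self.name, child)
        # Any other child is an ordinary queue node and needs no reaction here.


# Step 1: the client only creates the 'stop-queueId' marker under the source rs.
zk = FakeZk()
rs1 = RegionServer("rs1", zk)
rs2 = RegionServer("rs2", zk)
rs1.add_queue("peer1", ["hlog.1", "hlog.2"])
zk.create("rs1", "stop-peer1", "rs2")
```

In this sketch the client never touches the queue node itself; it only writes the marker, and the two region servers complete the move. A real implementation would also have to handle a server dying between steps 2 and 3, which the sketch ignores.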

> Don't transfer all the queued hlogs of a dead server to the same alive server
> -----------------------------------------------------------------------------
>                 Key: HBASE-12770
>                 URL: https://issues.apache.org/jira/browse/HBASE-12770
>             Project: HBase
>          Issue Type: Improvement
>          Components: Replication
>            Reporter: cuijianwei
>            Priority: Minor
>         Attachments: HBASE-12770-trunk.patch
> When a region server goes down (or the cluster restarts), all of its hlog queues are taken over
> by a single alive region server. In a shared cluster, we might create several peers replicating
> data to different peer clusters. Many hlogs can be queued for these peers for
> several reasons: some peers might be disabled, errors from a peer cluster might
> prevent replication, or the replication sources might fail to read some hlogs because of
> an hdfs problem. If the server then goes down or is restarted, another alive server takes over all
> the replication jobs of the dead server. This can put heavy pressure on the resources (network/disk
> read) of the alive server, and the queued hlogs are not replicated fast enough. If
> that alive server also goes down, all of its replication jobs, including those it took over from other
> dead servers, are again transferred as a whole to another alive server. One server can thus
> accumulate a large number of queued hlogs (in our shared cluster, we have seen a single server
> with thousands of queued hlogs for replication). As an alternative, would it be reasonable for an alive server
> to transfer only one peer's hlogs from the dead server at a time? Other alive region servers
> would then have the opportunity to transfer the hlogs of the remaining peers. This might also help the queued
> hlogs be processed faster. Any discussion is welcome.
