hbase-issues mailing list archives

From "cuijianwei (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-12769) Replication fails to delete all corresponding zk nodes when peer is removed
Date Mon, 05 Jan 2015 08:13:34 GMT

    [ https://issues.apache.org/jira/browse/HBASE-12769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264324#comment-14264324 ]

cuijianwei commented on HBASE-12769:

I tried to fix the problem following the steps above. It seems that many places in the current
implementation would need to change if we add a "REMOVING" peer state: for example, we must
prevent enabling or disabling a peer in the "REMOVING" state, and the PeersWatcher must be changed
to recognize this state. We would have to be careful to keep every such place correct after adding
the "REMOVING" state. Maybe we can try another way to solve the problem:
1. The logic of removing a peer does not change; it remains the region server's responsibility to
delete the hlog queues of the removed peer;
2. When adding a new peer, we first check every replication rs znode and throw an exception
if there are uncleaned queues belonging to the peer; this prevents adding a new peer before the
old queues are cleaned;
3. Teach HBCK to discover the uncleaned queues of a removed peer and fix them.
What do you think about these steps? [~apurtell]
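
To make step 2 concrete, here is a minimal sketch of the proposed pre-check. It is not HBase code: the znode tree is modelled as a plain dict, and the names `find_uncleaned_queues`, `check_can_add_peer` and `UncleanedQueuesError` are hypothetical. It also assumes that a queue znode is named either `<peerId>` or `<peerId>-<deadServer>...` for a queue taken over from a failed server, and that peer ids contain no `-`.

```python
class UncleanedQueuesError(Exception):
    """Raised when queues of a previously removed peer are still present."""


def queue_peer_id(queue_name):
    # Assumption: the peer id is everything before the first "-" in the queue name.
    return queue_name.split("-", 1)[0]


def find_uncleaned_queues(rs_queues, peer_id):
    """Return sorted (region_server, queue_name) pairs that belong to peer_id.

    rs_queues is a hypothetical stand-in for the replication rs znodes:
    region server name -> list of queue znode names under its rsZNode.
    """
    return sorted(
        (rs, queue)
        for rs, queues in rs_queues.items()
        for queue in queues
        if queue_peer_id(queue) == peer_id
    )


def check_can_add_peer(rs_queues, peer_id):
    """Refuse to add a peer while old queues for the same id still exist."""
    leftovers = find_uncleaned_queues(rs_queues, peer_id)
    if leftovers:
        raise UncleanedQueuesError(
            "peer %s still has uncleaned queues: %s" % (peer_id, leftovers)
        )
```

The same scan could serve step 3 as well: an HBCK-style check would list the pairs returned by `find_uncleaned_queues` for every peer id that no longer exists and then delete them.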

> Replication fails to delete all corresponding zk nodes when peer is removed
> ---------------------------------------------------------------------------
>                 Key: HBASE-12769
>                 URL: https://issues.apache.org/jira/browse/HBASE-12769
>             Project: HBase
>          Issue Type: Improvement
>          Components: Replication
>    Affects Versions: 0.99.2
>            Reporter: cuijianwei
>            Priority: Minor
> When removing a peer, the client side deletes the peerId under the peersZNode node; then
> alive region servers are notified and delete the corresponding hlog queues under their rsZNode
> of replication. However, if there are failed servers whose hlog queues have not been transferred
> by alive servers (this is likely to happen if "replication.sleep.before.failover" is set to a
> large value and lots of region servers restart), these hlog queues won't be deleted after the
> peer is removed. I think remove_peer should guarantee that all corresponding zk nodes have been
> removed after it completes; otherwise, if we create a new peer with the same peerId as the
> removed one, unexpected data might be replicated.
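
The scenario in the description can be reproduced with a small toy model. This is only an illustration under simplifying assumptions, not HBase code: the peersZNode is a set of peer ids, each rsZNode is a list of queue names equal to the peer id, and `remove_peer` and `leftover_queues` are hypothetical helpers.

```python
def remove_peer(peers, rs_queues, alive_servers, peer_id):
    """Toy model of peer removal as described in the issue.

    The client deletes the peer id; only alive region servers react to the
    notification and delete the peer's queue under their own rsZNode.
    """
    peers.discard(peer_id)
    for rs in alive_servers:
        rs_queues[rs] = [q for q in rs_queues.get(rs, []) if q != peer_id]


def leftover_queues(rs_queues, peer_id):
    """Queues of peer_id that survived the removal, keyed by region server."""
    return {rs: qs for rs, qs in rs_queues.items() if peer_id in qs}


peers = {"1"}
rs_queues = {"rs-alive": ["1"], "rs-failed": ["1"]}

# "rs-failed" has died and its queues have not been transferred yet.
remove_peer(peers, rs_queues, alive_servers=["rs-alive"], peer_id="1")
orphans = leftover_queues(rs_queues, "1")
```

After the call, `orphans` still contains the queue under `rs-failed`, which is the state that a newly created peer with id `1` would inherit.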

This message was sent by Atlassian JIRA
