hbase-issues mailing list archives

From "Bhupendra Kumar Jain (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13153) enable bulkload to support replication
Date Wed, 02 Sep 2015 13:21:46 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727319#comment-14727319

Bhupendra Kumar Jain commented on HBASE-13153:

Thanks all for the review and nice comments. 
bq. Since cyclic replication topologies are supported today I think we'd need that handled
for the bulk load case too or it will lead to subtle and not so subtle problems for users.
To detect cyclic replication, we will use each HBase cluster's unique ID (the same approach
as WAL replication).
The unique cluster IDs of all source HBase clusters will be persisted in the peer cluster's
ZK under the hfile replication node.

For example, c1->c2->c3->c1 is a cyclic replication topology.
When file f1 is bulk loaded into cluster c1 and then replicated from c1->c2 and c2->c3,
the ZK node data on each cluster looks like this:
||Cluster||hfile node data||
|c1|f1,{NONE}|
|c2|f1,{c1}|
|c3|f1,{c1,c2}|

When c3 tries to replicate f1 to c1, it will detect the cycle and stop processing.
The unique cluster IDs of all sources are passed along with each subsequent replication request.
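
The check described above can be sketched roughly as follows. This is only an illustration of the idea, not the patch itself: the function name forward_hfile and the tuple layout are hypothetical, and the real implementation would keep this data in ZK rather than in memory.

```python
from typing import FrozenSet, Optional, Tuple

# Hypothetical in-memory stand-in for the per-hfile ZK node data:
# (hfile name, set of cluster ids the hfile has already passed through).
HFileEntry = Tuple[str, FrozenSet[str]]


def forward_hfile(hfile: str,
                  source_ids: FrozenSet[str],
                  local_id: str,
                  peer_id: str) -> Optional[HFileEntry]:
    """Return the entry to persist on the peer, or None if a cycle is detected.

    source_ids: cluster ids recorded for this hfile on the local cluster.
    local_id:   unique id of the cluster doing the replication.
    peer_id:    unique id of the cluster we are about to replicate to.
    """
    # Every cluster the hfile has visited, including the local one.
    visited = source_ids | {local_id}
    if peer_id in visited:
        # The peer has already loaded this hfile: stop here.
        return None
    # The peer records all sources seen so far, as in the table above.
    return (hfile, visited)


# Walk through the c1->c2->c3->c1 example.
on_c2 = forward_hfile("f1", frozenset(), "c1", "c2")   # c1 holds f1,{NONE}
on_c3 = forward_hfile("f1", on_c2[1], "c2", "c3")      # c2 holds f1,{c1}
back_to_c1 = forward_hfile("f1", on_c3[1], "c3", "c1")  # c3 holds f1,{c1,c2}
```

Here back_to_c1 is None because c1 already appears in the source list carried with the request, which is the point at which c3 stops processing.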

bq. A crazy idea: rather than have bulk load tooling produce only HFiles for replication,
why not HFiles for the local cluster and ready made WALs to queue up for replication? Of course
that's going to have some drawbacks too but I think fewer.
We considered similar ideas initially, but did not take this approach because it loses the
benefit of bulk load. If we simulate bulk load hfile replication as WAL entries, the data
effectively becomes many Puts in the peer cluster rather than a bulk load. With our approach,
the hfile is copied and loaded into the peer cluster in the same way as the Complete Bulk
Load flow, so it keeps the benefits of the bulk load mechanism.

bq. Could sequence Id be used so that the HFiles don't need to be written again ?
In our view, the sequence ID cannot be used to detect the cycle, because sequence IDs for an
hfile differ across clusters and give no hint of the source cluster.

bq. Few things to consider, ensure that if there is block encoding then the encoding scheme
is same in both the tables. These type of conditions may come in the initial pre checks that
we may need to add.
This scenario is similar to changing the encoding in a running HBase cluster: some hfiles
will use encoding X and others encoding Y. Each hfile is aware of its own encoding
type. As far as I know, this is already handled as part of the hfile read path, so replicating
hfiles should not cause any issue. Correct me if I am missing anything.

> enable bulkload to support replication
> --------------------------------------
>                 Key: HBASE-13153
>                 URL: https://issues.apache.org/jira/browse/HBASE-13153
>             Project: HBase
>          Issue Type: New Feature
>          Components: Replication
>            Reporter: sunhaitao
>            Assignee: Ashish Singhi
>             Fix For: 2.0.0
>         Attachments: HBase Bulk Load Replication.pdf
> Currently we plan to use the HBase Replication feature to handle disaster tolerance scenarios.
But we have run into an issue: we use bulkload very frequently, and because bulkload bypasses
the write path and does not generate WAL, the data will not be replicated to the backup cluster.
It is inappropriate to bulkload twice, once on the active cluster and once on the backup cluster.
So I suggest modifying the bulkload feature to enable bulkload to both the active cluster and
the backup cluster.

This message was sent by Atlassian JIRA
