hbase-issues mailing list archives

From "Jerry He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13153) Bulk Loaded HFile Replication
Date Sat, 05 Dec 2015 00:03:10 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042453#comment-15042453 ]

Jerry He commented on HBASE-13153:

I have a use case for which this feature would be quite useful.
We have a SQL-on-Hadoop/HBase engine. When inserting into HBase, we try to be smart and
sometimes optimize by using bulk load.
For example, when running 'INSERT INTO my-hbase-table SELECT col1 FROM table1', we check
whether the cardinality is large (say > 20000). If it is, we generate HFiles and bulk
load them instead of issuing table puts.
The problem is that replication does not kick in for this new data.
For cross-cluster bulk load, people would probably use an external tool (e.g. DistCp) to move
the MR-generated HFiles to the target cluster.
But in this case, it would be difficult to save and transport the HFiles to the peer cluster
for bulk load, since they are generated on the fly inside the SQL engine.
So this is a good feature to have.
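
To make the decision above concrete, here is a minimal sketch of the write-path choice and its consequence for replication. Only the 20000 threshold comes from the comment; the function and constant names are hypothetical and do not correspond to any HBase or SQL engine API.

```python
# Illustrative only: names are hypothetical, not part of HBase or any SQL engine.
BULK_LOAD_THRESHOLD = 20000  # "say > 20000" in the comment above


def choose_write_path(estimated_cardinality: int,
                      threshold: int = BULK_LOAD_THRESHOLD) -> str:
    """Pick "bulk_load" for large inserts and "puts" otherwise."""
    return "bulk_load" if estimated_cardinality > threshold else "puts"


def reaches_peer_via_wal_replication(write_path: str) -> bool:
    """Puts are written to the WAL, which replication ships to the peer.

    Bulk-loaded HFiles bypass the write path, so WAL-based replication
    never sees them.
    """
    return write_path == "puts"


if __name__ == "__main__":
    for rows in (500, 20000, 1_000_000):
        path = choose_write_path(rows)
        print(rows, path, reaches_peer_via_wal_replication(path))
```

The second function is the gap this issue addresses: once the engine switches to bulk load, the same INSERT no longer reaches the peer cluster.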

Regarding the network latency and the impact on HBase instances, I think we should add notes, best
practices, and warnings to the release notes. They should mention that potentially large files need
to be copied over the network by HBase handlers, with a potential impact on both the source and peer
clusters, and include recommendations such as increasing the RPC timeout values.
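
As a rough illustration of why the timeout recommendation matters, the sketch below estimates how long copying one HFile takes and compares it with an RPC timeout. The file size, the bandwidth, and the 60000 ms timeout are assumed example values rather than measurements, and the model ignores disk I/O, contention, and parallel copies.

```python
# Back-of-the-envelope estimate; all inputs are assumed example values.

def transfer_seconds(file_bytes: int, bandwidth_bits_per_sec: float) -> float:
    """Time to move file_bytes over a link with the given usable bandwidth."""
    return file_bytes * 8 / bandwidth_bits_per_sec


def exceeds_timeout(file_bytes: int, bandwidth_bits_per_sec: float,
                    rpc_timeout_ms: int) -> bool:
    """True if the copy alone would take longer than the RPC timeout."""
    return transfer_seconds(file_bytes, bandwidth_bits_per_sec) * 1000 > rpc_timeout_ms


if __name__ == "__main__":
    ten_gib = 10 * 1024**3   # assumed HFile size
    one_gbit = 1e9           # assumed usable bandwidth, bits per second
    timeout_ms = 60000       # assumed RPC timeout
    print(round(transfer_seconds(ten_gib, one_gbit), 1), "seconds")
    print("exceeds timeout:", exceeds_timeout(ten_gib, one_gbit, timeout_ms))
```

With these assumed numbers the copy takes roughly 86 seconds, longer than a 60-second timeout, which is the kind of case the release notes would need to warn about.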

> Bulk Loaded HFile Replication
> -----------------------------
>                 Key: HBASE-13153
>                 URL: https://issues.apache.org/jira/browse/HBASE-13153
>             Project: HBase
>          Issue Type: New Feature
>          Components: Replication
>            Reporter: sunhaitao
>            Assignee: Ashish Singhi
>             Fix For: 2.0.0, 1.3.0
>         Attachments: HBASE-13153-branch-1-v18.patch, HBASE-13153-v1.patch, HBASE-13153-v10.patch,
HBASE-13153-v11.patch, HBASE-13153-v12.patch, HBASE-13153-v13.patch, HBASE-13153-v14.patch,
HBASE-13153-v15.patch, HBASE-13153-v16.patch, HBASE-13153-v17.patch, HBASE-13153-v18.patch,
HBASE-13153-v2.patch, HBASE-13153-v3.patch, HBASE-13153-v4.patch, HBASE-13153-v5.patch, HBASE-13153-v6.patch,
HBASE-13153-v7.patch, HBASE-13153-v8.patch, HBASE-13153-v9.patch, HBASE-13153.patch, HBase
Bulk Load Replication-v1-1.pdf, HBase Bulk Load Replication-v2.pdf, HBase Bulk Load Replication-v3.pdf,
HBase Bulk Load Replication.pdf, HDFS_HA_Solution.PNG
> Currently we plan to use the HBase Replication feature to handle a disaster tolerance scenario. But
we have encountered an issue: we use bulkload very frequently, and because bulkload bypasses the write
path and does not generate WAL entries, the data is not replicated to the backup cluster. It's
inappropriate to bulkload twice, on both the active cluster and the backup cluster. So I suggest
modifying the bulkload feature to enable bulkload to both the active cluster and the backup cluster.

This message was sent by Atlassian JIRA
