Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Sat, 5 Dec 2015 00:03:10 +0000 (UTC)
From: "Jerry He (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.12779355.1425467541000.277677.1449273790986@Atlassian.JIRA>
In-Reply-To: <JIRA.12779355.1425467541000@Atlassian.JIRA>
References: <JIRA.12779355.1425467541000@Atlassian.JIRA>
 <JIRA.12779355.1425467541610@arcas>
Subject: [jira] [Commented] (HBASE-13153) Bulk Loaded HFile Replication
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HBASE-13153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042453#comment-15042453 ] 

Jerry He commented on HBASE-13153:
----------------------------------

I have a use case for which this feature would be quite useful.
We have a SQL-on-Hadoop/HBase engine. When inserting into HBase, we sometimes try to optimize by using bulk load.
For example, when running 'INSERT INTO my-hbase-table SELECT col1 FROM table1', we check whether the cardinality is large (say > 20000). If so, we generate HFiles and bulk load them instead of running table puts.
The problem is that replication will not kick in for this new data.
For cross-cluster bulk load, people would probably use an external tool (e.g. DistCp) to move the MR-generated HFiles to the target cluster.
But in this case, it would be difficult to save and transport the HFiles for bulk load to the peer cluster, since they are generated on the fly inside the SQL engine.
So this is a good feature to have.
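
To make the insert-path choice above concrete, here is a minimal sketch, not our engine's actual code. The names (plan_insert, InsertPlan, BULK_LOAD_THRESHOLD) are hypothetical, and treating the cardinality as an estimated row count is an assumption. The only point is that the bulk load branch produces data that WAL-based replication does not see without this feature.

```python
from dataclasses import dataclass

# Hypothetical threshold; the comment above only says "say > 20000".
BULK_LOAD_THRESHOLD = 20000


@dataclass
class InsertPlan:
    strategy: str             # "bulk_load" or "puts"
    seen_by_replication: bool  # True if the write goes through the WAL


def plan_insert(estimated_rows: int,
                threshold: int = BULK_LOAD_THRESHOLD) -> InsertPlan:
    """Pick a write path from an estimated cardinality (assumed to be rows)."""
    if estimated_rows > threshold:
        # HFiles are generated and bulk loaded; this bypasses the WAL,
        # so WAL-based replication does not pick up the new data.
        return InsertPlan("bulk_load", seen_by_replication=False)
    # Regular table puts go through the write path and the WAL.
    return InsertPlan("puts", seen_by_replication=True)


if __name__ == "__main__":
    for rows in (500, 20000, 1_000_000):
        print(rows, plan_insert(rows))
```

Since the choice is made per statement inside the engine, the user never sees the intermediate HFiles, which is why shipping them to the peer cluster by hand is impractical.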

Regarding the network latency and the impact on HBase instances, I think we should add notes, best practices, and warnings to the release notes. They should mention that potentially large files need to be copied over the network by HBase handlers, describe the potential impact on the source and peer clusters, and include recommendations such as increasing the RPC timeout values.

> Bulk Loaded HFile Replication
> -----------------------------
>
>                 Key: HBASE-13153
>                 URL: https://issues.apache.org/jira/browse/HBASE-13153
>             Project: HBase
>          Issue Type: New Feature
>          Components: Replication
>            Reporter: sunhaitao
>            Assignee: Ashish Singhi
>             Fix For: 2.0.0, 1.3.0
>
>         Attachments: HBASE-13153-branch-1-v18.patch, HBASE-13153-v1.patch, HBASE-13153-v10.patch, HBASE-13153-v11.patch, HBASE-13153-v12.patch, HBASE-13153-v13.patch, HBASE-13153-v14.patch, HBASE-13153-v15.patch, HBASE-13153-v16.patch, HBASE-13153-v17.patch, HBASE-13153-v18.patch, HBASE-13153-v2.patch, HBASE-13153-v3.patch, HBASE-13153-v4.patch, HBASE-13153-v5.patch, HBASE-13153-v6.patch, HBASE-13153-v7.patch, HBASE-13153-v8.patch, HBASE-13153-v9.patch, HBASE-13153.patch, HBase Bulk Load Replication-v1-1.pdf, HBase Bulk Load Replication-v2.pdf, HBase Bulk Load Replication-v3.pdf, HBase Bulk Load Replication.pdf, HDFS_HA_Solution.PNG
>
>
> Currently we plan to use the HBase Replication feature to deal with a disaster tolerance scenario. But we have encountered an issue: we use bulkload very frequently, and because bulkload bypasses the write path and does not generate WAL entries, the data will not be replicated to the backup cluster. It's inappropriate to bulkload twice, on both the active cluster and the backup cluster. So I suggest modifying the bulkload feature so that a bulkload is applied to both the active cluster and the backup cluster.

