Date: Sat, 5 Dec 2015 00:03:10 +0000 (UTC)
From: "Jerry He (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-13153) Bulk Loaded HFile Replication

[ https://issues.apache.org/jira/browse/HBASE-13153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042453#comment-15042453 ]

Jerry He commented on HBASE-13153:
----------------------------------

I have a use case for which this feature would be quite useful. We have a SQL-on-Hadoop/HBase engine. When inserting into HBase, we try to be smart and sometimes optimize by using bulk load. For example, when doing 'INSERT INTO my-hbase-table SELECT col1 from table1', we check whether the cardinality is large (say > 20000). If so, we generate HFiles and bulk load them instead of running table puts.
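
The decision described above could be sketched roughly as follows. This is only an illustration of the threshold check, not the actual engine code: the function name, the constant name, and the returned labels are hypothetical, and the value 20000 is the "say > 20000" figure from the comment.

```python
# Hypothetical sketch of the write-path choice; all names are illustrative
# assumptions, not part of HBase or of any real SQL engine.
BULK_LOAD_ROW_THRESHOLD = 20000  # the "say > 20000" figure from the comment


def choose_write_path(estimated_rows, threshold=BULK_LOAD_ROW_THRESHOLD):
    """Return "bulk_load" when the estimated row count exceeds the threshold,
    otherwise "puts"."""
    if estimated_rows > threshold:
        # Generate HFiles and bulk load them. This bypasses the WAL, which is
        # why the data is not picked up by replication.
        return "bulk_load"
    # Regular table puts go through the normal write path.
    return "puts"
```

With this sketch, choose_write_path(50000) selects the bulk load path, while a small insert such as choose_write_path(100) stays on regular puts.
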
The problem is that replication will not kick in for this new data. For cross-cluster bulk load, people would probably use an external tool (e.g. DistCp) to move the MR-generated HFiles to the target cluster. But in this case, it would be difficult to save and transport the HFiles to the peer cluster for bulk load, since they are generated on the fly inside the SQL engine. So this is a good feature to have.

Regarding the network latency and the impact on HBase instances, I think we should add notes/best practices/warnings to the release notes. They should mention that potentially large files need to be copied over the network by HBase handlers, describe the potential impact on the source and peer clusters, and include recommendations such as increasing the RPC timeout values.

> Bulk Loaded HFile Replication
> -----------------------------
>
>                 Key: HBASE-13153
>                 URL: https://issues.apache.org/jira/browse/HBASE-13153
>             Project: HBase
>          Issue Type: New Feature
>          Components: Replication
>            Reporter: sunhaitao
>            Assignee: Ashish Singhi
>             Fix For: 2.0.0, 1.3.0
>
>         Attachments: HBASE-13153-branch-1-v18.patch, HBASE-13153-v1.patch, HBASE-13153-v10.patch, HBASE-13153-v11.patch, HBASE-13153-v12.patch, HBASE-13153-v13.patch, HBASE-13153-v14.patch, HBASE-13153-v15.patch, HBASE-13153-v16.patch, HBASE-13153-v17.patch, HBASE-13153-v18.patch, HBASE-13153-v2.patch, HBASE-13153-v3.patch, HBASE-13153-v4.patch, HBASE-13153-v5.patch, HBASE-13153-v6.patch, HBASE-13153-v7.patch, HBASE-13153-v8.patch, HBASE-13153-v9.patch, HBASE-13153.patch, HBase Bulk Load Replication-v1-1.pdf, HBase Bulk Load Replication-v2.pdf, HBase Bulk Load Replication-v3.pdf, HBase Bulk Load Replication.pdf, HDFS_HA_Solution.PNG
>
>
> Currently we plan to use the HBase Replication feature to handle a disaster tolerance scenario. But we have encountered an issue: we use bulkload very frequently, and because bulkload bypasses the write path and does not generate WAL entries, the data will not be replicated to the backup cluster.
> It's inappropriate to bulkload twice, once on the active cluster and once on the backup cluster. So I suggest modifying the bulkload feature so that a bulkload is applied to both the active cluster and the backup cluster.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)