Date: Mon, 2 Jun 2014 05:48:02 +0000 (UTC)
From: "Jerry He (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-8073) HFileOutputFormat support for offline operation


     [ https://issues.apache.org/jira/browse/HBASE-8073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jerry He updated HBASE-8073:
----------------------------
    Attachment: HBASE-8073-trunk-v1.patch

> HFileOutputFormat support for offline operation
> -----------------------------------------------
>
>                 Key: HBASE-8073
>                 URL: https://issues.apache.org/jira/browse/HBASE-8073
>             Project: HBase
>          Issue Type: Sub-task
>          Components: mapreduce
>            Reporter: Nick Dimiduk
>             Fix For: 0.99.0
>
>         Attachments: HBASE-8073-trunk-v0.patch, HBASE-8073-trunk-v1.patch
>
>
> When using HFileOutputFormat to generate HFiles, it inspects the region topology of the target table.
The split points from that table are used to guide the TotalOrderPartitioner. If the target table does not exist, it is first created. This imposes an unnecessary dependence on an online HBase and an existing table.
> If the table exists, it can be used. However, the job can be smarter. For example, if there's far more data going into the HFiles than the table currently contains, the table regions aren't very useful as data split points. Instead, the input data can be sampled to produce split points more meaningful to the dataset. LoadIncrementalHFiles is already capable of handling divergence between HFile boundaries and table regions, so this should not pose any additional burden at load time.
> The proper method of sampling the data likely requires a custom input format and an additional map-reduce job to perform the sampling. See a relevant implementation: https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch4/sampler/ReservoirSamplerInputFormat.java



--
This message was sent by Atlassian JIRA
(v6.2#6252)
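
The sampling idea in the issue description can be illustrated with a minimal, standalone sketch. It is not the attached patch and not the linked ReservoirSamplerInputFormat. It only shows, in plain Java with no Hadoop or HBase dependencies, how a reservoir sample of row keys could be turned into partition boundaries. The class name ReservoirSplitPoints and its methods sample and splitPoints are hypothetical, and row keys are assumed to be Strings compared in natural String order (real HBase row keys are byte arrays compared lexicographically).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only: none of these names exist in HBase or Hadoop.
class ReservoirSplitPoints {

    // Reservoir sampling (Algorithm R): keep a uniform random sample of at
    // most k keys from a stream whose length is not known in advance.
    public static List<String> sample(Iterator<String> keys, int k, Random rng) {
        List<String> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (keys.hasNext()) {
            String key = keys.next();
            seen++;
            if (reservoir.size() < k) {
                // The first k keys fill the reservoir directly.
                reservoir.add(key);
            } else {
                // Later keys replace a random slot with probability k / seen.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, key);
                }
            }
        }
        return reservoir;
    }

    // Sort the sample and pick up to numPartitions - 1 evenly spaced keys
    // to serve as partition boundaries. Duplicate boundaries are skipped.
    public static List<String> splitPoints(List<String> sample, int numPartitions) {
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted); // assumption: natural String order stands in for byte order
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            int idx = (int) ((long) i * sorted.size() / numPartitions);
            if (idx >= sorted.size()) {
                continue;
            }
            String candidate = sorted.get(idx);
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(candidate)) {
                splits.add(candidate);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up input: 10,000 synthetic row keys.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            keys.add(String.format("row-%05d", i));
        }
        List<String> sampled = sample(keys.iterator(), 1000, new Random());
        System.out.println(splitPoints(sampled, 4));
    }
}
```

In an actual job, boundaries like these would presumably be written to the partition file read by TotalOrderPartitioner, in place of the region start keys taken from a live table. That wiring is left out here because it depends on the job setup.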