Date: Sun, 19 Jul 2015 14:21:04 +0000 (UTC)
From: "Gabriel Reid (JIRA)"
To: crunch-dev@incubator.apache.org
Reply-To: dev@crunch.apache.org
Subject: [jira] [Updated] (CRUNCH-545) Writing to HFiles starts a job per column family

     [ https://issues.apache.org/jira/browse/CRUNCH-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriel Reid updated CRUNCH-545:
--------------------------------
    Attachment: pre.dot.png
                post.dot.png
                CRUNCH-545.patch

Patch to reduce the writing of HFiles to a single job, regardless of which column families are defined on the output table. Also adds testing of writing multiple column families in an HFile load.
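
For readers without the patch at hand, the difference between the two strategies can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in CRUNCH-545.patch: SimpleCell is a hypothetical stand-in for an HBase Cell, String ordering replaces HBase's byte-wise key ordering, and in-memory lists replace MR jobs. The first method makes one full pass over the input per column family (even for a family with no data); the second sorts everything once and splits the result per family afterwards.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HFileSplitSketch {

    // Hypothetical, simplified stand-in for an HBase Cell.
    static final class SimpleCell {
        final String row;
        final String family;
        final String qualifier;

        SimpleCell(String row, String family, String qualifier) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
        }

        @Override
        public String toString() {
            return row + "/" + family + ":" + qualifier;
        }
    }

    // Simplified ordering: row, then family, then qualifier (Strings, not bytes).
    static final Comparator<SimpleCell> ORDER =
            Comparator.comparing((SimpleCell c) -> c.row)
                    .thenComparing(c -> c.family)
                    .thenComparing(c -> c.qualifier);

    // "Before": one full scan and one sort per declared family,
    // whether or not the family has any cells.
    static Map<String, List<SimpleCell>> perFamilyPasses(
            List<SimpleCell> cells, List<String> families) {
        Map<String, List<SimpleCell>> out = new TreeMap<>();
        for (String family : families) {
            List<SimpleCell> filtered = new ArrayList<>();
            for (SimpleCell c : cells) {
                if (c.family.equals(family)) {
                    filtered.add(c);
                }
            }
            filtered.sort(ORDER);
            out.put(family, filtered);
        }
        return out;
    }

    // "After": a single sort over all cells, then a split per family.
    // Appending in sorted order keeps each per-family list sorted.
    static Map<String, List<SimpleCell>> sortOnceThenSplit(List<SimpleCell> cells) {
        List<SimpleCell> sorted = new ArrayList<>(cells);
        sorted.sort(ORDER);
        Map<String, List<SimpleCell>> out = new TreeMap<>();
        for (SimpleCell c : sorted) {
            out.computeIfAbsent(c.family, f -> new ArrayList<>()).add(c);
        }
        return out;
    }

    public static void main(String[] args) {
        List<SimpleCell> cells = new ArrayList<>();
        cells.add(new SimpleCell("r2", "a", "q1"));
        cells.add(new SimpleCell("r1", "b", "q1"));
        cells.add(new SimpleCell("r1", "a", "q2"));

        List<String> families = new ArrayList<>();
        families.add("a");
        families.add("b");
        families.add("c"); // declared on the table, but has no data

        System.out.println("per-family passes: " + perFamilyPasses(cells, families));
        System.out.println("sort once, split:  " + sortOnceThenSplit(cells));
    }
}
```

In this sketch the per-family variant scans the input once for every declared family, while the single-sort variant touches each cell once during the split and produces no output for a family without data.
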
See pre.dot.png for how writing data for an HTable with 3 column families looked before the patch, and post.dot.png for how it looks after the patch.

> Writing to HFiles starts a job per column family
> ------------------------------------------------
>
>                 Key: CRUNCH-545
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-545
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: CRUNCH-545.patch, post.dot.png, pre.dot.png
>
>
> When writing to HFiles via {{HFileUtils.writeToHFilesForIncrementalLoad}}, a separate MR job is started up per column family defined for the table, regardless of whether or not there is any data for each of these column families.
> Each of the column family jobs runs over the full set of Cells, filters for the desired column family, and then partitions the data.
> For tables with multiple column families, it would be a lot more efficient to sort/partition all of the data together, and then split it out per column family afterwards.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)