Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@crunch.apache.org
Date: Sun, 19 Jul 2015 14:21:04 +0000 (UTC)
From: "Gabriel Reid (JIRA)" <jira@apache.org>
To: crunch-dev@incubator.apache.org
Message-ID: <JIRA.12846158.1437315146000.229704.1437315664538@Atlassian.JIRA>
In-Reply-To: <JIRA.12846158.1437315146000@Atlassian.JIRA>
References: <JIRA.12846158.1437315146000@Atlassian.JIRA>
 <JIRA.12846158.1437315146274@arcas>
Subject: [jira] [Updated] (CRUNCH-545) Writing to HFiles starts a job per
 column family
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/CRUNCH-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriel Reid updated CRUNCH-545:
--------------------------------
    Attachment: pre.dot.png
                post.dot.png
                CRUNCH-545.patch

Patch to reduce the writing of HFiles to a single job, regardless of which column families are defined on the output table. The patch also adds a test that writes multiple column families in a single HFile load.

See pre.dot.png for how writing data for an HTable with 3 column families looked before the patch, and post.dot.png for how it looks after the patch.

> Writing to HFiles starts a job per column family
> ------------------------------------------------
>
>                 Key: CRUNCH-545
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-545
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: CRUNCH-545.patch, post.dot.png, pre.dot.png
>
>
> When writing to HFiles via {{HFileUtils.writeToHFilesForIncrementalLoad}}, a separate MR job is started for each column family defined on the table, regardless of whether there is any data for that column family.
> Each of these per-column-family jobs runs over the full set of Cells, filters for its own column family, and then partitions the data.
> For tables with multiple column families, it would be much more efficient to sort/partition all of the data together once, and then split it out per column family afterwards.
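
The "sort/partition once, then split per column family" idea in the description can be illustrated with a minimal, Hadoop-free sketch. This is not the code in CRUNCH-545.patch and does not use the Crunch or HBase APIs: SimpleCell and SplitByFamilySketch are hypothetical names invented for illustration, and a plain in-memory sort stands in for the MR sort/partition step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an HBase Cell; not the real HBase or Crunch type.
final class SimpleCell {
    final String row;
    final String family;
    final String qualifier;

    SimpleCell(String row, String family, String qualifier) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
    }

    @Override
    public String toString() {
        return row + "/" + family + ":" + qualifier;
    }
}

final class SplitByFamilySketch {

    // Sort all cells once by (row, family, qualifier), then route each cell
    // to its column family's output in a single pass over the sorted data.
    // Families without any cells never get an output list at all.
    static Map<String, List<SimpleCell>> sortThenSplit(List<SimpleCell> cells) {
        List<SimpleCell> sorted = new ArrayList<>(cells);
        sorted.sort(Comparator
                .comparing((SimpleCell c) -> c.row)
                .thenComparing(c -> c.family)
                .thenComparing(c -> c.qualifier));

        Map<String, List<SimpleCell>> byFamily = new LinkedHashMap<>();
        for (SimpleCell c : sorted) {
            byFamily.computeIfAbsent(c.family, f -> new ArrayList<>()).add(c);
        }
        return byFamily;
    }

    public static void main(String[] args) {
        List<SimpleCell> cells = Arrays.asList(
                new SimpleCell("r2", "a", "q"),
                new SimpleCell("r1", "b", "q"),
                new SimpleCell("r1", "a", "q"));
        // Each family's list stays in row order because the split
        // preserves the order produced by the single sort.
        System.out.println(sortThenSplit(cells));
    }
}
```

In this sketch the data is sorted exactly once regardless of how many column families exist, whereas the behaviour described above would correspond to running a full filter-and-sort pass per family.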

