crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CRUNCH-545) Writing to HFiles starts a job per column family
Date Sun, 19 Jul 2015 14:21:04 GMT

     [ https://issues.apache.org/jira/browse/CRUNCH-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriel Reid updated CRUNCH-545:
--------------------------------
    Attachment: pre.dot.png
                post.dot.png
                CRUNCH-545.patch

Patch to reduce the writing of HFiles to a single job, regardless of which column families
are defined on the output table. Also adds testing of writing multiple column families in
an HFile load.

See pre.dot.png for how writing data for an HTable with 3 column families looked before the
patch, and post.dot.png for how it looks after the patch.

> Writing to HFiles starts a job per column family
> ------------------------------------------------
>
>                 Key: CRUNCH-545
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-545
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: CRUNCH-545.patch, post.dot.png, pre.dot.png
>
>
> When writing to HFiles via {{HFileUtils.writeToHFilesForIncrementalLoad}}, a separate
> MR job is started up per column family defined for the table, regardless of whether or not
> there is any data for each of these column families.
> Each of the column family jobs runs over the full set of Cells, filters for the desired
> column family, and then partitions the data.
> For tables with multiple column families, it would be a lot more efficient to sort/partition
> all of the data together, and then split it out per column family afterwards.
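
The "sort once, then split per column family" idea from the description can be
illustrated with a small in-memory sketch. This is not the Crunch or HBase API and
not the code in CRUNCH-545.patch; the Cell class and the sortOnceThenSplit method
are hypothetical names used only to show the shape of the approach, with plain
strings standing in for row keys, families and qualifiers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SinglePassSplit {

    // Hypothetical stand-in for an HBase Cell: row key, column family, qualifier.
    static final class Cell {
        final String row;
        final String family;
        final String qualifier;

        Cell(String row, String family, String qualifier) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    // Sort the full data set a single time, then route each cell to the
    // bucket for its column family. Families with no data get no bucket.
    static Map<String, List<Cell>> sortOnceThenSplit(List<Cell> cells) {
        List<Cell> sorted = new ArrayList<>(cells);
        sorted.sort(Comparator.comparing((Cell c) -> c.row)
                .thenComparing(c -> c.family)
                .thenComparing(c -> c.qualifier));

        Map<String, List<Cell>> byFamily = new TreeMap<>();
        for (Cell c : sorted) {
            // Appending in sorted order keeps each bucket sorted by row key.
            byFamily.computeIfAbsent(c.family, f -> new ArrayList<>()).add(c);
        }
        return byFamily;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
                new Cell("r2", "cf1", "q"),
                new Cell("r1", "cf2", "q"),
                new Cell("r1", "cf1", "q"));

        Map<String, List<Cell>> out = sortOnceThenSplit(cells);
        for (Map.Entry<String, List<Cell>> e : out.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().size() + " cell(s)");
        }
    }
}
```

In this sketch the data is traversed once for sorting and once for splitting,
independent of how many column families the table defines, whereas the behaviour
described above repeats a full filter-and-partition pass for every family.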



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
