crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CRUNCH-545) Writing to HFiles starts a job per column family
Date Sun, 19 Jul 2015 14:13:04 GMT
Gabriel Reid created CRUNCH-545:
-----------------------------------

             Summary: Writing to HFiles starts a job per column family
                 Key: CRUNCH-545
                 URL: https://issues.apache.org/jira/browse/CRUNCH-545
             Project: Crunch
          Issue Type: Improvement
            Reporter: Gabriel Reid
            Assignee: Gabriel Reid


When writing to HFiles via {{HFileUtils.writeToHFilesForIncrementalLoad}}, a separate MR job
is started for each column family defined on the table, whether or not there is any data
for that column family.

Each per-column-family job runs over the full set of Cells, keeps only the Cells for its own
column family, and then partitions that data.

For tables with multiple column families, it would be much more efficient to sort/partition
all of the data together once, rather than once per column family, and then split it out per
column family afterwards.
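
To make the difference concrete, here is a minimal in-memory sketch in plain Java. It is not Crunch or HBase code: {{SimpleCell}}, {{perFamilyPasses}} and {{sortOnceThenSplit}} are hypothetical names made up for illustration, and the ordering used (row, then qualifier) is a simplification of real Cell ordering. It only contrasts one filter-and-sort pass per declared column family with a single sort followed by a split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SinglePassFamilySplit {

    // Hypothetical stand-in for an HBase Cell (assumption: only these three fields matter here).
    public static final class SimpleCell {
        public final String row;
        public final String family;
        public final String qualifier;

        public SimpleCell(String row, String family, String qualifier) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    // Simplified ordering: by row, then by qualifier.
    static final Comparator<SimpleCell> ORDER = (a, b) -> {
        int byRow = a.row.compareTo(b.row);
        return byRow != 0 ? byRow : a.qualifier.compareTo(b.qualifier);
    };

    // Behaviour described in the issue: one full scan and sort per declared family,
    // even for families that turn out to have no data.
    public static Map<String, List<SimpleCell>> perFamilyPasses(
            List<SimpleCell> cells, List<String> declaredFamilies) {
        Map<String, List<SimpleCell>> out = new LinkedHashMap<>();
        for (String family : declaredFamilies) {
            List<SimpleCell> filtered = new ArrayList<>();
            for (SimpleCell cell : cells) {          // full scan of the input, per family
                if (cell.family.equals(family)) {
                    filtered.add(cell);
                }
            }
            filtered.sort(ORDER);
            out.put(family, filtered);
        }
        return out;
    }

    // Proposed direction: sort everything once, then split the sorted stream by family.
    // Families without data never produce an entry.
    public static Map<String, List<SimpleCell>> sortOnceThenSplit(List<SimpleCell> cells) {
        List<SimpleCell> sorted = new ArrayList<>(cells);
        sorted.sort(ORDER);                          // single sort over all cells
        Map<String, List<SimpleCell>> out = new LinkedHashMap<>();
        for (SimpleCell cell : sorted) {
            out.computeIfAbsent(cell.family, f -> new ArrayList<>()).add(cell);
        }
        return out;
    }

    public static void main(String[] args) {
        List<SimpleCell> cells = Arrays.asList(
                new SimpleCell("r2", "a", "q1"),
                new SimpleCell("r1", "b", "q1"),
                new SimpleCell("r1", "a", "q1"));
        // Family "c" is declared on the table but has no data.
        System.out.println(perFamilyPasses(cells, Arrays.asList("a", "b", "c")).keySet());
        System.out.println(sortOnceThenSplit(cells).keySet());
    }
}
```

In this sketch the per-family variant scans the input once per declared family, while the single-pass variant reads and sorts it once; each per-family list still comes out in sorted order because splitting preserves the order of the sorted input.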



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
