crunch-dev mailing list archives

From "Stephen Durfey (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CRUNCH-588) Modify HFileUtils to flex on affected regions for hfiles, rather than all regions
Date Thu, 04 Feb 2016 20:36:39 GMT

     [ https://issues.apache.org/jira/browse/CRUNCH-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephen Durfey updated CRUNCH-588:
----------------------------------
    Attachment: crunch-588-0.14-SNAPSHOT_v2.patch

It certainly makes sense that the start keys would be retrieved in sorted order, but I cannot
find anything that guarantees they are. So I changed the comparator to the one from Bytes
(I hadn't seen that comparator originally, so thanks for pointing it out).
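
To illustrate why the comparator matters when the retrieval order is not guaranteed, here is
a minimal sketch that sorts region start keys explicitly. It assumes Java 9+ and uses the JDK's
Arrays.compareUnsigned as a stand-in for HBase's Bytes.BYTES_COMPARATOR so that it runs without
HBase on the classpath; both order byte arrays lexicographically with bytes treated as unsigned.
The class and method names (StartKeySort, sortStartKeys) are hypothetical and are not taken
from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class StartKeySort {
    // Stand-in for org.apache.hadoop.hbase.util.Bytes.BYTES_COMPARATOR (assumption:
    // used here only so the sketch has no HBase dependency).
    static final Comparator<byte[]> UNSIGNED_LEXICOGRAPHIC = Arrays::compareUnsigned;

    // Returns a sorted copy of the start keys; the input list is left untouched.
    static List<byte[]> sortStartKeys(List<byte[]> startKeys) {
        List<byte[]> sorted = new ArrayList<>(startKeys);
        sorted.sort(UNSIGNED_LEXICOGRAPHIC);
        return sorted;
    }

    public static void main(String[] args) {
        List<byte[]> unsorted = new ArrayList<>();
        unsorted.add(new byte[] {(byte) 0x80}); // would sort first under signed comparison
        unsorted.add(new byte[] {0x01});
        unsorted.add(new byte[0]);              // empty start key of the first region

        for (byte[] key : sortStartKeys(unsorted)) {
            System.out.println(Arrays.toString(key));
        }
    }
}
```

A signed byte comparison would place 0x80 before 0x01, which would not match the order in
which HBase lays out row keys.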

I can log a JIRA to code against Table and RegionLocator and start working on that. Should
the scope of that be limited to this class, or cover all classes that still use the deprecated
HTable and HTableInterface?

> Modify HFileUtils to flex on affected regions for hfiles, rather than all regions
> ---------------------------------------------------------------------------------
>
>                 Key: CRUNCH-588
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-588
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stephen Durfey
>            Assignee: Josh Wills
>         Attachments: crunch-588-0.14-SNAPSHOT.patch, crunch-588-0.14-SNAPSHOT_v2.patch,
hfileutils_0.8.5.patch
>
>
> When preparing to write HFiles, HFileUtils sets the [number of reducers | https://github.com/apache/crunch/blob/master/crunch-hbase/src/main/java/org/apache/crunch/io/hbase/HFileUtils.java#L422]
equal to the number of regions in the table, and then writes out the start key of each region
to a sequence file for the TotalOrderPartitioner to consume when partitioning data. This can
result in a very large number of reducers that do nothing, because there is no data
to write to HFiles for the regions their partitions belong to.
> My proposal is to modify HFileUtils, with an optional parameter (or a config, that's
up for debate), to determine ahead of time which regions data will be loaded into, set
the number of reducers equal to the number of those affected regions, and write out the
start keys for only those affected regions.
> I have working code to do this on the 0.8.x branch of crunch, as that is what I am currently
on. I can modify it to work on more recent versions, but I wanted to start a discussion around
the viability of this code being contributed back to the community. I am still in the process
of capturing metrics around the impact of the change (and trying to get data large enough
to test this out), but in terms of reducer count I have seen substantial drops in my limited
testing so far. For example, I had a job go from 705 reduce tasks during the write down to
36 reduce tasks.
> I've attached what I have so far as of 0.8.4. I'm going to start working on a version
modified for the latest version of crunch. 
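
The proposal in the description above can be sketched roughly as follows. This is not the code
from the attached patches; it is a minimal, HBase-free illustration that assumes Java 9+, that
the region start keys and the row keys to be written are available in memory as byte arrays, and
that the first region has an empty start key. AffectedRegions and affectedStartKeys are
hypothetical names, and Arrays.compareUnsigned again stands in for Bytes.BYTES_COMPARATOR.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class AffectedRegions {
    // Stand-in for HBase's Bytes.BYTES_COMPARATOR (assumption, to avoid an HBase dependency).
    static final Comparator<byte[]> UNSIGNED_LEXICOGRAPHIC = Arrays::compareUnsigned;

    // Returns, in sorted order, the start keys of only those regions that would
    // receive at least one of the given row keys.
    static List<byte[]> affectedStartKeys(List<byte[]> regionStartKeys, List<byte[]> rowKeys) {
        TreeSet<byte[]> allRegions = new TreeSet<>(UNSIGNED_LEXICOGRAPHIC);
        allRegions.addAll(regionStartKeys);

        TreeSet<byte[]> affected = new TreeSet<>(UNSIGNED_LEXICOGRAPHIC);
        for (byte[] row : rowKeys) {
            // A row belongs to the region with the greatest start key <= the row key.
            byte[] start = allRegions.floor(row);
            if (start != null) {
                affected.add(start);
            }
        }
        return new ArrayList<>(affected);
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<byte[]> regions = List.of(bytes(""), bytes("g"), bytes("n"), bytes("t"));
        List<byte[]> rows = List.of(bytes("apple"), bytes("banana"), bytes("zebra"));

        List<byte[]> affected = affectedStartKeys(regions, rows);
        // The reducer count would be taken from affected.size() instead of regions.size().
        System.out.println("regions=" + regions.size() + " affected=" + affected.size());
    }
}
```

In this sketch the reducer count would follow the number of affected start keys rather than the
total number of regions, which is the effect the description reports.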



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
