Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@crunch.apache.org
Date: Fri, 13 Nov 2015 23:31:11 +0000 (UTC)
From: "Preston Koprivica (JIRA)" <jira@apache.org>
To: crunch-dev@incubator.apache.org
Message-ID: <JIRA.12832536.1432569380000.67797.1447457471171@Atlassian.JIRA>
In-Reply-To: <JIRA.12832536.1432569380000@Atlassian.JIRA>
References: <JIRA.12832536.1432569380000@Atlassian.JIRA>
 <JIRA.12832536.1432569380673@arcas>
Subject: [jira] [Commented] (CRUNCH-527) Improve distribution of keys when
 using default (hash-based) partitioning
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CRUNCH-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004910#comment-15004910 ] 

Preston Koprivica commented on CRUNCH-527:
------------------------------------------

For full transparency, this is a pretty old Target that existed prior to crunch having good support for HFiles - which requires a TotalOrderPartitioner.  We've got a TODO to get that converted, but continue to rely on the custom target - simply because it's always worked.  We've worked around this issue by configuring the partitioner as a grouping option to override the default.  

TBH, there's always been a slightly unclear division of labor for targets that require specific sort/partitioning guarantees.  The sort/partitioning happens before the target is even applied, but is explicitly required by the target.  The strong guarantee required by the target made me think to apply the partitioner within the target.  On the other hand, I could just as easily justify assuming the guarantees have been met by the time the target is encountered, so apply it to the grouping options.... :) 

In any case, a simple test could be to 1) create a custom Target that specifies a custom partitioner (logic is probably irrelevant), 2) apply it to a pipeline with a groupByKey() and 3) inspect the configuration for the value of the partitioner class on the other side.   

This is simple in that it's only inspecting configuration, but it leaks mapreduce internals pretty heavily.  A more complete test could actually apply a partitioner to a real dataset and inspect the output, but that's probably a bit more involved.

> Improve distribution of keys when using default (hash-based) partitioning
> -------------------------------------------------------------------------
>
>                 Key: CRUNCH-527
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-527
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>             Fix For: 0.13.0
>
>         Attachments: CRUNCH-527.patch
>
>
> The default partitioner used for MR-based pipelines bases itself on the hash code of keys modulo the number of partitions, along the lines of 
> {code}int partition = key.hashCode() % numPartitions{code}
> This approach dependent on the _lower bits_ of the hash code being uniformly distributed. If the lower bits of the key hash code is not uniformly distributed, the key space will not be uniformly distributed over the partitions.
> It can be surprisingly easy to get a very poor distribution. For example, if the keys are integer values and are all divisible by 2, then only half of the partitions will receive data (as the hash code of an integer is the integer value itself).
> This can even be a problem in situations where you would really not expect it. For example, taking the byte-array representation of longs for each timestamp of each second over a period of 24 hours (at millisecond granularity) and partitioning it over 50 partitions results in 34 of the 50 partitions not getting any data at all.
> The easiest way to resolve this is to have a custom HashPartitioner that applies a supplementary hash function to the return value of the key's hashCode method. This same approach is taken in java.util.HashMap for the same reason.
> Note that this same approach was proposed in MAPREDUCE-4827, but wasn't committed (mostly) because of backwards compatibility issues (some people may have counted on certain records showing up in a given output file). Seeing as Crunch is a higher abstraction above MR, I assume that we don't need to worry about the backwards compatibility issue as much, but there may be other opinions on this.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)