crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Josh Wills (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-527) Improve distribution of keys when using default (hash-based) partitioning
Date Fri, 13 Nov 2015 22:24:11 GMT


Josh Wills commented on CRUNCH-527:

I think mostly an unexpected side-effect, although I'm surprised that the Target-based configuration
doesn't trump our default partitioner in the case that it already exists. I hadn't expected
folks to configure the partitioner based on the Target, but I don't doubt there's a good reason
to do that under certain circumstances. Is it something that's easy to convert to a test case
so we can dive into it and make sure the user-configured partitioner always wins, no matter
where it's configured?

> Improve distribution of keys when using default (hash-based) partitioning
> -------------------------------------------------------------------------
>                 Key: CRUNCH-527
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>             Fix For: 0.13.0
>         Attachments: CRUNCH-527.patch
> The default partitioner used for MR-based pipelines bases itself on the hash code of
keys modulo the number of partitions, along the lines of 
> {code}int partition = key.hashCode() % numPartitions{code}
> This approach dependent on the _lower bits_ of the hash code being uniformly distributed.
If the lower bits of the key hash code is not uniformly distributed, the key space will not
be uniformly distributed over the partitions.
> It can be surprisingly easy to get a very poor distribution. For example, if the keys
are integer values and are all divisible by 2, then only half of the partitions will receive
data (as the hash code of an integer is the integer value itself).
> This can even be a problem in situations where you would really not expect it. For example,
taking the byte-array representation of longs for each timestamp of each second over a period
of 24 hours (at millisecond granularity) and partitioning it over 50 partitions results in
34 of the 50 partitions not getting any data at all.
> The easiest way to resolve this is to have a custom HashPartitioner that applies a supplementary
hash function to the return value of the key's hashCode method. This same approach is taken
in java.util.HashMap for the same reason.
> Note that this same approach was proposed in MAPREDUCE-4827, but wasn't committed (mostly)
because of backwards compatibility issues (some people may have counted on certain records
showing up in a given output file). Seeing as Crunch is a higher abstraction above MR, I assume
that we don't need to worry about the backwards compatibility issue as much, but there may
be other opinions on this.

This message was sent by Atlassian JIRA

View raw message