crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Josh Wills (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-213) Add sharded join functionality
Date Sat, 08 Jun 2013 16:42:20 GMT


Josh Wills commented on CRUNCH-213:

A few things:

1) The new patch that uses TaskAttemptID for the seed gives me an abstract method error with
hadoop-2. I think we need to update the TaskAttemptContext generated by the MemPipeline to
support reading TaskAttemptID:

testAvroJoin_MemPipeline(org.apache.crunch.lib.join.ShardedInnerJoinIT)  Time elapsed: 0.363
sec  <<< ERROR!
        at org.apache.crunch.DoFn.getTaskAttemptID(
        at org.apache.crunch.lib.join.ShardedJoinStrategy$PreShardRightSideFn.initialize(
        at org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(
        at org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(
        at org.apache.crunch.lib.join.ShardedJoinStrategy.join(
        at org.apache.crunch.lib.join.JoinTester.join(
        at org.apache.crunch.lib.join.JoinTester.testAvroJoin_MemPipeline(

2) I think that the ordering of the replicated and streamed inputs needs to be reversed in
the ShardedJoinStrategy: by my read of the default join strategy, the left table is cached
in memory, and the right one is streamed through each reducer and compared to the cached key-value
pairs for the left. I think that means we want to shard the left table and replicate the right

3) We're not providing a key sampling strategy to go along with the sharded join strategy
yet, right? I'm fine with leaving that as a plugin and something we can do in a followup JIRA,
just wanted to check and make sure I hadn't missed it.
> Add sharded join functionality
> ------------------------------
>                 Key: CRUNCH-213
>                 URL:
>             Project: Crunch
>          Issue Type: New Feature
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: CRUNCH-213.patch, CRUNCH-213.patch
> Performing joins where a large proportion of the values on one or both sides of the join
are mapped to a single key can result in poor performance, as one (or a small number) of reducers
end up handling most of the joining work, leaving the rest of the cluster idle.
> Sharded joining should be added to allow splitting up join keys, thereby distributing
values mapped to a single key over multiple reducer partitions.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:

View raw message