crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-213) Add sharded join functionality
Date Sat, 08 Jun 2013 07:09:20 GMT


Gabriel Reid commented on CRUNCH-213:

Chao, thanks for going over the code.

You're correct that the sharding in PreShardRightSideFn isn't deterministic. However, I don't
think that it's a problem here. If speculative execution is enabled, or if a map task fails
and is retried, only a single version of the map output will be committed by the OutputCommitter,
and nothing will be copied over by the reducer until the map output has been committed, so
I think that non-deterministic map output is safe in this case. If I'm missing something there,
please correct me though.

I don't like having a random in there in general, but I'm pretty convinced that a
"real" random is exactly what is needed. My first attempt used the hash code
of the value in the right-side table to determine the shard, but this suffered from the same
issue as skewed joins in general -- if the same value is repeated a lot, then most of the
data would go to that one shard.
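
The difference between the two approaches can be sketched roughly as follows. This is only an illustration and not the code from the CRUNCH-213 patch: the class and method names (ShardChoiceSketch, countByHash, countByRandom), the shard count of 4, and the fixed seed are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: compares how many records land in each shard when the
// shard is derived from the value's hash code versus chosen at random.
final class ShardChoiceSketch {

    // Hash-based choice: the shard depends only on the value, so every
    // repetition of the same value ends up in the same shard.
    static int[] countByHash(List<String> values, int numShards) {
        int[] counts = new int[numShards];
        for (String value : values) {
            // floorMod keeps the index non-negative for negative hash codes.
            counts[Math.floorMod(value.hashCode(), numShards)]++;
        }
        return counts;
    }

    // Random choice: the shard is independent of the value, so repeated
    // values are spread across all shards.
    static int[] countByRandom(List<String> values, int numShards, Random random) {
        int[] counts = new int[numShards];
        for (String value : values) {
            counts[random.nextInt(numShards)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // A heavily skewed input: the same value repeated 10,000 times.
        List<String> values = Collections.nCopies(10_000, "same-value");
        System.out.println("hash:   " + Arrays.toString(countByHash(values, 4)));
        System.out.println("random: " + Arrays.toString(countByRandom(values, 4, new Random(42))));
    }
}
```

With the skewed input above, the hash-based counts put all 10,000 records in a single shard, while the random counts come out roughly even across the four shards.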

> Add sharded join functionality
> ------------------------------
>                 Key: CRUNCH-213
>                 URL:
>             Project: Crunch
>          Issue Type: New Feature
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: CRUNCH-213.patch
>
> Performing joins where a large proportion of the values on one or both sides of the join
> are mapped to a single key can result in poor performance, as one (or a small number) of reducers
> end up handling most of the joining work, leaving the rest of the cluster idle.
> Sharded joining should be added to allow splitting up join keys, thereby distributing
> values mapped to a single key over multiple reducer partitions.
