crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CRUNCH-213) Add sharded join functionality
Date Sat, 08 Jun 2013 09:16:20 GMT

     [ https://issues.apache.org/jira/browse/CRUNCH-213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriel Reid updated CRUNCH-213:
--------------------------------

    Attachment: CRUNCH-213.patch

I like the idea of using the task id as the seed for the random number generator to ensure
deterministic behaviour, even if it isn't strictly necessary. Here's an updated patch using
the task id as the seed.
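
To illustrate the seeding idea, here is a minimal sketch. It is not the code from the attached
patch: the class name SeededShardPicker, its methods, and the plain int taskId parameter are
all hypothetical stand-ins. The only point it makes is that two runs of the same task, seeded
with the same id, pick the same sequence of shards.

```java
import java.util.Random;

// Hypothetical helper, not part of Crunch: picks a shard for each record.
final class SeededShardPicker {
    private final Random random;
    private final int numShards;

    // taskId is assumed to be a numeric id that is stable across re-runs of a task.
    SeededShardPicker(int taskId, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        // Seeding with the task id makes the shard sequence repeatable per task.
        this.random = new Random(taskId);
        this.numShards = numShards;
    }

    // Returns a shard number in the range [0, numShards).
    int nextShard() {
        return random.nextInt(numShards);
    }
}
```

With this, a task that is retried sends each record to the same shard as on its first attempt,
provided it reads its input in the same order.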
                
> Add sharded join functionality
> ------------------------------
>
>                 Key: CRUNCH-213
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-213
>             Project: Crunch
>          Issue Type: New Feature
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: CRUNCH-213.patch, CRUNCH-213.patch
>
>
> Performing joins where a large proportion of the values on one or both sides of the join
> are mapped to a single key can result in poor performance, as one (or a small number) of reducers
> end up handling most of the joining work, leaving the rest of the cluster idle.
> Sharded joining should be added to allow splitting up join keys, thereby distributing
> values mapped to a single key over multiple reducer partitions.
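
The key-splitting idea in the description above can be sketched in plain Java. This is an
in-memory simulation under stated assumptions, not the Crunch API and not the patch itself:
the class ShardedJoinSketch, its methods, and the "key#shard" encoding are hypothetical. Each
left-side record is tagged with a random shard, and each right-side record is replicated to
every shard, so the join on the composite (key, shard) gives the same pairs as a plain inner
join while spreading one hot key over numShards partitions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: rows are String[]{key, value} pairs.
final class ShardedJoinSketch {

    // Composite join key: the original key plus a shard number.
    static String shardedKey(String key, int shard) {
        return key + "#" + shard;
    }

    // Inner join; returns rows of {key, leftValue, rightValue}.
    static List<String[]> join(List<String[]> left, List<String[]> right,
                               int numShards, long seed) {
        Random random = new Random(seed);

        // Replicate every right-side row to all shards of its key.
        Map<String, List<String>> rightByShardedKey = new HashMap<>();
        for (String[] kv : right) {
            for (int shard = 0; shard < numShards; shard++) {
                rightByShardedKey
                    .computeIfAbsent(shardedKey(kv[0], shard), k -> new ArrayList<>())
                    .add(kv[1]);
            }
        }

        // Send every left-side row to exactly one randomly chosen shard.
        List<String[]> joined = new ArrayList<>();
        for (String[] kv : left) {
            int shard = random.nextInt(numShards);
            List<String> matches = rightByShardedKey.get(shardedKey(kv[0], shard));
            if (matches == null) {
                continue; // no right-side row for this key
            }
            for (String rightValue : matches) {
                joined.add(new String[] {kv[0], kv[1], rightValue});
            }
        }
        return joined;
    }
}
```

The trade-off is that the replicated side grows by a factor of numShards, so this only pays
off when the skewed side is much larger than the replicated one.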

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
