Date: Sat, 8 Jun 2013 19:32:20 +0000 (UTC)
From: "Gabriel Reid (JIRA)"
To: crunch-dev@incubator.apache.org
Reply-To: dev@crunch.apache.org
Subject: [jira] [Commented] (CRUNCH-213) Add sharded join functionality

[ https://issues.apache.org/jira/browse/CRUNCH-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678828#comment-13678828 ]

Gabriel Reid commented on CRUNCH-213:
-------------------------------------

Josh, thanks for looking at it.

1) Yep, thanks for catching that on hadoop-2. It is indeed an issue with the generation of the TaskAttemptContext, and I'll upload a patch in a couple of minutes.
2) The default join strategy caches the values for a single key at a time in memory from the left table, not the whole table, so replicating the left side won't result in more memory being used for this caching. The general intention is that the left table is (much) smaller than the right table, so by replicating the left table and sharding the right table, the size of the shuffle stays nearly the same, whereas replicating the right table would make the shuffle much larger (depending on the proportions of the left and right tables, of course).
3) Correct, there's no default sampling-based strategy yet. Like you thought, I wanted to keep this functionality possible, but not worry about it specifically yet.

> Add sharded join functionality
> ------------------------------
>
>                 Key: CRUNCH-213
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-213
>             Project: Crunch
>          Issue Type: New Feature
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: CRUNCH-213.patch, CRUNCH-213.patch
>
>
> Performing joins where a large proportion of the values on one or both sides of the join are mapped to a single key can result in poor performance, as one (or a small number) of reducers end up handling most of the joining work, leaving the rest of the cluster idle.
> Sharded joining should be added to allow splitting up join keys, thereby distributing values mapped to a single key over multiple reducer partitions.
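
For readers following the thread, the "replicate the left table, shard the right table" idea from point 2 can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in CRUNCH-213.patch and not the Crunch API: the class name ShardedJoinSketch, the helpers kv, shardKey, replicateLeft, shardRight and join, and the round-robin shard assignment are all hypothetical, and in-memory maps stand in for the shuffle.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

// Hypothetical illustration only; none of these names come from Crunch.
class ShardedJoinSketch {

    // Builds a (key, value) record for either side of the join.
    static Entry<String, String> kv(String key, String value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    // Composite shuffle key: the original join key plus a shard number.
    static Entry<String, Integer> shardKey(String key, int shard) {
        return new SimpleImmutableEntry<>(key, shard);
    }

    // Left (small) side: copy every record into each shard of its key.
    static Map<Entry<String, Integer>, List<String>> replicateLeft(
            List<Entry<String, String>> left, int numShards) {
        Map<Entry<String, Integer>, List<String>> groups = new HashMap<>();
        for (Entry<String, String> rec : left) {
            for (int shard = 0; shard < numShards; shard++) {
                groups.computeIfAbsent(shardKey(rec.getKey(), shard), k -> new ArrayList<>())
                      .add(rec.getValue());
            }
        }
        return groups;
    }

    // Right (large) side: send each record to exactly one shard.
    // Round-robin is an assumption made here; any even assignment would do.
    static Map<Entry<String, Integer>, List<String>> shardRight(
            List<Entry<String, String>> right, int numShards) {
        Map<Entry<String, Integer>, List<String>> groups = new HashMap<>();
        int next = 0;
        for (Entry<String, String> rec : right) {
            int shard = next++ % numShards;
            groups.computeIfAbsent(shardKey(rec.getKey(), shard), k -> new ArrayList<>())
                  .add(rec.getValue());
        }
        return groups;
    }

    // Inner join performed independently per (key, shard) group.
    // Results are "key:leftValue:rightValue" strings, sorted for stable output.
    static List<String> join(List<Entry<String, String>> left,
                             List<Entry<String, String>> right, int numShards) {
        Map<Entry<String, Integer>, List<String>> leftGroups = replicateLeft(left, numShards);
        Map<Entry<String, Integer>, List<String>> rightGroups = shardRight(right, numShards);
        List<String> joined = new ArrayList<>();
        for (Entry<Entry<String, Integer>, List<String>> group : rightGroups.entrySet()) {
            List<String> leftValues = leftGroups.get(group.getKey());
            if (leftValues == null) {
                continue; // no left record for this key: nothing to emit in an inner join
            }
            for (String lv : leftValues) {
                for (String rv : group.getValue()) {
                    joined.add(group.getKey().getKey() + ":" + lv + ":" + rv);
                }
            }
        }
        Collections.sort(joined);
        return joined;
    }

    public static void main(String[] args) {
        List<Entry<String, String>> left = Arrays.asList(kv("hot", "L1"), kv("cold", "L2"));
        List<Entry<String, String>> right = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            right.add(kv("hot", "R" + i)); // skewed: most right-side values share one key
        }
        right.add(kv("cold", "R6"));
        System.out.println(join(left, right, 3));
    }
}
```

In this sketch the shuffle grows only by the extra copies of the small left side (numShards copies of each left record), while each right-side record is still emitted once, which matches the reasoning in point 2 about why the left table is the one to replicate.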