Date: Mon, 10 Jun 2013 07:03:22 +0000 (UTC)
From: "Gabriel Reid (JIRA)"
To: crunch-dev@incubator.apache.org
Reply-To: dev@crunch.apache.org
Subject: [jira] [Commented] (CRUNCH-215) Add BloomFilterJoinStrategy

[ https://issues.apache.org/jira/browse/CRUNCH-215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679353#comment-13679353 ]

Gabriel Reid commented on CRUNCH-215:
-------------------------------------

Thanks for looking at it Josh.

As for #1, yeah, I was banging my head on that for quite a while. I tried writing some integration tests to replicate what I was running into, but I wasn't able to get the same problem.
My best guess is that this is specifically an issue with the integration tests and the shape of the pipeline that is used there, although I didn't check that specifically.

For #2, that was actually what I had in mind at first, but then it felt like it was too specific to this one use case -- are there any other use cases where we want to have a toBytes method?

> Add BloomFilterJoinStrategy
> ---------------------------
>
>                 Key: CRUNCH-215
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-215
>             Project: Crunch
>          Issue Type: New Feature
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: CRUNCH-215.patch
>
>
> Bloom filters can be very effective for pre-filtering one side of a join when one side of the join has a small subset of the keys of the other side (i.e. there are many keys on one side that will not be joined).
> The Bloom filter can be built up based on the keys of one side of the join (the side with fewer keys), and then can be applied as a filter to the other side of the join before it is sent through the shuffle and reduce phases.
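
For readers of the archive, the pre-filtering idea in the issue description can be sketched roughly as follows. This is a minimal, in-memory illustration in plain Python, not the Crunch API and not the code in CRUNCH-215.patch. The names `BloomFilter`, `might_contain` and `bloom_filter_join`, the SHA-256 based hashing, and the default sizes are all assumptions made for this example, and the final dictionary lookup only stands in for the shuffle and reduce phases.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (hypothetical helper, not a Crunch or Hadoop class)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def bloom_filter_join(small_side, large_side):
    """Inner-join two lists of (key, value) pairs, pre-filtering large_side."""
    # 1. Build the filter from the keys of the side with fewer keys.
    bloom = BloomFilter()
    for key, _ in small_side:
        bloom.add(key)

    # 2. Drop records from the other side whose keys cannot match.
    #    False positives may slip through; matching keys are never dropped.
    candidates = [(k, v) for k, v in large_side if bloom.might_contain(k)]

    # 3. Exact join on the survivors (stand-in for shuffle and reduce).
    small_by_key = {}
    for k, v in small_side:
        small_by_key.setdefault(k, []).append(v)
    joined = [
        (k, (small_value, large_value))
        for k, large_value in candidates
        for small_value in small_by_key.get(k, [])
    ]
    return joined, len(candidates)
```

Because a Bloom filter can return false positives but never false negatives, the pre-filter only reduces how many records reach the exact join; it does not change the join result.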