X-Original-To: apmail-crunch-dev-archive@www.apache.org
Delivered-To: apmail-crunch-dev-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
    by minotaur.apache.org (Postfix) with SMTP id 4C2E710B69
    for ; Mon, 10 Jun 2013 14:48:24 +0000 (UTC)
Received: (qmail 5210 invoked by uid 500); 10 Jun 2013 14:48:23 -0000
Delivered-To: apmail-crunch-dev-archive@crunch.apache.org
Received: (qmail 4735 invoked by uid 500); 10 Jun 2013 14:48:23 -0000
Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@crunch.apache.org
Delivered-To: mailing list dev@crunch.apache.org
Received: (qmail 4632 invoked by uid 500); 10 Jun 2013 14:48:22 -0000
Delivered-To: apmail-incubator-crunch-dev@incubator.apache.org
Received: (qmail 4622 invoked by uid 99); 10 Jun 2013 14:48:22 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28)
    by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 10 Jun 2013 14:48:22 +0000
Date: Mon, 10 Jun 2013 14:48:22 +0000 (UTC)
From: "Josh Wills (JIRA)"
To: crunch-dev@incubator.apache.org
Subject: [jira] [Commented] (CRUNCH-215) Add BloomFilterJoinStrategy
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


    [ https://issues.apache.org/jira/browse/CRUNCH-215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679553#comment-13679553 ]

Josh Wills commented on CRUNCH-215:
-----------------------------------

My thought was that we would want it for MSCR fusion, since we could take multiple groupByKey operations over the same dataset and map them all to a PTable, ByteBuffer> (or the logical equivalent) in the MSCRPlanner class, and then undo the byte mapping on the reduce side and continue on our merry way.
There would be restrictions on our ability to do this (e.g., we couldn't do it for joins or other jobs that need custom partitioners/sorts), but I think it would still be generally useful. Again, not necessary for this JIRA, but we should keep it in mind for the future.

> Add BloomFilterJoinStrategy
> ---------------------------
>
>                 Key: CRUNCH-215
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-215
>             Project: Crunch
>          Issue Type: New Feature
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: CRUNCH-215.patch
>
>
> Bloom filters can be very effective for pre-filtering one side of a join when one side of the join has a small subset of the keys of the other side (i.e. there are many keys on one side that will not be joined).
> The Bloom filter can be built up based on the keys of one side of the join (the side with fewer keys), and then can be applied as a filter to the other side of the join before it is sent through the shuffle and reduce phases.
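
For readers unfamiliar with the technique, the pre-filtering idea in the issue description can be sketched roughly as follows. This is a minimal in-memory illustration in plain Python, not Crunch's Java API and not the contents of CRUNCH-215.patch. The names BloomFilter and bloom_filter_join, the SHA-256 double-hashing scheme, and the default sizes are all assumptions made for the example, and a dictionary-based join stands in for the shuffle and reduce phases.

```python
import hashlib


class BloomFilter:
    """Hypothetical fixed-size Bloom filter (illustrative, not Crunch's)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive several bit positions from one digest via double hashing.
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero stride
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def bloom_filter_join(small_side, large_side, num_bits=1024, num_hashes=3):
    """Inner-join two lists of (key, value) pairs.

    Returns the joined records and the number of large-side records
    that survived the Bloom filter.
    """
    # Step 1: build the filter from the keys of the side with fewer keys.
    bloom = BloomFilter(num_bits, num_hashes)
    for key, _ in small_side:
        bloom.add(key)

    # Step 2: drop large-side records whose keys cannot possibly match,
    # before they would be sent through the shuffle.
    survivors = [(k, v) for k, v in large_side if bloom.might_contain(k)]

    # Step 3: an ordinary join on the survivors (stand-in for shuffle and
    # reduce). Any Bloom-filter false positives are discarded here.
    by_key = {}
    for k, v in small_side:
        by_key.setdefault(k, []).append(v)
    joined = [(k, (sv, lv)) for k, lv in survivors for sv in by_key.get(k, [])]
    return joined, len(survivors)
```

A Bloom filter never produces false negatives, so the join result is the same as an unfiltered join; false positives only mean that a few extra records reach the final join step. The benefit grows with the share of large-side keys that have no match on the small side.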