Return-Path: X-Original-To: apmail-incubator-crunch-dev-archive@minotaur.apache.org Delivered-To: apmail-incubator-crunch-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 880FF90BD for ; Sat, 16 Jun 2012 08:49:54 +0000 (UTC) Received: (qmail 9145 invoked by uid 500); 16 Jun 2012 08:49:54 -0000 Delivered-To: apmail-incubator-crunch-dev-archive@incubator.apache.org Received: (qmail 9091 invoked by uid 500); 16 Jun 2012 08:49:53 -0000 Mailing-List: contact crunch-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: crunch-dev@incubator.apache.org Delivered-To: mailing list crunch-dev@incubator.apache.org Received: (qmail 9039 invoked by uid 99); 16 Jun 2012 08:49:52 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 16 Jun 2012 08:49:52 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of gabriel.reid@gmail.com designates 209.85.210.47 as permitted sender) Received: from [209.85.210.47] (HELO mail-pz0-f47.google.com) (209.85.210.47) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 16 Jun 2012 08:49:45 +0000 Received: by dalh21 with SMTP id h21so4780940dal.6 for ; Sat, 16 Jun 2012 01:49:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=f5sAgXYiEWFywtrm9AM/af3NkWvqhamipM2FWx04KuM=; b=lSTyVCsB5lXilJoMFFBY+pWyAZdN9hJoFdTZQAu4vp7Di8TrtZbd9OUnRUj/OB2Aa9 ISm3M9rAf1dz8IgMbw/hTor6Qc7giKISAza/Cg/sdoqRrdUYY88cyHum3SFV68E13xb5 5ITE8dDD7QWZMlTWzU3POxv1kpfTG4bmC6Hc4/do33bRn8uGRKsfftdMwxEi4dZokrTt 8F2szKkM0CtQVU03BVcEcQovMzqfcGNXBrqwAdzuHxbhuUaczvZvicj08L8XYYFDkKPS rBXaWtFksEycChWT50iGIMK1stXuOjax2ziu/9h3UCmCg2YRUt28AkDfZYLZ+fOxyw3Q +Oxg== MIME-Version: 1.0 Received: by 10.68.203.73 with SMTP id ko9mr29363003pbc.66.1339836565467; Sat, 16 Jun 2012 01:49:25 -0700 (PDT) Received: by 10.142.221.20 with HTTP; Sat, 16 Jun 2012 01:49:25 -0700 (PDT) In-Reply-To: References: Date: Sat, 16 Jun 2012 10:49:25 +0200 Message-ID: Subject: Re: map side ("replicated") joins in Crunch From: Gabriel Reid To: crunch-dev@incubator.apache.org Cc: Joseph Adler , Crunch Dev , Josh Wills Content-Type: text/plain; charset=ISO-8859-1 X-Virus-Checked: Checked by ClamAV on apache.org >> One of the functions that I find most useful in Pig is the map side >> join; Pig will put a file in the distributed cache, load it into >> memory, and do a join from the mappers. I'd like to add this to >> Crunch, but wasn't sure what the best way to do this would be. Do any >> of you guys have any thoughts on this? > > I have a few, but they're not quite baked yet. We should have some > other folks weigh in. > Map-side joins is definitely #1 on my wish list of things for Crunch, and it's also something I've been thinking about a lot lately in terms of how to implement it. One of the ideas that I've had about this is adding an overload of the join method on PTable to allow supplying join settings, for example something like this: JoinSettings joinSettings = new JoinSettings(); joinSettings.setJoinOperation(JoinSettings.LeftOuterJoin); joinSettings.allowMapsideJoin(); PTable> joined = tableA.join(tableB, joinSettings); The idea is that you could let Crunch decide (at the time of job creation) if a join would be done in memory or not, depending on the size of (one of) the incoming tables or any other heuristics. If a join is performed with a JoinSettings that has allowMapsideJoin set, then obviously the developer needs to be aware that there is a good chance that the joined table won't be sorted (which will be the case if a standard join is used). Obviously this approach removes some control from the user in terms of what exactly happens under the covers, so that's something that we would need to take into account. However, in my day job situations come up quite often where we're using the same code to deal with both large joins and small joins depending on the dataset, so it would be nice to use the same Crunch flow for all cases. Of course, it's also an option to just write this explicitly instead of baking it directly into Crunch. In any case, I'm definitely in favor of having map-side joins be possible (and easy) with PCollections, and not only with java.util.Collections. There are definitely use cases where you have a huge dataset that you want to reduce/aggregate down to a small dataset and then join with another huge dataset. Definitely happy to hear that other people are interested in having map-side joins as well! - Gabriel