Date: Sat, 16 Jun 2012 10:49:25 +0200
Subject: Re: map side ("replicated") joins in Crunch
From: Gabriel Reid <gabriel.reid@gmail.com>
To: crunch-dev@incubator.apache.org
Cc: Joseph Adler <joseph.adler@gmail.com>,
 Crunch Dev <crunch-dev@cloudera.org>,
	Josh Wills <jwills@cloudera.com>

>> One of the functions that I find most useful in Pig is the map side
>> join; Pig will put a file in the distributed cache, load it into
>> memory, and do a join from the mappers. I'd like to add this to
>> Crunch, but wasn't sure what the best way to do this would be. Do any
>> of you guys have any thoughts on this?
>
> I have a few, but they're not quite baked yet. We should have some
> other folks weigh in.
>

Map-side joins are definitely #1 on my wish list for Crunch, and I've
been thinking a lot lately about how to implement them.

One of the ideas that I've had about this is adding an overload of the
join method on PTable to allow supplying join settings, for example
something like this:

    JoinSettings joinSettings = new JoinSettings();
    joinSettings.setJoinOperation(JoinSettings.LeftOuterJoin);
    joinSettings.allowMapsideJoin();
    PTable<K, Pair<U,V>> joined = tableA.join(tableB, joinSettings);

The idea is that you could let Crunch decide (at job creation time)
whether a join is done in memory or not, depending on the size of
(one of) the incoming tables or any other heuristics. If a join is
performed with a JoinSettings that has allowMapsideJoin set, the
developer needs to be aware that there is a good chance that the
joined table won't be sorted by key (whereas it will be sorted if a
standard join is used).
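
A rough sketch of that decision in plain Java, to make the idea
concrete. Everything here is hypothetical and none of it is existing
Crunch API: the JoinStrategy enum, the setMaxInMemoryBytes threshold
and its 64 MB default, and the chooseStrategy method are all
assumptions made for illustration. The size estimate is optional
because the size of an input may not be known when the job is planned.

```java
import java.util.OptionalLong;

class JoinPlannerSketch {

    /** The two ways a join could be executed (hypothetical names). */
    enum JoinStrategy { MAP_SIDE, REDUCE_SIDE }

    /** Hypothetical settings object, modelled on the JoinSettings idea above. */
    static final class JoinSettings {
        private boolean mapsideAllowed = false;
        // Assumed default threshold; a real value would need tuning.
        private long maxInMemoryBytes = 64L * 1024 * 1024;

        void allowMapsideJoin() { this.mapsideAllowed = true; }
        void setMaxInMemoryBytes(long bytes) { this.maxInMemoryBytes = bytes; }
        boolean isMapsideAllowed() { return mapsideAllowed; }
        long getMaxInMemoryBytes() { return maxInMemoryBytes; }
    }

    /**
     * Picks a strategy at job-creation time. Falls back to the standard
     * reduce-side join unless the caller opted in AND the smaller input
     * has a known size that fits under the threshold.
     */
    static JoinStrategy chooseStrategy(JoinSettings settings, OptionalLong smallerSideBytes) {
        if (!settings.isMapsideAllowed()) {
            return JoinStrategy.REDUCE_SIDE;
        }
        if (!smallerSideBytes.isPresent()) {
            // Unknown size: do not risk loading it into memory.
            return JoinStrategy.REDUCE_SIDE;
        }
        return smallerSideBytes.getAsLong() <= settings.getMaxInMemoryBytes()
                ? JoinStrategy.MAP_SIDE
                : JoinStrategy.REDUCE_SIDE;
    }

    public static void main(String[] args) {
        JoinSettings settings = new JoinSettings();
        settings.allowMapsideJoin();
        // A 10 MB input against the assumed 64 MB threshold.
        System.out.println(chooseStrategy(settings, OptionalLong.of(10L * 1024 * 1024)));
    }
}
```

Because the setting only allows (rather than forces) a map-side join,
code written this way has to work whether or not the output comes
back sorted.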

Obviously this approach removes some control from the user in terms of
what exactly happens under the covers, so that's something that we
would need to take into account. However, in my day job, situations
come up quite often where we're using the same code to deal with both
large joins and small joins depending on the dataset, so it would be
nice to use the same Crunch flow for all cases. Of course, it's also
an option to just write this explicitly instead of baking it directly
into Crunch.

In any case, I'm definitely in favor of having map-side joins be
possible (and easy) with PCollections, and not only with
java.util.Collections. There are definitely use cases where you have a
huge dataset that you want to reduce/aggregate down to a small dataset
and then join with another huge dataset.
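
For illustration, the join itself boils down to a hash lookup: load
the small side into memory, then stream the large side past it. The
sketch below uses plain java.util lists as stand-ins for the two
tables, so it shows only the technique and not how it would be wired
into Crunch or the distributed cache. The Pair class, buildLookup and
leftOuterJoin are hypothetical names, and using null for a missing
right-hand value is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapSideJoinSketch {

    /** Minimal immutable pair, standing in for a key/value or value/value tuple. */
    static final class Pair<A, B> {
        final A first;
        final B second;
        Pair(A first, B second) { this.first = first; this.second = second; }
    }

    /** Loads the small side into memory as key -> all values for that key. */
    static <K, V> Map<K, List<V>> buildLookup(List<Pair<K, V>> small) {
        Map<K, List<V>> lookup = new HashMap<>();
        for (Pair<K, V> entry : small) {
            lookup.computeIfAbsent(entry.first, k -> new ArrayList<V>()).add(entry.second);
        }
        return lookup;
    }

    /**
     * Left outer join: every row of the large side is emitted at least once.
     * Rows without a match get null as the right-hand value. The output
     * follows the order of the large side, so it is NOT sorted by key.
     */
    static <K, U, V> List<Pair<K, Pair<U, V>>> leftOuterJoin(
            List<Pair<K, U>> large, Map<K, List<V>> lookup) {
        List<Pair<K, Pair<U, V>>> joined = new ArrayList<>();
        for (Pair<K, U> row : large) {
            List<V> matches = lookup.get(row.first);
            if (matches == null) {
                joined.add(new Pair<K, Pair<U, V>>(row.first, new Pair<U, V>(row.second, null)));
            } else {
                for (V match : matches) {
                    joined.add(new Pair<K, Pair<U, V>>(row.first, new Pair<U, V>(row.second, match)));
                }
            }
        }
        return joined;
    }
}
```

In a real job each mapper would build the lookup once and reuse it
for every record it processes, which is why the small side has to fit
in memory.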

Definitely happy to hear that other people are interested in having
map-side joins as well!

- Gabriel