incubator-crunch-user mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: JoinFn queries
Date Fri, 16 Nov 2012 15:47:23 GMT
Hi Ashish,

Answers to your questions inlined below.

> 1. Which all join function need one of the PTables in memory? from
> documentation, I could get MapsideJoin has this.

MapsideJoin is indeed the only join implementation that loads one side
of the join into memory. The other (core) join implementations rely
entirely on the MapReduce framework to bring matching records together.
This means that MapsideJoin should only be used if one side of your
join is small enough to fit in memory; when it is, you can get much
better performance.
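
To illustrate the general idea of a join that keeps one side in memory,
here is a rough plain-Java sketch. It is not Crunch's MapsideJoin
implementation; the class and method names (InMemoryJoinSketch, join)
are made up for the example, and it performs a simple inner join over
in-memory collections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustration only: NOT Crunch's MapsideJoin, just the underlying idea.
final class InMemoryJoinSketch {

    private InMemoryJoinSketch() {}

    // Inner-joins a streamed "large" side against a "small" side that is
    // held entirely in a Map. Each output line is "key,largeValue,smallValue".
    static List<String> join(List<String[]> largeSide, Map<String, String> smallSide) {
        List<String> out = new ArrayList<>();
        for (String[] record : largeSide) {            // record = {key, value}
            String match = smallSide.get(record[0]);   // in-memory lookup, no shuffle
            if (match != null) {
                out.add(record[0] + "," + record[1] + "," + match);
            }
        }
        return out;
    }
}
```

The small side has to fit in the Map, which is why this style of join is
only suitable when one input is small.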

> 2. I am playing around with JoinFn to merge two datasets, scenario is
> detailed below.
>
> Scenario: Cooked this up to play around with Crunch
>
> One file has Ads Returned and time stamp in format
> <Ad Id>, <long timestamp>
>
> Other file has just Ad Ids, for which impressions were received
> <Ad Id>
>
> The objective is to join the data so that we can know which Ads got
> impressions and impression table would be 90%(random) the size of Ads table.
> In short, the table cannot fit in memory.
>
> The way I did the join is, load both of them in PTable. For Ads returned
> table (Ad Id, timestamp) and for Impression Table, its Ad Id and an Integer
>
> And join them using the code
>
> PTable<String, Pair<Long, Long>> joinedData =
> Join.leftJoin(adsReturnedTable, impressionTable);
>
> return is Ad Id, timestamp, Is Impressed
>

The approach that you're taking sounds good, and should scale up
without problems.


> The code is working for small test data set. One problem I am facing is, for
> the Ad Ids, where impression is not present, the output is like
>
> a18f1f89-21e1-4fa9-8d24-54702fb9bdeb [1353062206438,]
>
> for other it's
> f2978128-6e40-4edb-ad3a-5e0ce5e11440 [1353062206479,1]
>
> a. How can I make a 0 (zero) appear when the match is not found. From my
> exploration, I need to write join(), and add check on pair.second() while
> emitting. Is there a another way for achieve this.

The impression value is null in the value pair in this case. You can
replace it with a zero by calling parallelDo on the joined PTable with
your own subclass of MapFn. The MapFn subclass just needs to replace
any Pair containing a null with a Pair containing a 0 as the second
value.
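
A minimal sketch of that null replacement, written so that it runs
without Crunch on the classpath: java.util.Map.Entry stands in for
Crunch's Pair<Long, Long>, and the names ImpressionDefaults and
withZeroForMissing are hypothetical. In a real pipeline the same logic
would go in the map method of your MapFn subclass.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical helper. Map.Entry stands in for Crunch's Pair<Long, Long>:
// key = timestamp, value = impression flag (null when the left join found no match).
final class ImpressionDefaults {

    private ImpressionDefaults() {}

    // Returns the pair unchanged when an impression is present,
    // otherwise a new pair with the same timestamp and 0 as the second value.
    static Map.Entry<Long, Long> withZeroForMissing(Map.Entry<Long, Long> value) {
        if (value.getValue() != null) {
            return value;
        }
        return new SimpleImmutableEntry<>(value.getKey(), 0L);
    }
}
```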

> 3. How can be hook custom output formatter while writing PTable. like for
> the above output, want to get something like
>
> f2978128-6e40-4edb-ad3a-5e0ce5e11440,1353062206479,1

The easiest way to do this is to just implement a MapFn that does the
necessary string formatting in the map method, and then apply it to
the PTable just before you write the output.
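
As a rough sketch, the string formatting itself could look like the
following. The class and method names (AdRecordFormatter, toCsvLine)
are hypothetical, and the code is plain Java so that it runs on its
own; in a pipeline, the body of toCsvLine would be what the map method
of your MapFn does with the key and the value pair of each record.

```java
// Hypothetical formatter for one joined record.
final class AdRecordFormatter {

    private AdRecordFormatter() {}

    // Builds a comma-separated line: <Ad Id>,<timestamp>,<impressed>
    static String toCsvLine(String adId, long timestamp, long impressed) {
        return adId + "," + timestamp + "," + impressed;
    }
}
```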

> I plan to publish the finished code and all the finding in 4th blog post on
> crunch.

Cool! Spreading the word about Crunch definitely sounds good.

Hope all this helps, and let me know if anything isn't clear.

Regards,

Gabriel
