incubator-crunch-user mailing list archives

From Ashish <paliwalash...@gmail.com>
Subject Re: JoinFn queries
Date Fri, 16 Nov 2012 16:44:05 GMT
Thanks, Gabriel!

Let me try these tricks out.

Seems like we could have a section for Crunch Recipes, or "Crunchies" :)


On Fri, Nov 16, 2012 at 9:17 PM, Gabriel Reid <gabriel.reid@gmail.com> wrote:

> Hi Ashish,
>
> Answers to your questions inlined below.
>
> > 1. Which of the join functions need one of the PTables in memory? From the
> > documentation, I gather that MapsideJoin does.
>
> MapsideJoin is indeed the only join implementation that loads a side
> of the join into memory. The other (core) join implementations rely
> fully on the MapReduce framework to bring linked records together.
> Obviously this means that MapsideJoin should only be used if one side
> of your join is small enough to fit in memory (although in this case,
> you can get much better performance).
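To illustrate what "loads a side of the join into memory" means in practice,
here is a minimal plain-Java sketch of a map-side (hash) join. It is not
Crunch's MapsideJoin implementation and uses no Crunch classes:
HashJoinSketch, innerJoin, and the record shapes are hypothetical names chosen
for the example, and the inner-join behaviour is an assumption made to keep it
short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Crunch.
final class HashJoinSketch {
    private HashJoinSketch() {}

    /**
     * Joins a streamed (large) side against a small side held in memory.
     * streamed: (adId, timestamp) records, read one at a time.
     * inMemory: adId -> impression flag, small enough to fit in a map.
     * Returns one "adId,timestamp,flag" line per streamed record with a match.
     */
    static List<String> innerJoin(List<Map.Entry<String, Long>> streamed,
                                  Map<String, Long> inMemory) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> rec : streamed) {
            // Lookup in the in-memory side replaces the shuffle and sort
            // that a reduce-side join would need.
            Long match = inMemory.get(rec.getKey());
            if (match != null) {
                out.add(rec.getKey() + "," + rec.getValue() + "," + match);
            }
        }
        return out;
    }
}
```

The whole small side has to fit in the map, which is why this style of join
only suits cases where one input is small.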
>
> > 2. I am playing around with JoinFn to merge two datasets; the scenario is
> > detailed below.
> >
> > Scenario: Cooked this up to play around with Crunch
> >
> > One file has Ads Returned and time stamp in format
> > <Ad Id>, <long timestamp>
> >
> > Other file has just Ad Ids, for which impressions were received
> > <Ad Id>
> >
> > The objective is to join the data so that we can know which Ads got
> > impressions. The impression table would be 90% (random) the size of the
> > Ads table. In short, the table cannot fit in memory.
> >
> > The way I did the join is to load both of them into PTables: for the Ads
> > returned table, (Ad Id, timestamp), and for the Impression table, Ad Id
> > and an Integer.
> >
> > And join them using the code
> >
> > PTable<String, Pair<Long, Long>> joinedData =
> > Join.leftJoin(adsReturnedTable, impressionTable);
> >
> > The result is Ad Id, timestamp, Is Impressed.
> >
>
> The approach that you're taking sounds good, and should scale up
> without problems.
>
>
> > The code works for a small test data set. One problem I am facing is that,
> > for the Ad Ids where an impression is not present, the output looks like
> >
> > a18f1f89-21e1-4fa9-8d24-54702fb9bdeb [1353062206438,]
> >
> > for the others it's
> > f2978128-6e40-4edb-ad3a-5e0ce5e11440 [1353062206479,1]
> >
> > a. How can I make a 0 (zero) appear when a match is not found? From my
> > exploration, I need to write join() and add a check on pair.second() while
> > emitting. Is there another way to achieve this?
>
> The impression value is null in the value pair in this case. You can
> replace this with a zero by calling parallelDo on the joined PTable
> with your own subclass of MapFn. The MapFn subclass just needs to
> replace the Pair containing a null with a Pair containing a 0 as the
> second value.
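A minimal sketch of the null replacement described above, written in plain
Java so that it runs without Crunch on the classpath. Map.Entry stands in for
Crunch's Pair, and ImpressionDefaults and withZeroDefault are hypothetical
names; in a real pipeline the same check would presumably sit inside the map
method of the MapFn subclass passed to parallelDo.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical helper; Map.Entry<Long, Long> stands in for the joined value
// (timestamp, impression flag) that Crunch would represent as a Pair.
final class ImpressionDefaults {
    private ImpressionDefaults() {}

    /** Returns a copy of the joined value with a null flag replaced by 0. */
    static Map.Entry<Long, Long> withZeroDefault(Map.Entry<Long, Long> joined) {
        Long impressed = joined.getValue();
        // A left join leaves the right-hand value null when no match exists.
        Long flag = (impressed == null) ? Long.valueOf(0L) : impressed;
        return new SimpleImmutableEntry<>(joined.getKey(), flag);
    }
}
```

Values that already carry a flag pass through unchanged, so the function can
be applied to every record of the joined table.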
>
> > 3. How can I hook in a custom output formatter while writing a PTable?
> > For the above output, I want to get something like
> >
> > f2978128-6e40-4edb-ad3a-5e0ce5e11440,1353062206479,1
>
> The easiest way to do this is to just implement a MapFn that does the
> necessary string formatting in the map method, and then apply it to
> the PTable just before you write the output.
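The string formatting itself could look something like the sketch below. It is
plain Java with no Crunch dependency; AdRecordFormatter and toCsvLine are
hypothetical names, and treating a null flag as 0 is an assumption that only
matters if the zero replacement has not already been applied. In Crunch, this
logic would presumably live in the map method of a MapFn that turns each
joined record into a String.

```java
// Hypothetical formatter; not part of Crunch.
final class AdRecordFormatter {
    private AdRecordFormatter() {}

    /** Formats one joined record as "adId,timestamp,flag". */
    static String toCsvLine(String adId, long timestamp, Long impressed) {
        // Assumption: a missing impression (null) is written as 0.
        long flag = (impressed == null) ? 0L : impressed.longValue();
        return adId + "," + timestamp + "," + flag;
    }
}
```

For example, toCsvLine("f2978128-6e40-4edb-ad3a-5e0ce5e11440", 1353062206479L, 1L)
returns "f2978128-6e40-4edb-ad3a-5e0ce5e11440,1353062206479,1".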
>
> > I plan to publish the finished code and all the findings in a 4th blog
> > post on Crunch.
>
> Cool! Spreading the word about Crunch definitely sounds good.
>
> Hope all this helps, and let me know if anything isn't clear.
>
> Regards,
>
> Gabriel
>



-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
