incubator-crunch-user mailing list archives

From Ashish <paliwalash...@gmail.com>
Subject Re: Sorting output of WordCount example
Date Wed, 14 Nov 2012 15:34:53 GMT
Thanks Josh !

This helps. I am slowly getting the hang of things. The top() function will
do for the time being.
For now, my full focus is on understanding how things work and completing my
blog series on Crunch.

Let's see if a patch comes out of this :)

thanks
ashish


On Wed, Nov 14, 2012 at 7:51 PM, Josh Wills <jwills@cloudera.com> wrote:

> Hey Ashish,
>
> The sort function operates on the keys, which are already sorted. For
> getting the maximum values from a PTable<String, Long>, there is the
> built-in top(N) function, where N is the number of entries that you
> want returned, which will be faster than doing a full sort when you
> only want the top several values. To do a full sort on the values, you
> would need to swap the keys and the values and then call sort,
> something like this:
>
> PTable<String, Long> counts = ...;
> PTable<Long, String> switched = counts.parallelDo(
>     new MapFn<Pair<String, Long>, Pair<Long, String>>() {
>       @Override
>       public Pair<Long, String> map(Pair<String, Long> input) {
>         return Pair.of(input.second(), input.first());
>       }
>     },
>     Avros.tableOf(Avros.longs(), Avros.strings()));
> switched.sort();
>
> I'm not sure how common the full sort-on-value is relative to just
> getting a sample of the values via top(), but I could certainly see
> adding the key-value switching logic to org.apache.crunch.lib.PTables,
> and would gladly accept a patch to do that.
>
> Josh
>
> On Wed, Nov 14, 2012 at 3:36 AM, Ashish <paliwalashish@gmail.com> wrote:
> > Hi All,
> >
> > I am a newbie to Crunch. I have been playing with the examples in
> > standalone mode. I was trying to extend the WordCount example, but got stuck.
> >
> > I want to extend the WordCount example to sort the output by max word
> > count.
> > I tried using (PCollections.sort)
> >
> > PTable<String, Long> counts = words.count();
> > words.sort(false);
> >
> > This code had no effect. Using crunch-0.4 release (under voting)
> >
> > Is there a simple way to achieve this, or do I need to modify the code
> > according to the SecondarySort example?
> >
> > I already have a blog post based on WordCount and want to extend the same
> > example for sorting, so that it's easy to understand.
> >
> > --
> > thanks
> > ashish
> >
> > Blog: http://www.ashishpaliwal.com/blog
> > My Photo Galleries: http://www.pbase.com/ashishpaliwal
>
>
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
>



-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
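
For anyone following the thread, here is a minimal plain-Java sketch of the two approaches Josh describes above: swapping keys and values and then sorting, versus keeping only the top N counts. It deliberately uses only java.util collections rather than the Crunch API, so the class and method names (SortByCount, swapAndSort, topN) are hypothetical, and the bounded heap is just one way to get a top-N result, not necessarily how Crunch's top() works internally.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical illustration class; not part of Crunch.
public class SortByCount {

    // Highest count first; ties are broken alphabetically by word so the
    // result does not depend on the iteration order of the input map.
    static final Comparator<Map.Entry<Long, String>> BY_COUNT_DESC =
        Map.Entry.<Long, String>comparingByKey().reversed()
            .thenComparing(Map.Entry.<Long, String>comparingByValue());

    // Full sort on values: turn each (word, count) into (count, word),
    // then sort on the new key.
    static List<Map.Entry<Long, String>> swapAndSort(Map<String, Long> counts) {
        List<Map.Entry<Long, String>> switched = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            switched.add(new AbstractMap.SimpleImmutableEntry<>(e.getValue(), e.getKey()));
        }
        switched.sort(BY_COUNT_DESC);
        return switched;
    }

    // Top N only: keep at most n entries in a heap whose head is the
    // lowest-ranked entry seen so far, and evict it when the heap overflows.
    static List<Map.Entry<Long, String>> topN(Map<String, Long> counts, int n) {
        PriorityQueue<Map.Entry<Long, String>> heap =
            new PriorityQueue<>(BY_COUNT_DESC.reversed());
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.add(new AbstractMap.SimpleImmutableEntry<>(e.getValue(), e.getKey()));
            if (heap.size() > n) {
                heap.poll(); // drop the current lowest-ranked entry
            }
        }
        List<Map.Entry<Long, String>> result = new ArrayList<>(heap);
        result.sort(BY_COUNT_DESC);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("the", 5L);
        counts.put("crunch", 2L);
        counts.put("sort", 3L);
        System.out.println(swapAndSort(counts));
        System.out.println(topN(counts, 2));
    }
}
```

The full sort touches every entry, while the top-N variant never holds more than n + 1 entries at a time, which is the intuition behind top(N) being cheaper when only a few values are needed.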
