incubator-crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Sorting output of WordCount example
Date Wed, 14 Nov 2012 14:21:28 GMT
Hey Ashish,

The sort function operates on the keys, which are already sorted. To
get the largest values from a PTable<String, Long>, use the built-in
top(N) function, where N is the number of entries that you want
returned. It will be faster than a full sort when you only want the
top several values. To do a full sort on the values, you would need to
swap the keys and the values and then call sort, something like this:

PTable<String, Long> counts = ...;
PTable<Long, String> switched = counts.parallelDo(
    new MapFn<Pair<String, Long>, Pair<Long, String>>() {
      @Override
      public Pair<Long, String> map(Pair<String, Long> input) {
        return Pair.of(input.second(), input.first());
      }
    },
    Avros.tableOf(Avros.longs(), Avros.strings()));
switched.sort();
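
To make the difference between the two approaches concrete, here is a rough in-memory sketch in plain Java. It does not use Crunch at all, and the names ValueSortSketch, swapAndSort, and topN are made up for illustration. It only assumes the general idea described above: a full sort on values swaps each pair and sorts on the new key, while a top-N query only needs to keep N candidates around at any time.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative only: plain collections standing in for a PTable<String, Long>.
final class ValueSortSketch {

  // Full sort on values: swap (word, count) into (count, word),
  // then sort on the new key, largest count first.
  public static List<Map.Entry<Long, String>> swapAndSort(Map<String, Long> counts) {
    List<Map.Entry<Long, String>> switched = new ArrayList<>();
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      switched.add(new AbstractMap.SimpleEntry<>(e.getValue(), e.getKey()));
    }
    switched.sort((a, b) -> Long.compare(b.getKey(), a.getKey()));
    return switched;
  }

  // Top-N on values: keep a min-heap of at most n entries, so the
  // full data set never has to be sorted.
  public static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
    PriorityQueue<Map.Entry<String, Long>> heap =
        new PriorityQueue<>((a, b) -> Long.compare(a.getValue(), b.getValue()));
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      heap.offer(e);
      if (heap.size() > n) {
        heap.poll(); // drop the entry with the smallest count seen so far
      }
    }
    List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
    result.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
    return result;
  }

  public static void main(String[] args) {
    Map<String, Long> counts = new LinkedHashMap<>();
    counts.put("the", 5L);
    counts.put("crunch", 2L);
    counts.put("sort", 3L);
    System.out.println(swapAndSort(counts)); // every entry, ordered by count
    System.out.println(topN(counts, 2));     // only the two largest counts
  }
}
```

In this sketch the heap in topN never holds more than n + 1 entries, which is the intuition behind top(N) being cheaper than a full sort when N is small. How Crunch implements top(N) internally may well differ.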

I'm not sure how common the full sort-on-value is relative to just
getting a sample of the values via top(), but I could certainly see
adding the key-value switching logic to org.apache.crunch.lib.PTables,
and would gladly accept a patch to do that.

Josh

On Wed, Nov 14, 2012 at 3:36 AM, Ashish <paliwalashish@gmail.com> wrote:
> Hi All,
>
> I am a newbie to Crunch. I have been playing with the examples in standalone
> mode. I was trying to extend the WordCount example, but got stuck.
>
> I want to extend the WordCount example to sort the output by max word count.
> I tried using (PCollections.sort)
>
> PTable<String, Long> counts = words.count();
> words.sort(false);
>
> This code had no effect. I am using the crunch-0.4 release (under voting).
>
> Is there a simple way to achieve this, or do I need to modify the code along
> the lines of the SecondarySort example?
>
> I already have a blog post based on WordCount and want to extend the same
> example for sorting, so that it's easy to understand.
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal



-- 
Director of Data Science
Cloudera
Twitter: @josh_wills
