incubator-crunch-user mailing list archives

From Josh Wills <>
Subject Re: Sorting output of WordCount example
Date Wed, 14 Nov 2012 14:21:28 GMT
Hey Ashish,

The sort function operates on the keys, which are already sorted. For
getting the maximum values from a PTable<String, Long>, there is the
built-in top(N) function, where N is the number of entries that you
want returned, which will be faster than doing a full sort when you
only want the top several values. To do a full sort on the values, you
would need to swap the keys and the values and then call sort,
something like this:

PTable<String, Long> counts = ...;
PTable<Long, String> switched = counts.parallelDo(
    new MapFn<Pair<String, Long>, Pair<Long, String>>() {
      @Override
      public Pair<Long, String> map(Pair<String, Long> input) {
        return Pair.of(input.second(), input.first());
      }
    },
    Avros.tableOf(Avros.longs(), Avros.strings()));
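To make the swap-then-sort idea concrete outside of a pipeline, here is a minimal sketch in plain Java collections. It is only an analogy: the class and method names (SwapAndSortSketch, sortByCountDescending) are made up for illustration and are not part of the Crunch API, and an in-memory Map stands in for the PTable.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java analogy only; none of these names belong to Crunch.
class SwapAndSortSketch {

    // Swap each (word, count) entry to (count, word), then sort on the new key.
    static List<Map.Entry<Long, String>> sortByCountDescending(Map<String, Long> counts) {
        List<Map.Entry<Long, String>> switched = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            switched.add(new SimpleEntry<>(e.getValue(), e.getKey()));
        }
        // Highest count first; ties fall back to alphabetical order of the word
        // so the result is deterministic.
        switched.sort((a, b) -> {
            int byCount = Long.compare(b.getKey(), a.getKey());
            return byCount != 0 ? byCount : a.getValue().compareTo(b.getValue());
        });
        return switched;
    }
}
```

Calling SwapAndSortSketch.sortByCountDescending on a map of word counts returns the entries keyed by count, largest first, which is the shape the switched table would have once it is sorted on its new keys.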

I'm not sure how common the full sort-on-value is relative to just
getting a sample of the values via top(), but I could certainly see
adding the key-value switching logic to org.apache.crunch.lib.PTables,
and would gladly accept a patch to do that.
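For intuition on why a top-N selection can be cheaper than a full sort when only a few values are wanted, here is a small plain-Java sketch that keeps a bounded min-heap of size N. This is one common way to do it and an assumption on my part; it says nothing about how Crunch's top() is actually implemented, and TopNSketch and its top method are hypothetical names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Plain-Java illustration only; not the Crunch implementation.
class TopNSketch {

    // Return the n entries with the largest counts, largest first.
    static List<Map.Entry<String, Long>> top(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Min-heap on count: the smallest retained count sits at the head.
        PriorityQueue<Map.Entry<String, Long>> heap =
            new PriorityQueue<Map.Entry<String, Long>>(
                (a, b) -> Long.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.add(new SimpleEntry<>(e.getKey(), e.getValue()));
            if (heap.size() > n) {
                heap.poll(); // evict the current smallest, keeping at most n entries
            }
        }
        // Only the n survivors need sorting, not the whole input.
        result.addAll(heap);
        result.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return result;
    }
}
```

With m entries in total, the heap never holds more than n + 1 of them, so the work is roughly m log n rather than the m log m of sorting everything.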


On Wed, Nov 14, 2012 at 3:36 AM, Ashish <> wrote:
> Hi All,
> I am a newbie to Crunch. I have been playing with the examples in standalone mode.
> Was trying to extend the WordCount example, but got stuck.
> I want to extend the WordCount example to sort the output by max word count.
> I tried using (PCollections.sort)
> PTable<String, Long> counts = words.count();
> words.sort(false);
> This code had no effect. Using crunch-0.4 release (under voting)
> Is there a simple way to achieve this, or do I need to modify the code along
> the lines of the SecondarySort example?
> I already have a blog post based on WordCount and want to extend the same example
> for sorting, so that it's easy to understand.
> --
> thanks
> ashish

Director of Data Science
Twitter: @josh_wills
