crunch-user mailing list archives

From Josh Wills <>
Subject Re: secondary sort in crunch on spark.
Date Tue, 23 Jun 2015 14:06:29 GMT
Hey Kidong,

The short answer is that we cheat. The class to look at for the
implementation details is:

...and you sort of have to walk through three different tricks we do to
make MapReduce partitioners, sorting classes, and grouping classes (all of
which we use in the secondary sort implementation) work on Spark.
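
For anyone unfamiliar with the three roles named above, here is a minimal
plain-Java sketch of what a partitioner, a sorting comparator, and a grouping
step each contribute to a secondary sort. It is only a conceptual illustration
under my own assumptions, not the Crunch or Spark implementation: the class
SecondarySortSketch, the KeyValue type, and every method name in it are
hypothetical, and it runs in memory on a single machine.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortSketch {

    // Hypothetical record: a grouping key plus the value to order within the group.
    static final class KeyValue {
        final String key;
        final int value;
        KeyValue(String key, int value) { this.key = key; this.value = value; }
    }

    // Partitioner role: looks only at the grouping key, so every record
    // for a given key lands in the same partition.
    static int partition(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Sorting role: orders by key first, then by value (the "secondary" part).
    static final Comparator<KeyValue> SORT =
        Comparator.comparing((KeyValue kv) -> kv.key).thenComparingInt(kv -> kv.value);

    // Returns one map per partition: key -> values in ascending order.
    static List<Map<String, List<Integer>>> secondarySort(List<KeyValue> input, int numPartitions) {
        List<List<KeyValue>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (KeyValue kv : input) {
            partitions.get(partition(kv.key, numPartitions)).add(kv);
        }

        List<Map<String, List<Integer>>> result = new ArrayList<>();
        for (List<KeyValue> p : partitions) {
            p.sort(SORT);
            // Grouping role: compares the key only, so records that share a key
            // collapse into one group whose values are already in sorted order.
            Map<String, List<Integer>> groups = new LinkedHashMap<>();
            for (KeyValue kv : p) {
                groups.computeIfAbsent(kv.key, k -> new ArrayList<>()).add(kv.value);
            }
            result.add(groups);
        }
        return result;
    }

    public static void main(String[] args) {
        List<KeyValue> input = new ArrayList<>();
        input.add(new KeyValue("b", 3));
        input.add(new KeyValue("a", 2));
        input.add(new KeyValue("b", 1));
        input.add(new KeyValue("a", 1));
        System.out.println(secondarySort(input, 2));
    }
}
```

The point of the sketch is that the partitioner and the grouping step see only
the key, while the sort sees the key and the value together. That split is
what lets each group's values arrive already ordered.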


On Tue, Jun 23, 2015 at 6:57 AM, David Ortiz <> wrote:

> Correct me if I'm wrong, but if you are using an Avro record or a Tuple
> data structure, couldn't you get a secondary sort by just putting the
> fields in the order you want the sort applied, and then using the regular
> sort API? For example, if I had, say, itemid, itemprice, nosold and I
> wanted to do something like:
>
> select itemid, itemprice, sum(nosold) from table group by itemid,
> itemprice order by itemid, itemprice asc;
>
> I could implement that as:
>
> PTable<Pair<Integer, Double>, Long> items = Sort.sort(
>     {...some code to load the data into this structure...}
>         .groupByKey().combineValues(Aggregators.SUM_LONGS()));
>
> and get something similar, right?
>
> On Tue, Jun 23, 2015 at 8:52 AM Kidong Lee <> wrote:
>> Hi,
>> I have been using Spark to implement our recommendation algorithm, but it
>> was hard to get a secondary sort by value there, so I implemented the
>> algorithm with the help of Hive.
>> I think Spark does not support secondary sort yet.
>> I have recently implemented the same recommendation algorithm in Crunch
>> running on Spark, using the Crunch secondary sort API.
>> I am wondering how secondary sort is implemented in Crunch running on Spark.
>> Can anybody explain the implementation of secondary sort in Crunch on
>> Spark?
>> thanks,
>> - Kidong.

Director of Data Science
Cloudera <>
Twitter: @josh_wills <>
