crunch-user mailing list archives

From Kidong Lee <mykid...@gmail.com>
Subject Re: secondary sort in crunch on spark.
Date Wed, 24 Jun 2015 13:20:25 GMT
Thanks for your reply; your answer helped me understand how it works.

I have another question. Is there any plan to support Tez as an execution
engine for Crunch?

- Kidong.

2015-06-23 23:06 GMT+09:00 Josh Wills <jwills@cloudera.com>:

> Hey Kidong,
>
> The short answer is that we cheat. The class to look at for the
> implementation details is:
>
>
> https://github.com/apache/crunch/blob/master/crunch-spark/src/main/java/org/apache/crunch/impl/spark/collect/PGroupedTableImpl.java
>
> ...and you sort of have to walk through three different tricks we use to
> make MapReduce partitioners, sorting classes, and grouping classes (all of
> which we use in the secondary sort implementation) work on Spark.
>
> J
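
For readers unfamiliar with the three MapReduce roles named above, here is a minimal plain-Java sketch of what each one contributes to a secondary sort. It is not the code in PGroupedTableImpl and uses neither Crunch nor Spark; the class, method, and field names are all hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondarySortSketch {

    /** A (key, value) record; all names in this sketch are hypothetical. */
    static final class Rec {
        final String key;
        final int value;

        Rec(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Partitioner role: route on the primary key only, so a key never spans partitions. */
    static int partition(Rec r, int numPartitions) {
        return Math.floorMod(r.key.hashCode(), numPartitions);
    }

    /** Sorting role: order by primary key first, then by value. */
    static final Comparator<Rec> SORT_ORDER =
            Comparator.comparing((Rec r) -> r.key).thenComparingInt(r -> r.value);

    /** Grouping role: records belong to the same group when their primary keys match. */
    static boolean sameGroup(Rec a, Rec b) {
        return a.key.equals(b.key);
    }

    /** Applies the three roles in turn and returns each key with its values in sorted order. */
    static Map<String, List<Integer>> secondarySort(List<Rec> input, int numPartitions) {
        // 1. Partition on the primary key.
        List<List<Rec>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Rec r : input) {
            partitions.get(partition(r, numPartitions)).add(r);
        }

        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (List<Rec> part : partitions) {
            // 2. Sort each partition by (key, value).
            part.sort(SORT_ORDER);
            // 3. Start a new group whenever the primary key changes.
            Rec previous = null;
            List<Integer> current = null;
            for (Rec r : part) {
                if (previous == null || !sameGroup(previous, r)) {
                    current = new ArrayList<>();
                    grouped.put(r.key, current);
                }
                current.add(r.value);
                previous = r;
            }
        }
        return grouped;
    }
}
```

Because the partitioner and the grouping check look only at the primary key while the sort order also looks at the value, each group receives its values already sorted, with no per-key buffering and re-sorting.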
>
> On Tue, Jun 23, 2015 at 6:57 AM, David Ortiz <dpo5003@gmail.com> wrote:
>
>> Correct me if I'm wrong, but if you are using an Avro record or a Tuple
>> data structure, couldn't you get a secondary sort by just putting the
>> fields in the order you want the sort applied, and then using the regular
>> sort API?  For example, if I had, say, itemid, itemprice, nosold and I
>> wanted to do something like...
>>
>> select itemid, itemprice, sum(nosold) from table group by itemid,
>> itemprice order by itemid, itemprice asc;
>>
>> I could implement that as...
>> PTable<Pair<Integer, Double>, Long> items = {...some code to load the
>> data into this structure...};
>> Sort.sort(items.groupByKey().combineValues(Aggregators.SUM_LONGS()))
>> and get something similar, right?
>>
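
The composite-key idea in the message above can be tried out without a cluster. The following is a plain-Java sketch of the same group, sum, and sort steps, with a sorted map standing in for the pipeline; it does not use the Crunch API, and the class and method names are hypothetical. Only the field names itemid, itemprice, and nosold come from the message.

```java
import java.util.AbstractMap;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CompositeKeySortSketch {

    /** One input row; field names follow the example in the message. */
    static final class Sale {
        final int itemId;
        final double itemPrice;
        final long noSold;

        Sale(int itemId, double itemPrice, long noSold) {
            this.itemId = itemId;
            this.itemPrice = itemPrice;
            this.noSold = noSold;
        }
    }

    /** Composite key order: itemid first, then itemprice, both ascending. */
    static final Comparator<Map.Entry<Integer, Double>> KEY_ORDER =
            Comparator.comparing((Map.Entry<Integer, Double> e) -> e.getKey())
                    .thenComparing(e -> e.getValue());

    /** Groups by (itemid, itemprice), sums nosold, and keeps the keys in sorted order. */
    static TreeMap<Map.Entry<Integer, Double>, Long> sumSoldByItemAndPrice(List<Sale> sales) {
        TreeMap<Map.Entry<Integer, Double>, Long> totals = new TreeMap<>(KEY_ORDER);
        for (Sale s : sales) {
            Map.Entry<Integer, Double> key =
                    new AbstractMap.SimpleImmutableEntry<Integer, Double>(s.itemId, s.itemPrice);
            // merge() adds to the running total for this composite key.
            totals.merge(key, s.noSold, Long::sum);
        }
        return totals;
    }
}
```

This works here because the values are collapsed into a single sum per key, so nothing is left to order within a group. When every value for a key has to arrive in order, as in the original question, ordering the keys alone is not enough.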
>>
>> On Tue, Jun 23, 2015 at 8:52 AM Kidong Lee <mykidong@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have been using Spark to implement our recommendation algorithm, but
>>> it was hard to get a secondary sort by value there, so I implemented the
>>> algorithm with the help of Hive instead.
>>> As far as I know, Spark does not support secondary sort yet.
>>>
>>> I have recently implemented the same recommendation algorithm in Crunch
>>> running on Spark, using the Crunch secondary sort API.
>>>
>>> I am wondering how secondary sort is implemented in Crunch when it runs
>>> on Spark.
>>>
>>> Can anybody explain the implementation of secondary sort in
>>> crunch-spark?
>>>
>>> Thanks,
>>>
>>> - Kidong.
>>>
>>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
