spark-user mailing list archives

From SLiZn Liu <sliznmail...@gmail.com>
Subject Re: Spark DataFrame GroupBy into List
Date Thu, 15 Oct 2015 02:09:38 GMT
Thanks, Michael and java8964!

Does HiveContext also provide a UDF for combining existing lists into a
flattened (not nested) list? (list -> list of lists -[flatten]-> list)
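
For reference, the operation described above (a list of lists flattened into
one list) looks like this on plain Scala collections. This is only a sketch of
the intended semantics, not a Hive or Spark UDF; `FlattenLists`, `idLists`, and
`combine` are names made up for illustration.

```scala
// Sketch on local Scala collections only; no Spark or Hive API is used here.
object FlattenLists {
  // Hypothetical per-group lists, e.g. id_list values from an earlier aggregation.
  val idLists: List[List[Int]] = List(List(1, 2), List(3, 4), List(5))

  // flatten concatenates the inner lists into a single, non-nested list.
  def combine(lists: List[List[Int]]): List[Int] = lists.flatten

  def main(args: Array[String]): Unit =
    println(combine(idLists)) // prints List(1, 2, 3, 4, 5)
}
```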

On Thu, Oct 15, 2015 at 1:16 AM Michael Armbrust <michael@databricks.com>
wrote:

> That's correct.  It is a Hive UDAF.
>
> On Wed, Oct 14, 2015 at 6:45 AM, java8964 <java8964@hotmail.com> wrote:
>
>> My guess is that it is the same as the collect_set UDAF in Hive.
>>
>>
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)
>>
>> Yong
>>
>> ------------------------------
>> From: sliznmailbox@gmail.com
>> Date: Wed, 14 Oct 2015 02:45:48 +0000
>> Subject: Re: Spark DataFrame GroupBy into List
>> To: michael@databricks.com
>> CC: user@spark.apache.org
>>
>>
>> Hi Michael,
>>
>> Can you be more specific about `collect_set`? Is it a built-in function or,
>> if it is a UDF, how is it defined?
>>
>> BR,
>> Todd Leo
>>
>> On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust <michael@databricks.com>
>> wrote:
>>
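>> import org.apache.spark.sql.functions._
>>
>> // Group rows by category and gather each group's ids via collect_set
>> // (a Hive UDAF, as confirmed later in this thread).
>> df.groupBy("category")
>>   .agg(callUDF("collect_set", df("id")).as("id_list"))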
>>
>> On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu <sliznmailbox@gmail.com>
>> wrote:
>>
>> Hey Spark users,
>>
>> I'm trying to group a DataFrame and collect the values of each group into a
>> list instead of counting them.
>>
>> Let's say we have a dataframe as shown below:
>>
>> | category | id |
>> | -------- |:--:|
>> | A        | 1  |
>> | A        | 2  |
>> | B        | 3  |
>> | B        | 4  |
>> | C        | 5  |
>>
>> ideally, after some magic group by (reverse explode?):
>>
>> | category | id_list  |
>> | -------- | -------- |
>> | A        | 1,2      |
>> | B        | 3,4      |
>> | C        | 5        |
>>
>> any tricks to achieve that? Scala Spark API is preferred. =D
>>
>> BR,
>> Todd Leo
>>
>
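
The result asked for at the start of the thread (one row per category, with
that category's ids collected together) can be modelled on plain Scala
collections as below. This is a sketch of the semantics only and does not call
Spark or Hive. `GroupIntoList`, `rows`, and `idSets` are hypothetical names.
Duplicates are dropped to mirror `collect_set`, and the ids are sorted purely
to make the printed output deterministic, which is an assumption of this
sketch rather than something the thread states about `collect_set`.

```scala
// Local-collection model of "group by category, collect ids into a list".
// It mirrors the example table from the thread; it is not the Spark API.
object GroupIntoList {
  // (category, id) rows copied from the example table.
  val rows: Seq[(String, Int)] =
    Seq(("A", 1), ("A", 2), ("B", 3), ("B", 4), ("C", 5))

  // Group rows by category, keep only the ids, drop duplicates
  // (collect_set-style), and sort for a stable result.
  def idSets(rs: Seq[(String, Int)]): Map[String, List[Int]] =
    rs.groupBy(_._1).map { case (category, group) =>
      category -> group.map(_._2).distinct.sorted.toList
    }

  def main(args: Array[String]): Unit =
    idSets(rows).toSeq.sortBy(_._1).foreach { case (category, ids) =>
      println(s"$category | ${ids.mkString(",")}")
    }
}
```

Run as written, `main` prints one line per category: `A | 1,2`, `B | 3,4`,
and `C | 5`, matching the desired table in the original question.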
