From: java8964 <java8964@hotmail.com>
To: SLiZn Liu <sliznmailbox@gmail.com>, Michael Armbrust <michael@databricks.com>
CC: user@spark.apache.org
Subject: RE: Spark DataFrame GroupBy into List
Date: Wed, 14 Oct 2015 09:45:44 -0400

My guess is that this is the same as the collect_set UDAF in Hive:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)

Yong
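For completeness, a minimal sketch of invoking it through Spark SQL, assuming Spark 1.5 with a HiveContext (the HiveContext is what makes Hive built-ins such as collect_set resolvable) and assuming the thread's DataFrame `df` was created through that same context; the temp table name `events` is made up for the example:

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)   // `sc` is an existing SparkContext
df.registerTempTable("events")      // expose the DataFrame to SQL queries

// collect_set here resolves to the Hive built-in UDAF of the same name.
val grouped = hiveCtx.sql(
  "SELECT category, collect_set(id) AS id_list FROM events GROUP BY category")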
From: sliznmailbox@gmail.com
Date: Wed, 14 Oct 2015 02:45:48 +0000
Subject: Re: Spark DataFrame GroupBy into List
To: michael@databricks.com
CC: user@spark.apache.org

Hi Michael,

Can you be more specific on `collect_set`? Is it a built-in function or, if it is a UDF, how is it defined?

BR,
Todd Leo

On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust <michael@databricks.com> wrote:

import org.apache.spark.sql.functions._

df.groupBy("category")
  .agg(callUDF("collect_set", df("id")).as("id_list"))

On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu <sliznmailbox@gmail.com> wrote:

Hey Spark users,

I'm trying to group a dataframe, appending each group's occurrences into a list instead of a count.

Let's say we have a dataframe as shown below:

| category | id |
| -------- |:--:|
| A        | 1  |
| A        | 2  |
| B        | 3  |
| B        | 4  |
| C        | 5  |

Ideally, after some magic group by (a reverse explode?), it becomes:

| category | id_list |
| -------- | ------- |
| A        | 1,2     |
| B        | 3,4     |
| C        | 5       |

Any tricks to achieve that? The Scala Spark API is preferred. =D

BR,
Todd Leo
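Putting the pieces of the thread together, a self-contained sketch of the suggested approach, assuming a Spark 1.5 spark-shell started with Hive support, where `sc` and a Hive-backed `sqlContext` are predefined (the Hive function registry is what lets callUDF resolve collect_set); variable names below are illustrative:

import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Sample data from the original question.
val df = sc.parallelize(Seq(
  ("A", 1), ("A", 2), ("B", 3), ("B", 4), ("C", 5)
)).toDF("category", "id")

// callUDF resolves collect_set against the Hive function registry.
// Note: collect_set de-duplicates and does not guarantee element order;
// Hive also provides collect_list (Hive 0.13+), which keeps duplicates.
val grouped = df.groupBy("category")
  .agg(callUDF("collect_set", df("id")).as("id_list"))

grouped.show()
// Expected contents (display format and row order may vary):
//   A -> [1, 2]
//   B -> [3, 4]
//   C -> [5]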