From: Michael Armbrust <michael@databricks.com>
Date: Wed, 14 Oct 2015 10:15:46 -0700
Subject: Re: Spark DataFrame GroupBy into List
To: java8964 <java8964@hotmail.com>
Cc: SLiZn Liu <sliznmailbox@gmail.com>, "user@spark.apache.org" <user@spark.apache.org>

That's correct. It is a Hive UDAF.

On Wed, Oct 14, 2015 at 6:45 AM, java8964 <java8964@hotmail.com> wrote:

> My guess is that it is the same as the collect_set UDAF in Hive:
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)
>
> Yong
>
> ------------------------------
> From: sliznmailbox@gmail.com
> Date: Wed, 14 Oct 2015 02:45:48 +0000
> Subject: Re: Spark DataFrame GroupBy into List
> To: michael@databricks.com
> CC: user@spark.apache.org
>
> Hi Michael,
>
> Can you be more specific about `collect_set`? Is it a built-in function
> or, if it is a UDF, how is it defined?
>
> BR,
> Todd Leo
>
> On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust <michael@databricks.com>
> wrote:
>
> import org.apache.spark.sql.functions._
>
> df.groupBy("category")
>   .agg(callUDF("collect_set", df("id")).as("id_list"))
>
> On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu <sliznmailbox@gmail.com>
> wrote:
>
> Hey Spark users,
>
> I'm trying to group a DataFrame by a column, appending each group's
> occurrences into a list instead of counting them.
>
> Let's say we have a dataframe as shown below:
>
> | category | id |
> | -------- |:--:|
> | A        | 1  |
> | A        | 2  |
> | B        | 3  |
> | B        | 4  |
> | C        | 5  |
>
> Ideally, after some magic group by (a reverse explode?):
>
> | category | id_list |
> | -------- | ------- |
> | A        | 1,2     |
> | B        | 3,4     |
> | C        | 5       |
>
> Any tricks to achieve that? The Scala Spark API is preferred. =D
>
> BR,
> Todd Leo
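For completeness, a minimal self-contained sketch of the approach suggested
above, assuming Spark 1.5.x with a HiveContext (collect_set is resolved as a
Hive UDAF there); the object name, master setting, and sample data are
illustrative only:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.callUDF

object GroupByIntoList {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("GroupByIntoList").setMaster("local[*]"))

    // collect_set is a Hive UDAF, so a HiveContext (not a plain SQLContext)
    // is needed to resolve it by name.
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // The sample data from the original question.
    val df = Seq(
      ("A", 1), ("A", 2),
      ("B", 3), ("B", 4),
      ("C", 5)
    ).toDF("category", "id")

    // Group by category and collect each group's ids into an array column.
    val grouped = df.groupBy("category")
      .agg(callUDF("collect_set", df("id")).as("id_list"))

    grouped.show()
    // Expected contents (row order and array rendering may vary):
    //   A -> [1, 2]
    //   B -> [3, 4]
    //   C -> [5]

    sc.stop()
  }
}

Note that collect_set deduplicates values within each group; Hive also
provides collect_list, which keeps duplicates, and the same callUDF pattern
should work for it as well.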