Mailing-List: contact issues-help@spark.apache.org; run by ezmlm
Precedence: bulk
Date: Tue, 17 Oct 2017 14:31:00 +0000 (UTC)
From: "Apache Spark (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.13065168.1492623349000.16645.1508250660428@Atlassian.JIRA>
In-Reply-To: <JIRA.13065168.1492623349000@Atlassian.JIRA>
References: <JIRA.13065168.1492623349000@Atlassian.JIRA> <JIRA.13065168.1492623349175@jira-lw-us.apache.org>
Subject: [jira] [Commented] (SPARK-20396) groupBy().apply() with pandas udf
 in pyspark
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Tue, 17 Oct 2017 14:31:08 -0000


    [ https://issues.apache.org/jira/browse/SPARK-20396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207711#comment-16207711 ]

Apache Spark commented on SPARK-20396:
--------------------------------------

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19517

> groupBy().apply() with pandas udf in pyspark
> --------------------------------------------
>
>                 Key: SPARK-20396
>                 URL: https://issues.apache.org/jira/browse/SPARK-20396
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.1.0
>            Reporter: Li Jin
>            Assignee: Li Jin
>             Fix For: 2.3.0
>
>
> Split-apply-merge is a common pattern in data analysis. It is implemented in many popular data analysis libraries, such as Spark, pandas, and R. The split and merge operations in these libraries are similar to each other and are mostly implemented by grouping operators: for instance, Spark DataFrame has groupBy and pandas DataFrame has groupby. Users familiar with either Spark DataFrame or pandas DataFrame therefore have little difficulty understanding how grouping works in the other library. However, apply is more specific to each library and therefore differs considerably between libraries. A pandas user who knows how to use apply to perform a certain transformation in pandas might not know how to do the same in PySpark. Also, the current implementation of passing data from the Java executor to the Python executor is not efficient, and there is an opportunity to speed it up using Apache Arrow. This feature can enable use cases that combine Spark's grouping operators, such as groupBy, rollUp, cube, and window, with pandas's native apply operator.
> Related work:
> SPARK-13534
> This enables faster data serialization between PySpark and pandas using Apache Arrow. Our work will build on top of this and use the same serialization for pandas udf.
> SPARK-12919 and SPARK-12922
> These implemented two functions, dapply and gapply, in SparkR, which implement a split-apply-merge pattern similar to the one we want to implement in PySpark.
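
For readers unfamiliar with the pattern described in the issue, the following is a minimal sketch of split-apply-merge on the pandas side only. It is an illustration, not code from the issue or the pull request: the function name subtract_mean, the column names id and v, and the sample data are all hypothetical. The three steps are written out explicitly rather than through pandas's own groupby().apply() so that each stage is visible.

```python
import pandas as pd


def subtract_mean(group: pd.DataFrame) -> pd.DataFrame:
    """Apply step: center the hypothetical column `v` within one group."""
    return group.assign(v=group["v"] - group["v"].mean())


# Hypothetical sample data: two groups keyed by `id`.
pdf = pd.DataFrame({"id": [1, 1, 2, 2, 2], "v": [1.0, 2.0, 3.0, 5.0, 10.0]})

# Split: iterating over a groupby yields one (key, sub-DataFrame) pair per group.
groups = [group for _, group in pdf.groupby("id")]

# Apply: run the user function on each group independently.
pieces = [subtract_mean(group) for group in groups]

# Merge: concatenate the per-group results and restore the original row order.
result = pd.concat(pieces).sort_index()

print(result)
```

The point of the feature request is that a function shaped like subtract_mean, taking one pandas DataFrame per group and returning a pandas DataFrame, could be handed to Spark's grouping operators directly, with Spark performing the split and merge steps across executors.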

