spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-20396) groupBy().apply() with pandas udf in pyspark
Date Tue, 17 Oct 2017 14:31:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-20396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207711#comment-16207711 ]

Apache Spark commented on SPARK-20396:
--------------------------------------

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19517

> groupBy().apply() with pandas udf in pyspark
> --------------------------------------------
>
>                 Key: SPARK-20396
>                 URL: https://issues.apache.org/jira/browse/SPARK-20396
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.1.0
>            Reporter: Li Jin
>            Assignee: Li Jin
>             Fix For: 2.3.0
>
>
> Split-apply-merge is a common pattern in data analysis. It is implemented in many
popular data analysis libraries, such as Spark, Pandas, and R. The split and merge operations
in these libraries are similar to each other and are mostly implemented by grouping operators.
For instance, a Spark DataFrame has groupBy and a Pandas DataFrame has groupby. Users
familiar with either Spark DataFrames or Pandas DataFrames therefore have little difficulty
understanding how grouping works in the other library. However, apply is more specific to each
library and therefore differs considerably between libraries. A Pandas user who knows how to use
apply to do a certain transformation in Pandas might not know how to do the same in PySpark.
Also, the current implementation of passing data from the Java executor to the Python executor
is not efficient; there is an opportunity to speed it up using Apache Arrow. This feature can
enable use cases that combine Spark's grouping operators, such as groupBy, rollUp, cube, and window,
with Pandas's native apply operator.
> Related work:
> SPARK-13534
> This enables faster data serialization between PySpark and Pandas using Apache Arrow.
Our work will build on top of this and use the same serialization for Pandas UDFs.
> SPARK-12919 and SPARK-12922
> These implemented two functions, dapply and gapply, in SparkR, which implement a
split-apply-merge pattern similar to the one we want to implement in PySpark.
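
For readers unfamiliar with the split-apply-merge pattern described in the issue, here is a minimal
sketch in plain Python. It is only an illustration of the pattern, not the PySpark or Pandas API:
the helper names (split_apply_merge, subtract_group_mean) and the sample rows are assumptions made
up for this example, and no Arrow serialization is involved.

```python
from collections import defaultdict
from statistics import mean


def split_apply_merge(rows, key, func):
    """Group rows by `key`, apply `func` to each group, and concatenate the results.

    Hypothetical helper for illustration only; it mimics the shape of a
    groupBy().apply() call but is not part of Spark or Pandas.
    """
    # Split: bucket the rows by their grouping key.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Apply and merge: run the user function on each group and
    # concatenate the per-group outputs into one result.
    merged = []
    for group_rows in groups.values():
        merged.extend(func(group_rows))
    return merged


def subtract_group_mean(group_rows):
    """Example per-group function: center column 'v' on the group's mean."""
    avg = mean(r["v"] for r in group_rows)
    return [{**r, "v": r["v"] - avg} for r in group_rows]


# Made-up sample data with a grouping column 'id' and a value column 'v'.
rows = [
    {"id": 1, "v": 1.0},
    {"id": 1, "v": 2.0},
    {"id": 2, "v": 3.0},
    {"id": 2, "v": 5.0},
    {"id": 2, "v": 10.0},
]

result = split_apply_merge(rows, "id", subtract_group_mean)
```

The per-group function receives a whole group and returns a whole group, which is the same shape
the issue proposes for a Pandas UDF passed to groupBy().apply(); in that setting each group would
be a Pandas DataFrame rather than a list of dicts.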



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

