spark-issues mailing list archives

From "Reynold Xin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-8632) Poor Python UDF performance because of RDD caching
Date Sun, 20 Sep 2015 01:37:04 GMT

     [ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-8632:
-------------------------------
    Priority: Blocker  (was: Major)

> Poor Python UDF performance because of RDD caching
> --------------------------------------------------
>
>                 Key: SPARK-8632
>                 URL: https://issues.apache.org/jira/browse/SPARK-8632
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.4.0
>            Reporter: Justin Uang
>            Assignee: Davies Liu
>            Priority: Blocker
>
> {quote}
> We have been running into performance problems using Python UDFs with DataFrames at
> large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was to reuse
> the PythonRDD code. It caches the entire child RDD so that it can do two passes over
> the data: one to feed the PythonRDD, and one to join the Python lambda results with the
> original row (which may have Java objects that should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be processed
> by the Python UDF. In the cases I was working with, I had a 500-column table and wanted
> to use a Python UDF on one column, and it ended up caching all 500 columns.
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html
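
For illustration only, the two evaluation strategies discussed in the report can be modelled in plain Python. This is a rough sketch, not Spark code: the function names, the dict-based rows, the `udf_result` key and the `batch_size` parameter are all assumptions made for the example. The first function mimics the behaviour described above (materialise every row with all of its columns, then make two passes). The second shows one possible alternative that makes a single pass and buffers only one small batch of rows at a time.

```python
from collections import deque
from itertools import islice


def eval_udf_two_pass(rows, udf, col):
    """Model of the reported behaviour: cache everything, then pass twice."""
    cached = list(rows)                        # holds every row, all columns
    results = [udf(r[col]) for r in cached]    # pass 1: feed the UDF
    # pass 2: join each UDF result back onto its original row
    return [dict(r, udf_result=res) for r, res in zip(cached, results)]


def eval_udf_streaming(rows, udf, col, batch_size=2):
    """Hypothetical alternative: one pass, buffering one batch at a time."""
    it = iter(rows)
    while True:
        pending = deque(islice(it, batch_size))  # at most batch_size rows held
        if not pending:
            return
        results = [udf(r[col]) for r in pending]
        for res in results:
            # rows leave the buffer in the same order the results arrive
            yield dict(pending.popleft(), udf_result=res)


rows = [{"a": i, "b": str(i)} for i in range(5)]
two_pass = eval_udf_two_pass(rows, lambda x: x * 2, "a")
streamed = list(eval_udf_streaming(iter(rows), lambda x: x * 2, "a"))
```

Both functions produce the same joined rows. The difference is how much input is held in memory at once: the whole dataset in the first case, at most `batch_size` rows in the second.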



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


