spark-issues mailing list archives

From "Jacob Eisinger (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-17728) UDFs are run too many times
Date Thu, 29 Sep 2016 17:45:21 GMT

     [ https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Jacob Eisinger updated SPARK-17728:
-----------------------------------
    Description: 
h3. Background
UDFs are very useful in Spark. In particular, they can wrap longer-running processes that
run analytics or contact external services. The response might not be a single field, but
a structure of information. When breaking this information out into separate columns, it
is critical that the query is optimized correctly.

h3. Steps to Reproduce
# Create some sample data.
# Create a UDF that returns multiple attributes.
# Run the UDF over the sample data.
# Create new columns from the multiple attributes.
# Observe run time.
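
The steps above can be modeled with a toy sketch in plain Python. This is not Spark code: {{expensive_udf}}, the attribute names, and the per-column lambdas are hypothetical, and the sketch assumes that each new column re-evaluates the UDF expression instead of sharing one result per row.

```python
from collections import Counter

calls = Counter()  # how many times the "UDF" ran for each input value


def expensive_udf(value):
    """Hypothetical stand-in for a slow UDF that returns several attributes."""
    calls[value] += 1
    return {"doubled": value * 2, "squared": value * value}


# Step 1: sample data.
rows = [1, 2, 3]

# Steps 2-4: every derived column is its own expression over the UDF call.
# This models (as an assumption) a plan in which the call is repeated
# inside each column expression.
column_exprs = {
    "doubled": lambda v: expensive_udf(v)["doubled"],
    "squared": lambda v: expensive_udf(v)["squared"],
}

result = [{name: expr(v) for name, expr in column_exprs.items()} for v in rows]

# Step 5: instead of timing the run, count the calls made for each row.
calls_per_row = dict(calls)
print(result)
print(calls_per_row)
```

In this model the call count per row equals the number of derived columns, which is the kind of repeated execution the report describes.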

h3. Actual Results
The UDF is executed multiple times **per row.**

h3. Expected Results
The UDF should only be executed once **per row.**

h3. Workaround
Cache the Dataset after UDF execution.
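
As a rough illustration of why caching helps, the plain-Python sketch below stores the UDF output once per row and builds the new columns from the stored value. It is a toy model rather than Spark code: {{expensive_udf}} and the attribute names are hypothetical, and the list {{cached}} only stands in for the cached Dataset.

```python
calls = {}  # how many times the "UDF" ran for each input value


def expensive_udf(value):
    """Hypothetical stand-in for a slow UDF that returns several attributes."""
    calls[value] = calls.get(value, 0) + 1
    return {"doubled": value * 2, "squared": value * value}


rows = [1, 2, 3]

# Materialize the UDF output once per row. In this model, the list plays
# the role of the Dataset that is cached right after the UDF runs.
cached = [(v, expensive_udf(v)) for v in rows]

# Derive the new columns from the stored output, not from a fresh call.
result = [
    {"value": v, "doubled": out["doubled"], "squared": out["squared"]}
    for v, out in cached
]

print(result)
print(calls)
```

Because the columns read from the stored output, the number of calls no longer grows with the number of derived columns.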

h3. Details
For code and more details, see !

  was:
h3. Background
UDFs are very useful in Spark. In particular, they can wrap longer-running processes that
run analytics or contact external services. The response might not be a single field, but
a structure of information. When breaking this information out into separate columns, it
is critical that the query is optimized correctly.

h3. Steps to Reproduce
# Create some sample data.
# Create a UDF that returns multiple attributes.
# Run the UDF over the sample data.
# Create new columns from the multiple attributes.
# Observe run time.

h3. Actual Results
The UDF is executed multiple times **per row.**

h3. Expected Results
The UDF should only be executed once **per row.**

h3. Workaround
Cache the Dataset after UDF execution.

h3. Details
See attached Databricks Notebook.


> UDFs are run too many times
> ---------------------------
>
>                 Key: SPARK-17728
>                 URL: https://issues.apache.org/jira/browse/SPARK-17728
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0
>         Environment: Databricks Cloud / Spark 2.0.0
>            Reporter: Jacob Eisinger
>            Priority: Minor
>         Attachments: Defect - Over Optimized UDF.html
>
>




