spark-issues mailing list archives

From "Xiao Li (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-20685) BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument
Date Wed, 10 May 2017 23:53:04 GMT

     [ https://issues.apache.org/jira/browse/SPARK-20685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Xiao Li resolved SPARK-20685.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0
                   2.2.1
                   2.1.2

> BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-20685
>                 URL: https://issues.apache.org/jira/browse/SPARK-20685
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>             Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> There's a latent corner-case bug in PySpark UDF evaluation: executing a stage with a single UDF that takes more than one argument, _where the same argument is repeated_, crashes at execution time with a confusing error.
> Here's a repro:
> {code}
> from pyspark.sql.types import *
> spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType())
> spark.sql("SELECT add(1, 1)").first()
> {code}
> This fails with:
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line
180, in main
>     process()
>   File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line
175, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line
107, in <lambda>
>     func = lambda _, it: map(mapper, it)
>   File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line
93, in <lambda>
>     mapper = lambda a: udf(*a)
>   File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line
71, in <lambda>
>     return lambda *a: f(*a)
> TypeError: <lambda>() takes exactly 2 arguments (1 given)
> {code}
> The problem was introduced by SPARK-14267: the code there has a fast path for handling a batch UDF evaluation consisting of a single Python UDF, but that branch incorrectly assumes that a single UDF won't have repeated arguments and therefore skips the code for unpacking arguments from the input row (whose schema may not necessarily match the UDF inputs).
> I have a simple fix for this which I'll submit now.
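[Editorial note: the failure mode described above can be sketched in plain Python. This is a hypothetical, simplified model of the worker's mapper logic, not actual Spark code; names like `fast_path_mapper` and `arg_offsets` are illustrative assumptions.]

```python
# A repeated expression such as add(1, 1) gets deduplicated by the planner,
# so the projected input row holds only ONE column even though the UDF
# expects two arguments.

add = lambda x, y: x + y

# Buggy fast path: assumes the row's columns line up 1:1 with the UDF's
# arguments and splats the whole row directly into the call.
fast_path_mapper = lambda row: add(*row)

assert fast_path_mapper((1, 2)) == 3  # distinct arguments: works fine

try:
    fast_path_mapper((1,))  # deduplicated row: only one column
except TypeError:
    pass  # mirrors the TypeError in the traceback above

# Offset-based unpacking (a sketch of the corrected behavior): select each
# argument by its offset into the row, so a repeated argument reads the
# same deduplicated column twice.
arg_offsets = [0, 0]
offset_mapper = lambda row: add(*[row[o] for o in arg_offsets])

assert offset_mapper((1,)) == 2
```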



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

