spark-issues mailing list archives

From "Tim Sell (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-17100) pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException
Date Wed, 17 Aug 2016 03:52:20 GMT
Tim Sell created SPARK-17100:
--------------------------------

             Summary: pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException
                 Key: SPARK-17100
                 URL: https://issues.apache.org/jira/browse/SPARK-17100
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.0.0
         Environment: spark-2.0.0-bin-hadoop2.7. Python2 and Python3.
            Reporter: Tim Sell


In PySpark, when filtering on a UDF-derived column after certain join types,
computing the optimized logical plan raises a java.lang.UnsupportedOperationException.

I could not reproduce this in Scala code from the shell, only in Python. It is a PySpark
regression from Spark 1.6.2.

This can be replicated with: bin/spark-submit bug.py

{code:python:title=bug.py}
import pyspark.sql.functions as F
from pyspark.sql import Row, SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.appName("test").getOrCreate()
    left = spark.createDataFrame([Row(a=1)])
    right = spark.createDataFrame([Row(a=1)])
    df = left.join(right, on='a', how='left_outer')
    # Add a UDF-derived column, then filter on it: after the left outer join
    # this combination triggers the exception during optimization.
    df = df.withColumn('b', F.udf(lambda x: 'x')(df.a))
    df = df.filter('b = "x"')
    df.explain(extended=True)
{code}

The output is:
{code}
== Parsed Logical Plan ==
'Filter ('b = x)
+- Project [a#0L, <lambda>(a#0L) AS b#8]
   +- Project [a#0L]
      +- Join LeftOuter, (a#0L = a#3L)
         :- LogicalRDD [a#0L]
         +- LogicalRDD [a#3L]

== Analyzed Logical Plan ==
a: bigint, b: string
Filter (b#8 = x)
+- Project [a#0L, <lambda>(a#0L) AS b#8]
   +- Project [a#0L]
      +- Join LeftOuter, (a#0L = a#3L)
         :- LogicalRDD [a#0L]
         +- LogicalRDD [a#3L]

== Optimized Logical Plan ==
java.lang.UnsupportedOperationException: Cannot evaluate expression: <lambda>(input[0,
bigint, true])
== Physical Plan ==
java.lang.UnsupportedOperationException: Cannot evaluate expression: <lambda>(input[0,
bigint, true])
{code}


It fails when the join is:

* how='outer', on=column expression
* how='left_outer', on=string or column expression
* how='right_outer', on=string or column expression

It passes when the join is:

* how='inner', on=string or column expression
* how='outer', on=string

I wrote some tests to demonstrate each of these cases.

Run them with: bin/spark-submit test_bug.py
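
The test_bug.py script itself is not included in this message. Below is a minimal sketch of what such a harness could look like; it is my own illustration, not the reporter's actual tests, and the case list, column handling, and pass/fail classification via collect() are assumptions.

{code:python:title=test_bug_sketch.py}
# Illustrative harness: loops over the join type / join key combinations
# listed above and reports whether filtering on the UDF column succeeds.
import pyspark.sql.functions as F
from pyspark.sql import Row, SparkSession

CASES = [
    ('inner', 'string'), ('inner', 'expr'),
    ('outer', 'string'), ('outer', 'expr'),
    ('left_outer', 'string'), ('left_outer', 'expr'),
    ('right_outer', 'string'), ('right_outer', 'expr'),
]

if __name__ == '__main__':
    spark = SparkSession.builder.appName("test").getOrCreate()
    left = spark.createDataFrame([Row(a=1)])
    right = spark.createDataFrame([Row(a=1)])
    for how, on_kind in CASES:
        on = 'a' if on_kind == 'string' else (left.a == right.a)
        df = left.join(right, on=on, how=how)
        # With an expression join both sides keep an 'a' column, so reference
        # left's column to avoid an ambiguous name; with a string join the
        # joined DataFrame has a single 'a'.
        key = df['a'] if on_kind == 'string' else left['a']
        df = df.withColumn('b', F.udf(lambda x: 'x')(key))
        df = df.filter('b = "x"')
        try:
            df.collect()  # forces optimization, where the exception is raised
            print("PASS: how=%s, on=%s" % (how, on_kind))
        except Exception as e:
            print("FAIL: how=%s, on=%s (%s)" % (how, on_kind, type(e).__name__))
{code}

If the cases above hold, the left_outer, right_outer, and outer-with-expression combinations should print FAIL, and the rest should print PASS.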



