spark-issues mailing list archives

From "yahsuan, chang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-18712) support short circuit for sql expression
Date Mon, 05 Dec 2016 04:38:58 GMT

     [ https://issues.apache.org/jira/browse/SPARK-18712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yahsuan, chang updated SPARK-18712:
-----------------------------------
    Description: 
The following Python code fails with Spark 2.0.2 but works with Spark 1.5.2:

{code}
# a.py
import pyspark
import pyspark.sql.functions as F
import pyspark.sql.types as T

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)

table = {5: True, 6: False}
df = sqlc.range(10)
df = df.where(df['id'].isin(5, 6))
# the lambda raises KeyError for any id not in table (anything other
# than 5 and 6), so it relies on the isin() filter running first
f = F.udf(lambda x: table[x], T.BooleanType())
df = df.where(f(df['id']))
# df.explain(True)
print(df.count())
{code}

Here is the exception:
{code}
KeyError: 0
{code}

I suspect the problem is the evaluation order of the SQL expressions: KeyError: 0 indicates the UDF was invoked with id=0, a row the isin() filter should have removed first.
The following are the explain outputs for the two Spark versions:

{code}
# explain of spark 2.0.2
== Parsed Logical Plan ==
Filter <lambda>(id#0L)
+- Filter cast(id#0L as bigint) IN (cast(5 as bigint),cast(6 as bigint))
   +- Range (0, 10, step=1, splits=Some(4))

== Analyzed Logical Plan ==
id: bigint
Filter <lambda>(id#0L)
+- Filter cast(id#0L as bigint) IN (cast(5 as bigint),cast(6 as bigint))
   +- Range (0, 10, step=1, splits=Some(4))

== Optimized Logical Plan ==
Filter (id#0L IN (5,6) && <lambda>(id#0L))
+- Range (0, 10, step=1, splits=Some(4))

== Physical Plan ==
*Project [id#0L]
+- *Filter (id#0L IN (5,6) && pythonUDF0#5)
   +- BatchEvalPython [<lambda>(id#0L)], [id#0L, pythonUDF0#5]
      +- *Range (0, 10, step=1, splits=Some(4))
{code}
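
Reading the 2.0.2 physical plan: BatchEvalPython sits below the combined Filter, so the Python UDF appears to be evaluated for every row produced by Range (ids 0 through 9) before the IN predicate can filter anything, which would explain the KeyError: 0 (the lookup table[0]).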

{code}
# explain of spark 1.5.2
== Parsed Logical Plan ==
'Project [*,PythonUDF#<lambda>(id#0L) AS sad#1]
 Filter id#0L IN (cast(5 as bigint),cast(6 as bigint))
  LogicalRDD [id#0L], MapPartitionsRDD[3] at range at NativeMethodAccessorImpl.java:-2

== Analyzed Logical Plan ==
id: bigint, sad: int
Project [id#0L,sad#1]
 Project [id#0L,pythonUDF#2 AS sad#1]
  EvaluatePython PythonUDF#<lambda>(id#0L), pythonUDF#2
   Filter id#0L IN (cast(5 as bigint),cast(6 as bigint))
    LogicalRDD [id#0L], MapPartitionsRDD[3] at range at NativeMethodAccessorImpl.java:-2

== Optimized Logical Plan ==
Project [id#0L,pythonUDF#2 AS sad#1]
 EvaluatePython PythonUDF#<lambda>(id#0L), pythonUDF#2
  Filter id#0L IN (5,6)
   LogicalRDD [id#0L], MapPartitionsRDD[3] at range at NativeMethodAccessorImpl.java:-2

== Physical Plan ==
TungstenProject [id#0L,pythonUDF#2 AS sad#1]
 !BatchPythonEvaluation PythonUDF#<lambda>(id#0L), [id#0L,pythonUDF#2]
  Filter id#0L IN (5,6)
   Scan PhysicalRDD[id#0L]

Code Generation: true
{code}
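
In the 1.5.2 physical plan, by contrast, Filter id#0L IN (5,6) runs below BatchPythonEvaluation, so the UDF only ever sees ids 5 and 6. Until such ordering (or short-circuit evaluation) is guaranteed, one possible workaround, not from the original report, is to make the UDF total so it tolerates rows the other filter would have removed; a minimal sketch, continuing from a.py above:

{code}
# hypothetical workaround: use dict.get() with a default so the lambda
# never raises KeyError, whichever order Spark evaluates the filters in
f = F.udf(lambda x: table.get(x, False), T.BooleanType())
df = sqlc.range(10)
df = df.where(df['id'].isin(5, 6))
df = df.where(f(df['id']))
print(df.count())  # prints 1 -- only id=5 maps to True
{code}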

  was:
The following Python code fails with Spark 2.0.2 but works with Spark 1.5.2:

{code}
# a.py
import pyspark
import pyspark.sql.functions as F
import pyspark.sql.types as T

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)

table = {5: True, 6: False}
df = sqlc.range(10)
df = df.where(df['id'].isin(5, 6))
# df.cache()
f = F.udf(lambda x: table[x], T.BooleanType())
df = df.where(f(df['id']))
# df.explain(True)
print(df.count())
{code}

Here is the exception.

But it works if I uncomment df.cache().


> support short circuit for sql expression
> ----------------------------------------
>
>                 Key: SPARK-18712
>                 URL: https://issues.apache.org/jira/browse/SPARK-18712
>             Project: Spark
>          Issue Type: Wish
>          Components: SQL
>    Affects Versions: 2.0.2
>         Environment: Ubuntu 16.04
>            Reporter: yahsuan, chang
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

