spark-issues mailing list archives

From "Takeshi Yamamuro (JIRA)"
Subject [jira] [Commented] (SPARK-20312) query optimizer calls udf with null values when it doesn't expect them
Date Thu, 27 Apr 2017 08:30:04 GMT


Takeshi Yamamuro commented on SPARK-20312:

I checked and could not reproduce this issue on the current master. IIUC, it was fixed by SPARK-20359.
That fix has already been applied to branch-2.1.

> query optimizer calls udf with null values when it doesn't expect them
> ----------------------------------------------------------------------
>                 Key: SPARK-20312
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Albert Meltzer
> When optimizing an outer join, Spark passes an empty row to both sides to see if nulls would be ignored (side comment: for half-outer joins it subsequently ignores the assessment on the dominant side).
> For some reason, a condition such as {{xx IS NOT NULL && udf(xx) IS NOT NULL}} might be evaluated with the right side first, which raises an exception if the UDF doesn't expect a null input (it assumes the left side has been checked first).
> An example is SIMILAR to the following (see actual query plans separately):
> {noformat}
> def func(value: Any): Int = ... // returns an AnyVal, which probably causes an IS NOT NULL filter to be added on the result
> val df1 = sparkSession
>   .table(...)
>   .select("col1", "col2") // LongType both
> val df11 = df1
>   .filter(df1("col1").isNotNull)
>   .withColumn("col3", functions.udf(func)(df1("col1")))
>   .repartition(df1("col2"))
>   .sortWithinPartitions(df1("col2"))
> val df2 = ... // load other data containing col2, similarly repartition and sort
> val df3 =
>   df11.join(df2, Seq("col2"), "left_outer")
> {noformat}
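
On versions that do not yet include the fix, one possible workaround is to make the function itself tolerate null input, so that it no longer matters which half of the condition is evaluated first. The following is only a minimal sketch in plain Scala under stated assumptions: `NullSafe.lift` is a hypothetical helper (not a Spark API), and `func` here is a stand-in for the reporter's function, assumed to take a boxed `java.lang.Long`.

```scala
// Hypothetical helper, not part of Spark: wraps a function that cannot
// handle null so that null input yields None instead of an exception.
object NullSafe {
  def lift[A <: AnyRef, B](f: A => B): A => Option[B] =
    (a: A) => Option(a).map(f) // Option(null) is None, so f is never called with null
}

object NullSafeExample extends App {
  // Stand-in for the reporter's `func` (assumption): throws a
  // NullPointerException when called with null.
  val func: java.lang.Long => Int = v => v.intValue() + 1

  val safeFunc: java.lang.Long => Option[Int] = NullSafe.lift(func)

  println(safeFunc(java.lang.Long.valueOf(41L))) // prints Some(42)
  println(safeFunc(null))                        // prints None
}
```

A function wrapped this way could then be passed to {{functions.udf}} in place of the original; whether the resulting column is treated as nullable by the optimizer should be checked against the Spark version in use.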

This message was sent by Atlassian JIRA