spark-issues mailing list archives

From "Apache Spark (JIRA)" <>
Subject [jira] [Assigned] (SPARK-26203) Benchmark performance of In and InSet expressions
Date Tue, 11 Dec 2018 19:11:00 GMT


Apache Spark reassigned SPARK-26203:

    Assignee: Apache Spark

> Benchmark performance of In and InSet expressions
> -------------------------------------------------
>                 Key: SPARK-26203
>                 URL:
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Anton Okolnychyi
>            Assignee: Apache Spark
>            Priority: Major
> {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible values exceeds
"spark.sql.optimizer.inSetConversionThreshold" and all values are literals. This was done
for performance reasons to avoid O(n) time complexity for {{In}}.
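
The conversion condition described above can be sketched roughly as follows. This is a minimal illustration in plain Java, not Spark's actual rule: the class InSetConversionSketch, the Item type, and the method shouldConvertToInSet are hypothetical names, and the threshold is passed in as a parameter rather than read from "spark.sql.optimizer.inSetConversionThreshold".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these names are illustrative and are not Spark classes.
final class InSetConversionSketch {

    /** Simplified stand-in for one element of an IN list. */
    static final class Item {
        final Object value;
        final boolean literal; // true if the element is a constant, false if it is e.g. a column

        Item(Object value, boolean literal) {
            this.value = value;
            this.literal = literal;
        }
    }

    /**
     * Returns true when the list would be rewritten to a set lookup:
     * the number of values exceeds the threshold and every value is a literal.
     */
    static boolean shouldConvertToInSet(List<Item> items, int threshold) {
        if (items.size() <= threshold) {
            return false; // short lists keep the element-by-element In check
        }
        for (Item item : items) {
            if (!item.literal) {
                return false; // a non-literal element blocks the conversion
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int threshold = 10; // assumed value, for illustration only
        List<Item> items = new ArrayList<>();
        for (int i = 0; i < 11; i++) {
            items.add(new Item(i, true));
        }
        System.out.println("11 literals: " + shouldConvertToInSet(items, threshold));
        items.add(new Item("someColumn", false));
        System.out.println("with a non-literal: " + shouldConvertToInSet(items, threshold));
    }
}
```

The strict "greater than" comparison follows the word "exceeds" in the description; whether the real rule applies further conditions is not stated here.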
> The original optimization was done in SPARK-3711. A lot has changed since then (e.g.,
Java code is now generated to evaluate expressions), so it is worth measuring the performance
of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than {{In}}
due to autoboxing and other issues.
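
To make the autoboxing concern concrete, here is a small standalone Java sketch contrasting the two lookup styles on int values. It is an assumption-laden illustration, not the code Spark generates: InVersusInSetSketch and its methods are hypothetical names, and the timing loop in main is a rough probe without proper warm-up, so its numbers should not be read as benchmark results.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: illustrates the boxing cost, not Spark's generated code.
final class InVersusInSetSketch {

    /** In-style check: compare the primitive with each candidate. O(n), no boxing. */
    static boolean linearContains(int[] candidates, int value) {
        for (int c : candidates) {
            if (c == value) {
                return true;
            }
        }
        return false;
    }

    /** InSet-style check: expected O(1) hash lookup, but the int is boxed on every call. */
    static boolean setContains(Set<Integer> candidates, int value) {
        // Autoboxing happens here via Integer.valueOf(value); values outside the
        // small-integer cache (-128..127 by default) allocate a new Integer.
        return candidates.contains(value);
    }

    /** Builds the boxed set once, up front. */
    static Set<Integer> toSet(int[] candidates) {
        Set<Integer> set = new HashSet<>();
        for (int c : candidates) {
            set.add(c);
        }
        return set;
    }

    public static void main(String[] args) {
        int[] candidates = new int[20];
        for (int i = 0; i < candidates.length; i++) {
            candidates[i] = 1000 + i;
        }
        Set<Integer> set = toSet(candidates);

        int hits = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < 5_000_000; i++) {
            if (linearContains(candidates, 1000 + (i % 40))) hits++;
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < 5_000_000; i++) {
            if (setContains(set, 1000 + (i % 40))) hits++;
        }
        long t2 = System.nanoTime();

        System.out.println("linear scan: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("boxed set:   " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("hits: " + hits); // printed so the loops are not dead code
    }
}
```

For trustworthy measurements a dedicated harness with warm-up and multiple iterations would be needed; this sketch only shows where the boxing occurs.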
> The scope of this JIRA is to benchmark every supported data type inside {{In}} and {{InSet}}
and outline existing bottlenecks. Once we have this information, we can come up with solutions.

> Based on my preliminary investigation, several optimizations are possible, many of which
depend on the specific data type.
