spark-issues mailing list archives

From "DB Tsai (JIRA)" <>
Subject [jira] [Created] (SPARK-24371) Added isinSet in DataFrame API for Scala and Java.
Date Wed, 23 May 2018 20:43:01 GMT
DB Tsai created SPARK-24371:

             Summary: Added isinSet in DataFrame API for Scala and Java.
                 Key: SPARK-24371
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: DB Tsai
            Assignee: DB Tsai
             Fix For: 2.4.0

Implemented *{{isinSet}}* in the DataFrame API for both Scala and Java, so users can write:
 // Assumes a SparkSession named `spark`; the import enables toDF and the $"..." syntax.
 import spark.implicits._

 val profileDF = Seq(
   Some(1), Some(2), Some(3), Some(4),
   Some(5), Some(6), Some(7), None
 ).toDF("profileID")

 val validUsers: Set[Any] = Set(6, 7.toShort, 8L, "3")

 val result = profileDF.withColumn("isValid", $"profileID".isinSet(validUsers))
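
The null handling of the membership test can be modelled without Spark. The sketch below is only an illustration: `isInSetModel` is a hypothetical helper, not a Spark API, `None` stands in for SQL NULL, and the set is assumed to hold plain `Int` values with no nulls (the mixed-type `Set[Any]` above and any casting Spark applies to it are not modelled).

```scala
// Plain-Scala model of a SQL-style IN test over a nullable column (no Spark required).
// `isInSetModel` is a hypothetical helper for illustration, not a Spark API.
object IsInSetSketch {
  // None models SQL NULL; Some(b) models a definite true/false.
  // A NULL input yields NULL, otherwise the result is plain set membership.
  def isInSetModel(value: Option[Int], set: Set[Int]): Option[Boolean] =
    value.map(set.contains)

  // Same rows as the profileID column in the example.
  val profileIds: Seq[Option[Int]] =
    Seq(Some(1), Some(2), Some(3), Some(4), Some(5), Some(6), Some(7), None)

  // Assumption: a homogeneous Int set without null elements.
  val validUsers: Set[Int] = Set(6, 7, 8)

  // One result per row, playing the role of the "isValid" column.
  val isValid: Seq[Option[Boolean]] = profileIds.map(isInSetModel(_, validUsers))
}
```

In this model only the rows holding 6 and 7 evaluate to true, and the null row stays null rather than becoming false.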


Two new rules are added to the logical plan optimizer.

# When there is only one element in the *{{Set}}*, the physical plan will be optimized to
*{{EqualTo}}*, so predicate pushdown can be used.
 profileDF.filter($"profileID".isinSet(Set(6))).explain(true)
 == Physical Plan ==
 *(1) Project [profileID#0]
 +- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))
    +- *(1) FileScan parquet [profileID#0] Batched: true, Format: Parquet,
         PartitionFilters: [],
         PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],
         ReadSchema: struct<profileID:int>

# When the *{{Set}}* is empty and the input is nullable, the logical plan will be simplified to
 profileDF.filter($"profileID".isinSet(Set())).explain(true)
 == Optimized Logical Plan ==
 Filter if (isnull(profileID#0)) null else false
 +- Relation[profileID#0] parquet
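
The two rewrites above can be sketched on a toy expression tree. This is a minimal illustration under stated assumptions, not the actual Catalyst rules: every type here (`Expr`, `Attr`, `InSet`, `simplify`, and so on) is hypothetical and only mirrors the shapes visible in the plans. The issue does not say what happens to an empty set on a non-nullable input, so the sketch leaves that case unchanged.

```scala
// Toy model of the two rewrites; all names are hypothetical, not Catalyst classes.
object InSetRulesSketch {
  sealed trait Expr
  case class Attr(name: String, nullable: Boolean) extends Expr
  case class Lit(value: Any) extends Expr
  case object NullLit extends Expr
  case object FalseLit extends Expr
  case class InSet(child: Expr, values: Set[Any]) extends Expr
  case class EqualTo(left: Expr, right: Expr) extends Expr
  case class IsNull(child: Expr) extends Expr
  case class If(cond: Expr, whenTrue: Expr, whenFalse: Expr) extends Expr

  def simplify(e: Expr): Expr = e match {
    // Rule 1: a single-element set becomes an equality test,
    // which a data source can accept as a pushed-down filter.
    case InSet(child, values) if values.size == 1 =>
      EqualTo(child, Lit(values.head))
    // Rule 2: an empty set on a nullable input becomes
    // `if (isnull(x)) null else false`.
    case InSet(a @ Attr(_, true), values) if values.isEmpty =>
      If(IsNull(a), NullLit, FalseLit)
    // Everything else is left as-is (including the non-nullable empty-set
    // case, which the issue does not describe).
    case other => other
  }
}
```

For instance, `simplify(InSet(Attr("profileID", nullable = true), Set(6)))` yields `EqualTo(Attr("profileID", true), Lit(6))` in this model.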

TODO:

 # For multiple conditions whose count is below a certain threshold, we should still allow
predicate pushdown.
 # Optimize `In` using tableswitch or lookupswitch when the number of categories
is low and they are `Int` or `Long`.
 # The default immutable hash-tree set is slow to query; we should benchmark
different set implementations for faster lookups.
 # `filter(if (condition) null else false)` can be optimized to `false`.
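
The reasoning behind the last item is that a filter keeps a row only when its predicate is definitely true, so a predicate that can only produce null or false keeps nothing. A small plain-Scala sketch of that argument follows; `sqlFilter` and the other names are hypothetical helpers, and `None` again models SQL NULL.

```scala
// Plain-Scala model of why `filter(if (cond) null else false)` behaves like `filter(false)`.
// All names are hypothetical; None stands in for SQL NULL.
object FilterNullSketch {
  // SQL-style filter: a row survives only when the predicate is definitely true.
  def sqlFilter[A](rows: Seq[A])(pred: A => Option[Boolean]): Seq[A] =
    rows.filter(r => pred(r).contains(true))

  val rows: Seq[Option[Int]] = Seq(Some(1), Some(6), None)

  // Predicate shaped like `if (isnull(x)) null else false`.
  val nullOrFalse: Option[Int] => Option[Boolean] =
    x => if (x.isEmpty) None else Some(false)

  // Predicate that is constantly false.
  val alwaysFalse: Option[Int] => Option[Boolean] = _ => Some(false)

  // Both filters drop every row, so the two predicates are interchangeable here.
  val keptByNullOrFalse: Seq[Option[Int]] = sqlFilter(rows)(nullOrFalse)
  val keptByAlwaysFalse: Seq[Option[Int]] = sqlFilter(rows)(alwaysFalse)
}
```

The equivalence holds only in filter position; as a projected column value, null and false remain distinguishable.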
