spark-reviews mailing list archives

From aokolnychyi <...@git.apache.org>
Subject [GitHub] spark issue #18692: [SPARK-21417][SQL] Detect join conditions via filter ex...
Date Mon, 31 Jul 2017 20:20:35 GMT
Github user aokolnychyi commented on the issue:

    https://github.com/apache/spark/pull/18692
  
    @gatorsmile I took a look at the case above. Indeed, the proposed rule triggers this issue, but only indirectly: in the example above, the optimizer never reaches a fixed point. Please find my investigation below.
    
    ```
    ...

    // The new rule infers correct join predicates
    Join Inner, ((col2#33 = col#34) && (col1#32 = col#34))
    :- Filter ((col1#32 = col2#33) && (col1#32 = 1))
    :  +- Relation[col1#32,col2#33] parquet
    +- Filter (col#34 = 1)
       +- Relation[col#34] parquet

    // InferFiltersFromConstraints adds more filters
    Join Inner, ((col2#33 = col#34) && (col1#32 = col#34))
    :- Filter ((((col2#33 = 1) && isnotnull(col1#32)) && isnotnull(col2#33)) && ((col1#32 = col2#33) && (col1#32 = 1)))
    :  +- Relation[col1#32,col2#33] parquet
    +- Filter (isnotnull(col#34) && (col#34 = 1))
       +- Relation[col#34] parquet

    // ConstantPropagation is applied
    Join Inner, ((col2#33 = col#34) && (col1#32 = col#34))
    :- Filter ((((col2#33 = 1) && isnotnull(col2#33)) && isnotnull(col1#32)) && ((1 = col2#33) && (col1#32 = 1)))
    :  +- Relation[col1#32,col2#33] parquet
    +- Filter (isnotnull(col#34) && (col#34 = 1))
       +- Relation[col#34] parquet

    // (Important) InferFiltersFromConstraints infers (col1#32 = col2#33), which is added to the join condition
    Join Inner, ((col1#32 = col2#33) && ((col2#33 = col#34) && (col1#32 = col#34)))
    :- Filter ((((col2#33 = 1) && isnotnull(col2#33)) && isnotnull(col1#32)) && ((1 = col2#33) && (col1#32 = 1)))
    :  +- Relation[col1#32,col2#33] parquet
    +- Filter (isnotnull(col#34) && (col#34 = 1))
       +- Relation[col#34] parquet

    // PushPredicateThroughJoin pushes down (col1#32 = col2#33) and then CombineFilters produces
    Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
    :- Filter ((((isnotnull(col1#32) && (col2#33 = 1)) && isnotnull(col2#33)) && ((1 = col2#33) && (col1#32 = 1))) && (col2#33 = col1#32))
    :  +- Relation[col1#32,col2#33] parquet
    +- Filter (isnotnull(col#34) && (col#34 = 1))
       +- Relation[col#34] parquet
    ```
    After that, `ConstantPropagation` replaces `(col2#33 = col1#32)` with `(1 = 1)`, `BooleanSimplification` removes `(1 = 1)`, `InferFiltersFromConstraints` infers `(col2#33 = col1#32)` again, and the procedure repeats forever. Since `InferFiltersFromConstraints` is the last optimization rule, we end up with the redundant condition you mentioned. Even without the new rule, the optimizer also fails to converge on the following query:
    
    ```
    Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
    Seq(1, 2).toDF("col").write.saveAsTable("t2")
    spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col").explain(true)
    ```
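    Running this snippet, the optimizer keeps re-deriving and re-eliminating the same predicate on every pass. Assuming default settings, it should stop only once it exhausts the `spark.sql.optimizer.maxIterations` cap (100 by default), at which point `RuleExecutor` gives up on the batch and logs a "Max iterations reached" warning rather than looping forever.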
    Correct me if I am wrong, but it seems like an issue with the existing rules.
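    To make the failure mode concrete, here is a toy, self-contained sketch in plain Scala (not Catalyst code; `Eq`, `flip`, and `runToFixedPoint` are made-up stand-ins). Catalyst's `RuleExecutor` reruns a batch until a pass leaves the plan unchanged, so when the net effect of one pass is a cycle, as with the `ConstantPropagation` / `BooleanSimplification` / `InferFiltersFromConstraints` interplay above, only the iteration cap ever stops it:
    ```
    // Toy model only -- none of this is Spark code. The "plan" is a single
    // equality predicate, and `flip` stands in for the net effect of one
    // optimizer pass, which above keeps re-deriving the same predicate in a
    // different shape ((col1#32 = col2#33) vs (col2#33 = col1#32)).
    case class Eq(left: String, right: String)

    def flip(plan: Eq): Eq = Eq(plan.right, plan.left)

    // Mimics RuleExecutor's fixed-point loop: stop when a pass changes
    // nothing, or when the iteration cap is exhausted (cf.
    // spark.sql.optimizer.maxIterations, 100 by default).
    def runToFixedPoint(start: Eq, maxIterations: Int = 100): Eq = {
      var plan = start
      for (_ <- 1 to maxIterations) {
        val next = flip(plan)
        if (next == plan) return plan // fixed point reached
        plan = next
      }
      println(s"Max iterations ($maxIterations) reached") // cf. RuleExecutor's warning
      plan
    }

    // The plan oscillates between the two operand orders forever, so the
    // loop always burns through all 100 iterations and gives up.
    runToFixedPoint(Eq("col1#32", "col2#33"))
    ```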

