spark-reviews mailing list archives

From hvanhovell <...@git.apache.org>
Subject [GitHub] spark pull request #15763: [SPARK-17348][SQL] Incorrect results from subquer...
Date Mon, 07 Nov 2016 20:18:18 GMT
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15763#discussion_r86859510
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1044,6 +1044,34 @@ class Analyzer(
               failOnOuterReference(p)
               p
           }
    +
    +      // SPARK-17348
    +      // Looking for a potential incorrect result case.
    +      // When a correlated predicate is a non-equality predicate
    +      // it must be placed at the immediate child operator.
    +      // Otherwise, the pull up of the correlated predicate
    +      // will generate a plan with a different semantics
    +      // which could return incorrect result.
    +      var continue : Boolean = true
    --- End diff --
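
    To make the concern in the comment above concrete, here is a minimal sketch of the kind of shape this check guards against. The tables `t1(c1, c2)` and `t2(c1, c2)` are hypothetical and not taken from the PR or the JIRA:
    ```sql
    -- The correlated predicate t2.c2 > t1.c2 is a non-equality predicate and
    -- it sits below the max() aggregate rather than at the operator the
    -- predicate would be pulled up to.
    select t1.c1
    from   t1
    where  t1.c1 in (select max(t2.c1)
                     from   t2
                     where  t2.c2 > t1.c2)
    ```
    Evaluating `t2.c2 > t1.c2` only after the aggregate changes which rows feed `max(t2.c1)`, so a plan that pulls the predicate above the aggregate is not equivalent to the original query.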
    
    @nsyca thanks for the explanation of how DB2 handles subqueries. A different perspective or approach can be very helpful; we all suffer from myopia at some point. Adding a general node for subquery processing to Spark most certainly has merit. Do you have time to work on this for Spark 2.2?
    
    I would also like to take this opportunity to explain why we do so much rewriting during analysis. We wanted to support the following use case:
    ```sql
    -- hive: subquery_exists_having.q
    select b.key, min(b.value)
    from src b
    group by b.key
    having exists ( select a.key
                    from src a
                    where a.value > 'val_9' and a.value = min(b.value)
                    )
    ```
    The difficulty here is that we need to evaluate `min(b.value)` in the outer aggregate, so we needed a way to extract the entire `min(b.value)` expression. The most straightforward way was to extract the whole predicate and rewrite the tree in the process. This is quite an aggressive approach, and it breaks as soon as you cannot, or should not, move the predicate. In hindsight it might have been better to isolate the entire outer expression instead of only the outer reference, and to do the rewriting at a later stage.
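
    To illustrate the current approach, the rewrite aims for roughly the following shape (a hand-written sketch with my own aliases `agg` and `min_value`, not the analyzer's actual output): `min(b.value)` is evaluated in the outer aggregate, and the extracted predicate becomes the condition of a left semi join.
    ```sql
    -- Sketch of a decorrelated form of the query above: the whole predicate
    -- a.value = min(b.value) has been pulled out of the subquery, and
    -- min(b.value) is now computed by the outer aggregate.
    select agg.key, agg.min_value
    from (
      select b.key, min(b.value) as min_value
      from src b
      group by b.key
    ) agg
    left semi join src a
      on a.value > 'val_9' and a.value = agg.min_value
    ```
    Whether `a.value > 'val_9'` stays as a filter on `a` or becomes part of the join condition does not matter much; the essential step is hoisting the correlated equality to a place where `min(b.value)` is available.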


