Mailing-List: contact reviews-help@spark.apache.org; run by ezmlm
Precedence: bulk
From: hvanhovell <git@git.apache.org>
To: reviews@spark.apache.org
Reply-To: reviews@spark.apache.org
References: <git-pr-15763-spark@git.apache.org>
In-Reply-To: <git-pr-15763-spark@git.apache.org>
Subject: [GitHub] spark pull request #15763: [SPARK-17348][SQL] Incorrect results from subquer...
Content-Type: text/plain
Message-Id: <20161107201818.66CF0E1772@git1-us-west.apache.org>
Date: Mon,  7 Nov 2016 20:18:18 +0000 (UTC)
archived-at: Mon, 07 Nov 2016 20:18:20 -0000

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15763#discussion_r86859510
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1044,6 +1044,34 @@ class Analyzer(
               failOnOuterReference(p)
               p
           }
    +
    +      // SPARK-17348
    +      // Looking for a potential incorrect result case.
    +      // When a correlated predicate is a non-equality predicate
    +      // it must be placed at the immediate child operator.
    +      // Otherwise, the pull up of the correlated predicate
    +      // will generate a plan with a different semantics
    +      // which could return incorrect result.
    +      var continue : Boolean = true
    --- End diff --
    
    @nsyca thanks for the explanation of how DB2 works with subqueries. A different perspective or approach can be very helpful; we all suffer from myopia at some point. It most certainly has merit to add a general node for subquery processing to Spark. Do you have time to work on this for Spark 2.2?
    
    I would also like to take the opportunity to explain why we do so much rewriting during analysis. We wanted support the following use case:
    ```sql
    -- hive: subquery_exists_having.q
    select b.key, min(b.value)
    from src b
    group by b.key
    having exists ( select a.key
                    from src a
                    where a.value > 'val_9' and a.value = min(b.value)
                    )
    ```
    The difficulty here is that we need to evaluate the `min(b.value)` in the aggregate. So we needed a way to extract the entire `min(b.value)` expression. The most straightforward way was to extract the entire predicate and rewrite the tree in the process. This is quite an aggressive approach, and it breaks as soon as you cannot/should not move the predicate. In hindsight it might have been better to isolate the entire outer expression instead of only isolating the outer reference, and to do the rewriting in a later stage.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org