From: hvanhovell
To: reviews@spark.apache.org
Subject: [GitHub] spark pull request #15763: [SPARK-17348][SQL] Incorrect results from subquer...
Message-Id: <20161107201818.66CF0E1772@git1-us-west.apache.org>
Date: Mon, 7 Nov 2016 20:18:18 +0000 (UTC)

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15763#discussion_r86859510

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1044,6 +1044,34 @@ class Analyzer(
     failOnOuterReference(p)
     p
   }
+
+    // SPARK-17348
+    // Looking for a potential incorrect result case.
+    // When a correlated predicate is a non-equality predicate
+    // it must be placed at the immediate child operator.
+    // Otherwise, the pull up of the correlated predicate
+    // will generate a plan with a different semantics
+    // which could return incorrect result.
+    var continue : Boolean = true
--- End diff --

@nsyca thanks for the explanation of how DB2 works with subqueries. A different perspective or approach can be very helpful; we all suffer from myopia at some point. It most certainly has merit to add a general node for subquery processing to Spark. Do you have time to work on this for Spark 2.2?

I would also like to take the opportunity to explain why we do so much rewriting during analysis. We wanted to support the following use case:

```sql
-- hive: subquery_exists_having.q
select b.key, min(b.value)
from src b
group by b.key
having exists (
  select a.key
  from src a
  where a.value > 'val_9' and a.value = min(b.value)
)
```

The difficulty here is that we need to evaluate `min(b.value)` in the aggregate, so we needed a way to extract the entire `min(b.value)` expression. The most straightforward way was to extract the entire predicate and rewrite the tree in the process. This is quite an aggressive approach, and it breaks as soon as you cannot or should not move the predicate.

In hindsight it might have been better to isolate the entire outer expression instead of only isolating the outer reference, and to do the rewriting in a later stage.
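As an illustration of the concern in the quoted diff comment (pulling up a non-equality correlated predicate can change the semantics of the plan), the following is a minimal sketch, not code from the pull request. It assumes the predicate sits below an aggregate, uses plain Scala collections in place of Spark tables, and every name in it (`PullUpDemo`, `Row`, `t1`, `t2`, `correlated`, `pulledUp`) and the query in its comments are hypothetical.

```scala
// Plain Scala collections stand in for tables; this does not use Spark or Catalyst.
object PullUpDemo {
  final case class Row(c1: Int, c2: Int)

  // Hypothetical sample data.
  val t1: Seq[Row] = Seq(Row(1, 1))
  val t2: Seq[Row] = Seq(Row(1, 1), Row(2, 0))

  // Intended semantics of the hypothetical query
  //   select c1 from t1
  //   where c1 in (select max(t2.c1) from t2 where t1.c2 >= t2.c2)
  // The correlated predicate filters t2 *before* max() is computed.
  def correlated: Seq[Int] =
    t1.filter { o =>
      val matching = t2.filter(i => o.c2 >= i.c2).map(_.c1)
      // max() over an empty input is NULL in SQL, so IN cannot be true.
      matching.nonEmpty && o.c1 == matching.max
    }.map(_.c1)

  // Naive pull-up: the predicate is evaluated above the aggregate, so the
  // aggregate has to group by t2.c2 to keep that column available.
  // max() is now computed per t2.c2 group, which is a different question.
  def pulledUp: Seq[Int] = {
    val maxPerC2: Map[Int, Int] =
      t2.groupBy(_.c2).map { case (c2, rows) => c2 -> rows.map(_.c1).max }
    t1.filter { o =>
      maxPerC2.exists { case (c2, maxC1) => o.c1 == maxC1 && o.c2 >= c2 }
    }.map(_.c1)
  }

  def main(args: Array[String]): Unit = {
    println(s"correlated evaluation: $correlated")
    println(s"after naive pull-up:   $pulledUp")
  }
}
```

With this sample data the two evaluations disagree: the correlated form selects no rows, while the pulled-up form selects the single row of `t1`. With an equality predicate, grouping by the inner column and joining on it reproduces the same per-outer-row filter, which is why the concern is specific to non-equality predicates.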