spark-reviews mailing list archives

From nsyca <...@git.apache.org>
Subject [GitHub] spark issue #14411: [SPARK-16804][SQL] Correlated subqueries containing LIMI...
Date Mon, 01 Aug 2016 23:45:01 GMT
Github user nsyca commented on the issue:

    https://github.com/apache/spark/pull/14411
  
    @hvanhovell,
    
    First, my apologies for the delayed replies. I am travelling this week and only have
sporadic connectivity. Thank you for explaining the implementation and the reasoning
behind it. It is very helpful for a beginner like me.
    
    My bad! What I meant in my previous comment about rewriting subqueries to joins is actually
the moving of the correlated predicates from their original positions to
outside the scope of the subquery, specifically the call to the function pullOutCorrelatedPredicates().
I hope I got it right this time. I see this as one of the root causes of many problems.
Bear with me, I don't have a good solution yet as I am still familiarizing myself with the
code. Here is an example of the problem, in my opinion. With the rewrite, we cannot distinguish
between the EXISTS form and the IN form of the original SQL.
    
    select * from t1 where exists (select 1 from t2 where t1.c1=t2.c2)
    -and-
    select * from t1 where t1.c1 in (select t2.c2 from t2)
    
    are represented identically after the Analysis phase. This is not an issue because they are semantically
equivalent. However, when we add NOT, the queries
    
    select * from t1 where not exists (select 1 from t2 where t1.c1=t2.c2)
    -and-
    select * from t1 where t1.c1 not in (select t2.c2 from t2)
    
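    are NOT semantically equivalent when t2.c2 can produce NULL values: NOT IN then evaluates
to NULL (unknown) for every t1 row that has no matching t2.c2 value, so that row is filtered out,
whereas NOT EXISTS still returns it.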
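
    The difference can be reproduced with a small sketch. It uses Python's built-in sqlite3
module as a stand-in for Spark SQL, on the assumption that both follow standard SQL three-valued
logic for IN and EXISTS. The table contents are made up for illustration, and the helper run()
is hypothetical. Only the four queries come from the examples above, with the select list
narrowed to c1.

```python
import sqlite3

# In-memory database; the rows are illustrative, not taken from the PR.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (c1 INTEGER);
    CREATE TABLE t2 (c2 INTEGER);
    INSERT INTO t1 VALUES (1), (2);
    INSERT INTO t2 VALUES (1), (NULL);  -- t2.c2 contains a NULL
""")


def run(sql):
    """Hypothetical helper: run a query and return the first column, sorted."""
    return sorted(row[0] for row in conn.execute(sql))


# Positive forms: both keep only the t1 rows that have a match in t2.
exists_rows = run(
    "select c1 from t1 where exists (select 1 from t2 where t1.c1 = t2.c2)")
in_rows = run(
    "select c1 from t1 where t1.c1 in (select t2.c2 from t2)")

# Negated forms: these diverge once t2.c2 contains a NULL.
not_exists_rows = run(
    "select c1 from t1 where not exists (select 1 from t2 where t1.c1 = t2.c2)")
not_in_rows = run(
    "select c1 from t1 where t1.c1 not in (select t2.c2 from t2)")

print("EXISTS    :", exists_rows)
print("IN        :", in_rows)
print("NOT EXISTS:", not_exists_rows)
print("NOT IN    :", not_in_rows)
```

    With this data the two positive forms agree, while NOT EXISTS keeps the row c1 = 2 and
NOT IN returns no rows, because 2 NOT IN (1, NULL) evaluates to NULL rather than true.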
    
    Lastly, your comment on the SAMPLE operator seems right. I will give it a shot and add
it to this PR.
    
    Thanks again for your patience.

