spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Liang-Chi Hsieh (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
Date Sun, 06 May 2018 07:30:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465000#comment-16465000
] 

Liang-Chi Hsieh edited comment on SPARK-21274 at 5/6/18 7:29 AM:
-----------------------------------------------------------------

I read the design doc. It looks correct to me. I found a rewrite rule in Presto for INTERSECT
that seems more simple to me at [https://github.com/prestodb/presto/issues/4918#issuecomment-207106688.]

That rule can be used to do INTERSECT ALL and EXCEPT ALL, if I don't miss anything.

For example, to do INTERSECT ALL like:

{{SELECT a FROM foo INTERSECT ALL SELECT x FROM bar}}

{{We can rewrite it as:}}
{code:java}
SELECT a FROM (
  SELECT replicate_row(min_count, a) AS (min_count, a) FROM (
    SELECT a, COUNT(foo_marker) AS foo_cnt, COUNT(bar_marker) AS bar_cnt, IF(COUNT(foo_marker)
> COUNT(bar_marker), COUNT(bar_marker), COUNT(foo_marker)) AS min_count
    FROM (
      SELECT a, true as foo_marker, null as bar_marker FROM foo
      UNION ALL
      SELECT x, null as foo_marker, true as bar_marker FROM bar
    ) T1
    GROUP BY a) T2
  WHERE foo_cnt >= 1 AND bar_cnt >= 1
)
{code}
 

One advantage of that rewrite rule is the rules of INTERSECT ALL and EXCEPT ALL are more similar
to each other.

Another one is for INTERSECT ALL, it only needs one GROUP BY instead of three GROUP BY in
current design.

WDYT? [~dkbiswal] [~maropu]

 

 

 


was (Author: viirya):
I read the design doc. It looks correct to me. I found a rewrite rule in Presto for INTERSECT
that seems more simple to me at [https://github.com/prestodb/presto/issues/4918#issuecomment-207106688.]

That rule can be used to do INTERSECT ALL and EXCEPT ALL, if I don't miss anything.

For example, to do INTERSECT ALL like:

{{SELECT a FROM foo INTERSECT ALL SELECT x FROM bar}}

{{We can rewrite it as:}}
{code:java}
SELECT a FROM (
  SELECT replicate_row(min_count, a) AS (min_count, a) FROM (
    SELECT a, COUNT(foo_marker) AS foo_cnt, COUNT(bar_marker) AS bar_cnt, IF(COUNT(foo_marker)
> COUNT(bar_marker), COUNT(bar_marker), COUNT(foo_marker)) AS min_count
    FROM (
      SELECT a, true as foo_marker, null as bar_marker FROM foo
      UNION ALL
      SELECT x, null as foo_marker, true as bar_marker FROM bar
    ) T1
    GROUP BY a) T2
  WHERE foo_cnt >= 1 AND bar_cnt >= 1
)
{code}
 

One advantage of that rewrite rule is the rules of INTERSECT ALL and EXCEPT ALL are more similar
to each other.

Another one is for INTERSECT ALL, it only needs one GROUP BY instead of three GROUP BY in
current design.

WDYT?

 

 

 

> Implement EXCEPT ALL and INTERSECT ALL
> --------------------------------------
>
>                 Key: SPARK-21274
>                 URL: https://issues.apache.org/jira/browse/SPARK-21274
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Major
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following outer join:
> {code}
> SELECT a,b,c
> FROM    tab1 t1
>      LEFT OUTER JOIN 
>         tab2 t2
>      ON (
>         (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>      )
> WHERE
>     COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register as a temp.view this second query under "*t1_except_t2_df*" name that can be
also used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following anti-join using t1_except_t2_df we defined above:
> {code}
> SELECT a,b,c
> FROM    tab1 t1
> WHERE 
>    NOT EXISTS
>    (    SELECT 1
>         FROM    t1_except_t2_df e
>         WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>    )
> {code}
> So the suggestion is just to use above query rewrites to implement both EXCEPT ALL and
INTERSECT ALL sql set operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message