drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Arina Ielchiieva (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-5375) Nested loop join: return correct result for left join
Date Thu, 23 Mar 2017 12:58:41 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15938213#comment-15938213
] 

Arina Ielchiieva edited comment on DRILL-5375 at 3/23/17 12:57 PM:
-------------------------------------------------------------------

[~zfong]
Actually, I have analyzed nested loop join and types of joins it can support and came to the
conclusion that it should support INNER and LEFT joins only, as was done initially (before
my changes). Basically, nested loop join is not good candidate for RIGHT or FULL joins because
of its implementation specifics, planner won't pick nested loop join for such joins as it's
not optimal. If we want support non-equi join with RIGHT and FULL joins then we need to add
support for non-equi joins in hash and merge joins which are much better candidates for such
types of joins.

The main idea of nested loop join is that it buffers data from right table (which should be
small enough) and for each left table record checks if any right table record satisfies the
join condition. Let's say we want try to allow RIGHT and FULL joins for nested loop join in
Drill.
Pre-conditions:
2 drillbits (we would assume that join will be performed on two nodes)
2 tables:
T1
||c1||
|A|
|B|
....  any other letters except C
T2
||c1||
|A|
|B|
|C|

Query: select * from t1 right join t2 on t1.c1 =  t2.c1
Expected result:
||t1.c1||t2.c1||
|A|A|
|B|B|
|null|C|

Drill buffers T2 table on each node.
Drillbit_1 receives batch from T1 (let's imagine that out batches will contain only one row):
*A*. It iterates over right input data and finds match: *A|A*. Also it marks that match for
B, C was not found.
Drillbit_2 receives batch from T1: *B*. It iterates over right input data and finds match:
*B|B*. Also it marks that match for A, C was not found.

Now to return correct RIGHT join output, we need to take statistics from two nodes (T2 rows
that didn't find match) and merge it: *B, C + A, C => C* and additionally output *null|C*.
Presumably there will be one node that waits output from all nodes. 

This doesn't really coincide with Drill batch iteration approach and may be too tricky to
implement. Thus as I have mentioned before, if we need RIGHT or FULL join to support non-equi
joins, more correctly will be to add non-equi join support for hash and merge joins.

Regarding planner.enable_join_optimization it is enabled by default. User may want to disable
it for the reasons described in above comment.




was (Author: arina):
[~zfong]
Actually, I have analyzed nested loop join and types of joins it can support and came to the
conclusion that it should support INNER and LEFT joins only, as was done initially (before
my changes). Basically, nested loop join is not good candidate for RIGHT or FULL joins because
of its implementation specifics, planner won't pick nested loop join for such joins as it's
not optimal. If we want support non-equi join with RIGHT and FULL joins then we need to add
support for non-equi joins in hash and merge joins which are much better candidates for such
types of joins.

The main idea of nested loop join is that it buffers data from right table (which should be
small enough) and for each left table record checks if any right table record satisfies the
join condition. Let's say we want try to allow RIGHT and FULL joins for nested loop join in
Drill.
Pre-conditions:
2 drillbits (we would assume that join will be performed on two nodes)
2 tables:
T1
||c1||
|A|
|B|
....  any other letters except C
T2
||c1||
|A|
|B|
|C|

Query: select * from t1 right join t2 on t1.c1 =  t2.c1
Expected result:
||t1.c1||t2.c1||
|A|A|
|B|B|
|null|C|

Drill buffers T2 table on each node.
Drillbit_1 receives batch from T1 (let's imagine that out batches will contain only one row):
*A*. It iterates over right input data and finds match: *A|A*. Also it marks that match for
B, C was not found.
Drillbit_2 receives batch from T1: *B*. It iterates over right input data and finds match:
*B|B*. Also it marks that match for A, C was not found.

Now to return correct RIGHT join output, we need to take statistics from two nodes (T2 rows
that didn't find match) and merge it: *B, C + A, C => C* and additionally output *null|C*.
Presumably there will be one node that waits output from all nodes. 

This doesn't really coincide with Drill batch iteration approach and may be too tricky to
implement. Thus as I have mentioned before, if we need RIGHT or FULL join to support non-equi
joins, more correctly will be to add non-equi join support for hash and merge joins.




> Nested loop join: return correct result for left join
> -----------------------------------------------------
>
>                 Key: DRILL-5375
>                 URL: https://issues.apache.org/jira/browse/DRILL-5375
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>              Labels: doc-impacting
>
> Mini repro:
> 1. Create 2 Hive tables with data
> {code}
> CREATE TABLE t1 (
>   FYQ varchar(999),
>   dts varchar(999),
>   dte varchar(999)
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> 2016-Q1,2016-06-01,2016-09-30
> 2016-Q2,2016-09-01,2016-12-31
> 2016-Q3,2017-01-01,2017-03-31
> 2016-Q4,2017-04-01,2017-06-30
> CREATE TABLE t2 (
>   who varchar(999),
>   event varchar(999),
>   dt varchar(999)
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> aperson,did somthing,2017-01-06
> aperson,did somthing else,2017-01-12
> aperson,had chrsitmas,2016-12-26
> aperson,went wild,2016-01-01
> {code}
> 2. Impala Query shows correct result
> {code}
> select t2.dt, t1.fyq, t2.who, t2.event
> from t2
> left join t1 on t2.dt between t1.dts and t1.dte
> order by t2.dt;
> +------------+---------+---------+-------------------+
> | dt         | fyq     | who     | event             |
> +------------+---------+---------+-------------------+
> | 2016-01-01 | NULL    | aperson | went wild         |
> | 2016-12-26 | 2016-Q2 | aperson | had chrsitmas     |
> | 2017-01-06 | 2016-Q3 | aperson | did somthing      |
> | 2017-01-12 | 2016-Q3 | aperson | did somthing else |
> +------------+---------+---------+-------------------+
> {code}
> 3. Drill query shows wrong results:
> {code}
> alter session set planner.enable_nljoin_for_scalar_only=false;
> use hive;
> select t2.dt, t1.fyq, t2.who, t2.event
> from t2
> left join t1 on t2.dt between t1.dts and t1.dte
> order by t2.dt;
> +-------------+----------+----------+--------------------+
> |     dt      |   fyq    |   who    |       event        |
> +-------------+----------+----------+--------------------+
> | 2016-12-26  | 2016-Q2  | aperson  | had chrsitmas      |
> | 2017-01-06  | 2016-Q3  | aperson  | did somthing       |
> | 2017-01-12  | 2016-Q3  | aperson  | did somthing else  |
> +-------------+----------+----------+--------------------+
> 3 rows selected (2.523 seconds)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message