drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aman Sinha <asi...@mapr.com>
Subject Re: Running cartesian joins on Drill
Date Thu, 11 May 2017 23:29:20 GMT
I think Muhammad may be trying to run his original query with IS NOT DISTINCT FROM.   That
discussion got side-tracked into Cartesian joins because his query was not getting planned
and the error was about Cartesian join.

Muhammad,  can you try with the equivalent version below ?  You mentioned the rewrite but
did you try the rewritten version ?



SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (

​​

`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID` IS NULL)
)



On 5/11/17, 3:23 PM, "Zelaine Fong" <zfong@mapr.com> wrote:



    I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s my output:



    0: jdbc:drill:zk=local> alter session set `planner.enable_nljoin_for_scalar_only` =
false;

    +-------+-------------------------------------------------+

    |  ok   |                     summary                     |

    +-------+-------------------------------------------------+

    | true  | planner.enable_nljoin_for_scalar_only updated.  |

    +-------+-------------------------------------------------+

    1 row selected (0.137 seconds)

    0: jdbc:drill:zk=local> explain plan for select * from dfs.`/Users/zfong/foo.csv` t1,
dfs.`/Users/zfong/foo.csv` t2;

    +------+------+

    | text | json |

    +------+------+

    | 00-00    Screen

    00-01      ProjectAllowDup(*=[$0], *0=[$1])

    00-02        NestedLoopJoin(condition=[true], joinType=[inner])

    00-04          Project(T2¦¦*=[$0])

    00-06            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv,
numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])

    00-03          Project(T3¦¦*=[$0])

    00-05            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv,
numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])



    -- Zelaine



    On 5/11/17, 3:17 PM, "Muhammad Gelbana" <m.gelbana@gmail.com> wrote:



        ​But the query I provided failed to be planned because it's a cartesian

        join, although I've set the option you mentioned to false. Is there a

        reason why wouldn't Drill rules physically implement the logical join in my

        query to a nested loop join ?



        *---------------------*

        *Muhammad Gelbana*

        http://www.linkedin.com/in/mgelbana



        On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zfong@mapr.com> wrote:



        > Provided `planner.enable_nljoin_for_scalar_only` is set to false, even

        > without an explicit join condition, the query should use the Cartesian

        > join/nested loop join.

        >

        > -- Zelaine

        >

        > On 5/11/17, 4:20 AM, "Anup Tiwari" <anup.tiwari@games24x7.com> wrote:

        >

        >     Hi,

        >

        >     I have one question here.. so if we have to use Cartesian join in Drill

        >     then do we have to follow some workaround like Shadi mention : adding a

        >     dummy column on the fly that has the value 1 in both tables and then

        > join

        >     on that column leading to having a match of every row of the first

        > table

        >     with every row of the second table, hence do a Cartesian product?

        >     OR

        >     If we just don't specify join condition like :

        >     select a.*, b.* from tt1 as a, tt2 b; then will it internally treat

        > this

        >     query as Cartesian join.

        >

        >     Regards,

        >     *Anup Tiwari*

        >

        >     On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zfong@mapr.com> wrote:

        >

        >     > Cartesian joins in Drill are implemented as nested loop joins, and I

        > think

        >     > you should see that reflected in the resultant query plan when you

        > run

        >     > explain plan on the query.

        >     >

        >     > Yes, Cartesian joins/nested loop joins are expensive because you’re

        >     > effectively doing an MxN read of your tables.  There are more

        > efficient

        >     > ways of processing a nested loop join, e.g., by creating an index on

        > the

        >     > larger table in the join and then using that index to do lookups

        > into that

        >     > table.  That way, the nested loop join cost is the cost of creating

        > the

        >     > index + M, where M is the number of rows in the smaller table and

        > assuming

        >     > the lookup cost into the index does minimize the amount of data read

        > of the

        >     > second table.  Drill currently doesn’t do this.

        >     >

        >     > -- Zelaine

        >     >

        >     > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m.gelbana@gmail.com> wrote:

        >     >

        >     >     ​I believe ​clhubert is referring to this discussion

        >     >     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/

        >     > cartesian-product-in-apache-drill#post1>

        >     >     .

        >     >

        >     >     So why Drill doesn't transform this query into a nested join

        > query ?

        >     > Simply

        >     >     because there is no Calcite rule to transform it into a nested

        > loop

        >     > join ?

        >     >     Is it not technically possible to write such Rule or is it

        > feasible so

        >     > I

        >     >     may take on this challenge ?

        >     >

        >     >     Also pardon me for repeating my question but I fail to find an

        > answer

        >     > in

        >     >     your replies, why doesn't Drill just run a cartesian join ?

        > Because

        >     > it's

        >     >     expensive regarding resources (i.e. CPU\Network\RAM) ?

        >     >

        >     >     Thanks a lot Shadi for the query, it works for me.

        >     >

        >     >     *---------------------*

        >     >     *Muhammad Gelbana*

        >     >     http://www.linkedin.com/in/mgelbana

        >     >

        >     >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <

        > khalifa@cs.queensu.ca>

        >     > wrote:

        >     >

        >     >     > Hi Muhammad,

        >     >     >

        >     >     > I did the following as a workaround to have Cartesian product.

        > The

        >     > basic

        >     >     > idea is to create a dummy column on the fly that has the value

        > 1 in

        >     > both

        >     >     > tables and then join on that column leading to having a match

        > of

        >     > every row

        >     >     > of the first table with every row of the second table, hence

        > do a

        >     > Cartesian

        >     >     > product. This might not be the most efficient way but it will

        > do the

        >     > job.

        >     >     >

        >     >     > *Original Query:*

        >     >     > SELECT * FROM

        >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

        > LIMIT

        >     >     > 2147483647) `t0`

        >     >     > INNER JOIN

        >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

        > LIMIT

        >     >     > 2147483647) `t1`

        >     >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)

        >     >     > LIMIT 2147483647

        >     >     >

        >     >     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to

        > tables

        >     > one

        >     >     > and two, respectively. Names don't really matter, just need
to

        > be

        >     > unique):*

        >     >     > SELECT * FROM

        >     >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`

        >     >     > INNER JOIN

        >     >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`

        >     >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)

        >     >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`

        >     >     > LIMIT 2147483647

        >     >     >

        >     >     > Regards

        >     >     >

        >     >     >

        >     >     > *Shadi Khalifa, PhD*

        >     >     > Postdoctoral Fellow

        >     >     > Cognitive Analytics Development Hub

        >     >     > Centre for Advanced Computing

        >     >     > Queen’s University

        >     >     > (613) 533-6000 x78347

        >     >     > http://cac.queensu.ca

        >     >     >

        >     >     > I'm just a neuron in the society collective brain

        >     >     >

        >     >     > *Join us for HPCS in June 2017! Register at:*  *

        > http://2017.hpcs.ca/

        >     >     > <http://2017.hpcs.ca/>*

        >     >     >

        >     >     > P Please consider your environmental responsibility before

        > printing

        >     > this

        >     >     > e-mail

        >     >     >

        >     >     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000

        >     > 01000101

        >     >     > 01100111 01111001 01110000 01110100 *

        >     >     >

        >     >     > *The information transmitted is intended only for the person
or

        >     > entity to

        >     >     > which it is addressed and may contain confidential material.

        > Any

        >     > review or

        >     >     > dissemination of this information by persons other than the

        > intended

        >     >     > recipient is prohibited. If you received this in error, please

        >     > contact the

        >     >     > sender and delete the material from any computer. Thank you.*

        >     >     >

        >     >     >

        >     >     >

        >     >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <

        >     > m.gelbana@gmail.com>

        >     >     > wrote:

        >     >     >

        >     >     >

        >     >     > ​​

        >     >     > Here it is:

        >     >     >

        >     >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM

        >     > `dfs`.`path_to_parquet_file` tc

        >     >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

        >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (

        >     >     > ​​

        >     >     > `t0`.`UserID` IS NOT DISTINCT FROM

        >     >     > ​​

        >     >     > `t1`.`UserID`) LIMIT 2147483647

        >     >     >

        >     >     > I debugged Drill code and found it decomposes *IS NOT DISTINCT

        > FROM*

        >     > into

        >     >     > ​

        >     >     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL
&&

        >     > `t1`.`UserID`

        >     >     > IS NULL**)* while checking if the query is a cartesian join,

        > and

        >     > when the

        >     >     > check returns true, it throws an excetion saying: *This query

        > cannot

        >     > be

        >     >     > planned possibly due to either a cartesian join or an

        > inequality

        >     > join*

        >     >     >

        >     >     >

        >     >     > *---------------------*

        >     >     > *Muhammad Gelbana*

        >     >     > http://www.linkedin.com/in/mgelbana

        >     >     >

        >     >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gparai@mapr.com>

        >     > wrote:

        >     >     >

        >     >     > > Can you please specify the query you are trying to execute?

        >     >     > >

        >     >     > >

        >     >     > > Gautam

        >     >     > >

        >     >     > > ________________________________

        >     >     > > From: Muhammad Gelbana <m.gelbana@gmail.com>

        >     >     > > Sent: Saturday, May 6, 2017 7:34:53 AM

        >     >     > > To: user@drill.apache.org; dev@drill.apache.org

        >     >     > > Subject: Running cartesian joins on Drill

        >     >     > >

        >     >     > > Is there a reason why Drill would intentionally reject

        > cartesian

        >     > join

        >     >     > > queries even if *planner.enable_nljoin_for_scalar_only*
is

        >     > disabled ?

        >     >     > >

        >     >     > > Any ideas how could a query be rewritten to overcome this

        >     > restriction ?

        >     >     > >

        >     >     > > *---------------------*

        >     >     > > *Muhammad Gelbana*

        >     >     > > http://www.linkedin.com/in/mgelbana

        >     >     > >

        >     >     >

        >     >     >

        >     >     >

        >     >

        >     >

        >     >

        >

        >

        >






Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message