drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zelaine Fong <zf...@mapr.com>
Subject Re: Running cartesian joins on Drill
Date Mon, 08 May 2017 16:30:50 GMT
Cartesian joins in Drill are implemented as nested loop joins, and I think you should see that
reflected in the resultant query plan when you run explain plan on the query.

Yes, Cartesian joins/nested loop joins are expensive because you’re effectively doing an
MxN read of your tables.  There are more efficient ways of processing a nested loop join,
e.g., by creating an index on the larger table in the join and then using that index to do
lookups into that table.  That way, the nested loop join cost is the cost of creating the
index + M, where M is the number of rows in the smaller table and assuming the lookup cost
into the index does minimize the amount of data read of the second table.  Drill currently
doesn’t do this.

-- Zelaine

On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m.gelbana@gmail.com> wrote:

    ​I believe ​clhubert is referring to this discussion
    <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/cartesian-product-in-apache-drill#post1>
    .
    
    So why Drill doesn't transform this query into a nested join query ? Simply
    because there is no Calcite rule to transform it into a nested loop join ?
    Is it not technically possible to write such Rule or is it feasible so I
    may take on this challenge ?
    
    Also pardon me for repeating my question but I fail to find an answer in
    your replies, why doesn't Drill just run a cartesian join ? Because it's
    expensive regarding resources (i.e. CPU\Network\RAM) ?
    
    Thanks a lot Shadi for the query, it works for me.
    
    *---------------------*
    *Muhammad Gelbana*
    http://www.linkedin.com/in/mgelbana
    
    On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <khalifa@cs.queensu.ca> wrote:
    
    > Hi Muhammad,
    >
    > I did the following as a workaround to have Cartesian product. The basic
    > idea is to create a dummy column on the fly that has the value 1 in both
    > tables and then join on that column leading to having a match of every row
    > of the first table with every row of the second table, hence do a Cartesian
    > product. This might not be the most efficient way but it will do the job.
    >
    > *Original Query:*
    > SELECT * FROM
    > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
    > 2147483647) `t0`
    > INNER JOIN
    > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
    > 2147483647) `t1`
    > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
    > LIMIT 2147483647
    >
    > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables one
    > and two, respectively. Names don't really matter, just need to be unique):*
    > SELECT * FROM
    > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
    > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
    > INNER JOIN
    > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
    > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
    > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
    > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
    > LIMIT 2147483647
    >
    > Regards
    >
    >
    > *Shadi Khalifa, PhD*
    > Postdoctoral Fellow
    > Cognitive Analytics Development Hub
    > Centre for Advanced Computing
    > Queen’s University
    > (613) 533-6000 x78347
    > http://cac.queensu.ca
    >
    > I'm just a neuron in the society collective brain
    >
    > *Join us for HPCS in June 2017! Register at:*  *http://2017.hpcs.ca/
    > <http://2017.hpcs.ca/>*
    >
    > P Please consider your environmental responsibility before printing this
    > e-mail
    >
    > *01001001 00100000 01101100 01101111 01110110 01100101 00100000 01000101
    > 01100111 01111001 01110000 01110100 *
    >
    > *The information transmitted is intended only for the person or entity to
    > which it is addressed and may contain confidential material. Any review or
    > dissemination of this information by persons other than the intended
    > recipient is prohibited. If you received this in error, please contact the
    > sender and delete the material from any computer. Thank you.*
    >
    >
    >
    > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <m.gelbana@gmail.com>
    > wrote:
    >
    >
    > ​​
    > Here it is:
    >
    > SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
    > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
    > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
    > ​​
    > `t0`.`UserID` IS NOT DISTINCT FROM
    > ​​
    > `t1`.`UserID`) LIMIT 2147483647
    >
    > I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* into
    > ​
    > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
    > IS NULL**)* while checking if the query is a cartesian join, and when the
    > check returns true, it throws an excetion saying: *This query cannot be
    > planned possibly due to either a cartesian join or an inequality join*
    >
    >
    > *---------------------*
    > *Muhammad Gelbana*
    > http://www.linkedin.com/in/mgelbana
    >
    > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gparai@mapr.com> wrote:
    >
    > > Can you please specify the query you are trying to execute?
    > >
    > >
    > > Gautam
    > >
    > > ________________________________
    > > From: Muhammad Gelbana <m.gelbana@gmail.com>
    > > Sent: Saturday, May 6, 2017 7:34:53 AM
    > > To: user@drill.apache.org; dev@drill.apache.org
    > > Subject: Running cartesian joins on Drill
    > >
    > > Is there a reason why Drill would intentionally reject cartesian join
    > > queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?
    > >
    > > Any ideas how could a query be rewritten to overcome this restriction ?
    > >
    > > *---------------------*
    > > *Muhammad Gelbana*
    > > http://www.linkedin.com/in/mgelbana
    > >
    >
    >
    >
    

Mime
View raw message