drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zelaine Fong <zf...@mapr.com>
Subject Re: Running cartesian joins on Drill
Date Thu, 11 May 2017 22:23:52 GMT
I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s my output:

0: jdbc:drill:zk=local> alter session set `planner.enable_nljoin_for_scalar_only` = false;
+-------+-------------------------------------------------+
|  ok   |                     summary                     |
+-------+-------------------------------------------------+
| true  | planner.enable_nljoin_for_scalar_only updated.  |
+-------+-------------------------------------------------+
1 row selected (0.137 seconds)
0: jdbc:drill:zk=local> explain plan for select * from dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv`
t2;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      ProjectAllowDup(*=[$0], *0=[$1])
00-02        NestedLoopJoin(condition=[true], joinType=[inner])
00-04          Project(T2¦¦*=[$0])
00-06            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1,
columns=[`*`], files=[file:/Users/zfong/foo.csv]]])
00-03          Project(T3¦¦*=[$0])
00-05            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1,
columns=[`*`], files=[file:/Users/zfong/foo.csv]]])

-- Zelaine

On 5/11/17, 3:17 PM, "Muhammad Gelbana" <m.gelbana@gmail.com> wrote:

    ​But the query I provided failed to be planned because it's a cartesian
    join, although I've set the option you mentioned to false. Is there a
    reason why wouldn't Drill rules physically implement the logical join in my
    query to a nested loop join ?
    
    *---------------------*
    *Muhammad Gelbana*
    http://www.linkedin.com/in/mgelbana
    
    On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zfong@mapr.com> wrote:
    
    > Provided `planner.enable_nljoin_for_scalar_only` is set to false, even
    > without an explicit join condition, the query should use the Cartesian
    > join/nested loop join.
    >
    > -- Zelaine
    >
    > On 5/11/17, 4:20 AM, "Anup Tiwari" <anup.tiwari@games24x7.com> wrote:
    >
    >     Hi,
    >
    >     I have one question here.. so if we have to use Cartesian join in Drill
    >     then do we have to follow some workaround like Shadi mention : adding a
    >     dummy column on the fly that has the value 1 in both tables and then
    > join
    >     on that column leading to having a match of every row of the first
    > table
    >     with every row of the second table, hence do a Cartesian product?
    >     OR
    >     If we just don't specify join condition like :
    >     select a.*, b.* from tt1 as a, tt2 b; then will it internally treat
    > this
    >     query as Cartesian join.
    >
    >     Regards,
    >     *Anup Tiwari*
    >
    >     On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zfong@mapr.com> wrote:
    >
    >     > Cartesian joins in Drill are implemented as nested loop joins, and I
    > think
    >     > you should see that reflected in the resultant query plan when you
    > run
    >     > explain plan on the query.
    >     >
    >     > Yes, Cartesian joins/nested loop joins are expensive because you’re
    >     > effectively doing an MxN read of your tables.  There are more
    > efficient
    >     > ways of processing a nested loop join, e.g., by creating an index on
    > the
    >     > larger table in the join and then using that index to do lookups
    > into that
    >     > table.  That way, the nested loop join cost is the cost of creating
    > the
    >     > index + M, where M is the number of rows in the smaller table and
    > assuming
    >     > the lookup cost into the index does minimize the amount of data read
    > of the
    >     > second table.  Drill currently doesn’t do this.
    >     >
    >     > -- Zelaine
    >     >
    >     > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m.gelbana@gmail.com> wrote:
    >     >
    >     >     ​I believe ​clhubert is referring to this discussion
    >     >     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
    >     > cartesian-product-in-apache-drill#post1>
    >     >     .
    >     >
    >     >     So why Drill doesn't transform this query into a nested join
    > query ?
    >     > Simply
    >     >     because there is no Calcite rule to transform it into a nested
    > loop
    >     > join ?
    >     >     Is it not technically possible to write such Rule or is it
    > feasible so
    >     > I
    >     >     may take on this challenge ?
    >     >
    >     >     Also pardon me for repeating my question but I fail to find an
    > answer
    >     > in
    >     >     your replies, why doesn't Drill just run a cartesian join ?
    > Because
    >     > it's
    >     >     expensive regarding resources (i.e. CPU\Network\RAM) ?
    >     >
    >     >     Thanks a lot Shadi for the query, it works for me.
    >     >
    >     >     *---------------------*
    >     >     *Muhammad Gelbana*
    >     >     http://www.linkedin.com/in/mgelbana
    >     >
    >     >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <
    > khalifa@cs.queensu.ca>
    >     > wrote:
    >     >
    >     >     > Hi Muhammad,
    >     >     >
    >     >     > I did the following as a workaround to have Cartesian product.
    > The
    >     > basic
    >     >     > idea is to create a dummy column on the fly that has the value
    > 1 in
    >     > both
    >     >     > tables and then join on that column leading to having a match
    > of
    >     > every row
    >     >     > of the first table with every row of the second table, hence
    > do a
    >     > Cartesian
    >     >     > product. This might not be the most efficient way but it will
    > do the
    >     > job.
    >     >     >
    >     >     > *Original Query:*
    >     >     > SELECT * FROM
    >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
    > LIMIT
    >     >     > 2147483647) `t0`
    >     >     > INNER JOIN
    >     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
    > LIMIT
    >     >     > 2147483647) `t1`
    >     >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
    >     >     > LIMIT 2147483647
    >     >     >
    >     >     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to
    > tables
    >     > one
    >     >     > and two, respectively. Names don't really matter, just need to
    > be
    >     > unique):*
    >     >     > SELECT * FROM
    >     >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
    >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
    >     >     > INNER JOIN
    >     >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
    >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
    >     >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
    >     >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
    >     >     > LIMIT 2147483647
    >     >     >
    >     >     > Regards
    >     >     >
    >     >     >
    >     >     > *Shadi Khalifa, PhD*
    >     >     > Postdoctoral Fellow
    >     >     > Cognitive Analytics Development Hub
    >     >     > Centre for Advanced Computing
    >     >     > Queen’s University
    >     >     > (613) 533-6000 x78347
    >     >     > http://cac.queensu.ca
    >     >     >
    >     >     > I'm just a neuron in the society collective brain
    >     >     >
    >     >     > *Join us for HPCS in June 2017! Register at:*  *
    > http://2017.hpcs.ca/
    >     >     > <http://2017.hpcs.ca/>*
    >     >     >
    >     >     > P Please consider your environmental responsibility before
    > printing
    >     > this
    >     >     > e-mail
    >     >     >
    >     >     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000
    >     > 01000101
    >     >     > 01100111 01111001 01110000 01110100 *
    >     >     >
    >     >     > *The information transmitted is intended only for the person or
    >     > entity to
    >     >     > which it is addressed and may contain confidential material.
    > Any
    >     > review or
    >     >     > dissemination of this information by persons other than the
    > intended
    >     >     > recipient is prohibited. If you received this in error, please
    >     > contact the
    >     >     > sender and delete the material from any computer. Thank you.*
    >     >     >
    >     >     >
    >     >     >
    >     >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
    >     > m.gelbana@gmail.com>
    >     >     > wrote:
    >     >     >
    >     >     >
    >     >     > ​​
    >     >     > Here it is:
    >     >     >
    >     >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM
    >     > `dfs`.`path_to_parquet_file` tc
    >     >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
    >     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
    >     >     > ​​
    >     >     > `t0`.`UserID` IS NOT DISTINCT FROM
    >     >     > ​​
    >     >     > `t1`.`UserID`) LIMIT 2147483647
    >     >     >
    >     >     > I debugged Drill code and found it decomposes *IS NOT DISTINCT
    > FROM*
    >     > into
    >     >     > ​
    >     >     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
    >     > `t1`.`UserID`
    >     >     > IS NULL**)* while checking if the query is a cartesian join,
    > and
    >     > when the
    >     >     > check returns true, it throws an excetion saying: *This query
    > cannot
    >     > be
    >     >     > planned possibly due to either a cartesian join or an
    > inequality
    >     > join*
    >     >     >
    >     >     >
    >     >     > *---------------------*
    >     >     > *Muhammad Gelbana*
    >     >     > http://www.linkedin.com/in/mgelbana
    >     >     >
    >     >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gparai@mapr.com>
    >     > wrote:
    >     >     >
    >     >     > > Can you please specify the query you are trying to execute?
    >     >     > >
    >     >     > >
    >     >     > > Gautam
    >     >     > >
    >     >     > > ________________________________
    >     >     > > From: Muhammad Gelbana <m.gelbana@gmail.com>
    >     >     > > Sent: Saturday, May 6, 2017 7:34:53 AM
    >     >     > > To: user@drill.apache.org; dev@drill.apache.org
    >     >     > > Subject: Running cartesian joins on Drill
    >     >     > >
    >     >     > > Is there a reason why Drill would intentionally reject
    > cartesian
    >     > join
    >     >     > > queries even if *planner.enable_nljoin_for_scalar_only* is
    >     > disabled ?
    >     >     > >
    >     >     > > Any ideas how could a query be rewritten to overcome this
    >     > restriction ?
    >     >     > >
    >     >     > > *---------------------*
    >     >     > > *Muhammad Gelbana*
    >     >     > > http://www.linkedin.com/in/mgelbana
    >     >     > >
    >     >     >
    >     >     >
    >     >     >
    >     >
    >     >
    >     >
    >
    >
    >
    

Mime
View raw message