drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Muhammad Gelbana <m.gelb...@gmail.com>
Subject Re: Running cartesian joins on Drill
Date Thu, 11 May 2017 22:17:46 GMT
​But the query I provided failed to be planned because it's a cartesian
join, although I've set the option you mentioned to false. Is there a
reason why wouldn't Drill rules physically implement the logical join in my
query to a nested loop join ?

*---------------------*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zfong@mapr.com> wrote:

> Provided `planner.enable_nljoin_for_scalar_only` is set to false, even
> without an explicit join condition, the query should use the Cartesian
> join/nested loop join.
>
> -- Zelaine
>
> On 5/11/17, 4:20 AM, "Anup Tiwari" <anup.tiwari@games24x7.com> wrote:
>
>     Hi,
>
>     I have one question here.. so if we have to use Cartesian join in Drill
>     then do we have to follow some workaround like Shadi mention : adding a
>     dummy column on the fly that has the value 1 in both tables and then
> join
>     on that column leading to having a match of every row of the first
> table
>     with every row of the second table, hence do a Cartesian product?
>     OR
>     If we just don't specify join condition like :
>     select a.*, b.* from tt1 as a, tt2 b; then will it internally treat
> this
>     query as Cartesian join.
>
>     Regards,
>     *Anup Tiwari*
>
>     On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zfong@mapr.com> wrote:
>
>     > Cartesian joins in Drill are implemented as nested loop joins, and I
> think
>     > you should see that reflected in the resultant query plan when you
> run
>     > explain plan on the query.
>     >
>     > Yes, Cartesian joins/nested loop joins are expensive because you’re
>     > effectively doing an MxN read of your tables.  There are more
> efficient
>     > ways of processing a nested loop join, e.g., by creating an index on
> the
>     > larger table in the join and then using that index to do lookups
> into that
>     > table.  That way, the nested loop join cost is the cost of creating
> the
>     > index + M, where M is the number of rows in the smaller table and
> assuming
>     > the lookup cost into the index does minimize the amount of data read
> of the
>     > second table.  Drill currently doesn’t do this.
>     >
>     > -- Zelaine
>     >
>     > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m.gelbana@gmail.com> wrote:
>     >
>     >     ​I believe ​clhubert is referring to this discussion
>     >     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
>     > cartesian-product-in-apache-drill#post1>
>     >     .
>     >
>     >     So why Drill doesn't transform this query into a nested join
> query ?
>     > Simply
>     >     because there is no Calcite rule to transform it into a nested
> loop
>     > join ?
>     >     Is it not technically possible to write such Rule or is it
> feasible so
>     > I
>     >     may take on this challenge ?
>     >
>     >     Also pardon me for repeating my question but I fail to find an
> answer
>     > in
>     >     your replies, why doesn't Drill just run a cartesian join ?
> Because
>     > it's
>     >     expensive regarding resources (i.e. CPU\Network\RAM) ?
>     >
>     >     Thanks a lot Shadi for the query, it works for me.
>     >
>     >     *---------------------*
>     >     *Muhammad Gelbana*
>     >     http://www.linkedin.com/in/mgelbana
>     >
>     >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <
> khalifa@cs.queensu.ca>
>     > wrote:
>     >
>     >     > Hi Muhammad,
>     >     >
>     >     > I did the following as a workaround to have Cartesian product.
> The
>     > basic
>     >     > idea is to create a dummy column on the fly that has the value
> 1 in
>     > both
>     >     > tables and then join on that column leading to having a match
> of
>     > every row
>     >     > of the first table with every row of the second table, hence
> do a
>     > Cartesian
>     >     > product. This might not be the most efficient way but it will
> do the
>     > job.
>     >     >
>     >     > *Original Query:*
>     >     > SELECT * FROM
>     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
> LIMIT
>     >     > 2147483647) `t0`
>     >     > INNER JOIN
>     >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
> LIMIT
>     >     > 2147483647) `t1`
>     >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
>     >     > LIMIT 2147483647
>     >     >
>     >     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to
> tables
>     > one
>     >     > and two, respectively. Names don't really matter, just need to
> be
>     > unique):*
>     >     > SELECT * FROM
>     >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
>     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
>     >     > INNER JOIN
>     >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
>     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
>     >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
>     >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
>     >     > LIMIT 2147483647
>     >     >
>     >     > Regards
>     >     >
>     >     >
>     >     > *Shadi Khalifa, PhD*
>     >     > Postdoctoral Fellow
>     >     > Cognitive Analytics Development Hub
>     >     > Centre for Advanced Computing
>     >     > Queen’s University
>     >     > (613) 533-6000 x78347
>     >     > http://cac.queensu.ca
>     >     >
>     >     > I'm just a neuron in the society collective brain
>     >     >
>     >     > *Join us for HPCS in June 2017! Register at:*  *
> http://2017.hpcs.ca/
>     >     > <http://2017.hpcs.ca/>*
>     >     >
>     >     > P Please consider your environmental responsibility before
> printing
>     > this
>     >     > e-mail
>     >     >
>     >     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000
>     > 01000101
>     >     > 01100111 01111001 01110000 01110100 *
>     >     >
>     >     > *The information transmitted is intended only for the person or
>     > entity to
>     >     > which it is addressed and may contain confidential material.
> Any
>     > review or
>     >     > dissemination of this information by persons other than the
> intended
>     >     > recipient is prohibited. If you received this in error, please
>     > contact the
>     >     > sender and delete the material from any computer. Thank you.*
>     >     >
>     >     >
>     >     >
>     >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
>     > m.gelbana@gmail.com>
>     >     > wrote:
>     >     >
>     >     >
>     >     > ​​
>     >     > Here it is:
>     >     >
>     >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM
>     > `dfs`.`path_to_parquet_file` tc
>     >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
>     >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
>     >     > ​​
>     >     > `t0`.`UserID` IS NOT DISTINCT FROM
>     >     > ​​
>     >     > `t1`.`UserID`) LIMIT 2147483647
>     >     >
>     >     > I debugged Drill code and found it decomposes *IS NOT DISTINCT
> FROM*
>     > into
>     >     > ​
>     >     > *`t0`.`UserID` = ​`t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
>     > `t1`.`UserID`
>     >     > IS NULL**)* while checking if the query is a cartesian join,
> and
>     > when the
>     >     > check returns true, it throws an excetion saying: *This query
> cannot
>     > be
>     >     > planned possibly due to either a cartesian join or an
> inequality
>     > join*
>     >     >
>     >     >
>     >     > *---------------------*
>     >     > *Muhammad Gelbana*
>     >     > http://www.linkedin.com/in/mgelbana
>     >     >
>     >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gparai@mapr.com>
>     > wrote:
>     >     >
>     >     > > Can you please specify the query you are trying to execute?
>     >     > >
>     >     > >
>     >     > > Gautam
>     >     > >
>     >     > > ________________________________
>     >     > > From: Muhammad Gelbana <m.gelbana@gmail.com>
>     >     > > Sent: Saturday, May 6, 2017 7:34:53 AM
>     >     > > To: user@drill.apache.org; dev@drill.apache.org
>     >     > > Subject: Running cartesian joins on Drill
>     >     > >
>     >     > > Is there a reason why Drill would intentionally reject
> cartesian
>     > join
>     >     > > queries even if *planner.enable_nljoin_for_scalar_only* is
>     > disabled ?
>     >     > >
>     >     > > Any ideas how could a query be rewritten to overcome this
>     > restriction ?
>     >     > >
>     >     > > *---------------------*
>     >     > > *Muhammad Gelbana*
>     >     > > http://www.linkedin.com/in/mgelbana
>     >     > >
>     >     >
>     >     >
>     >     >
>     >
>     >
>     >
>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message