drill-dev mailing list archives

From Julian Hyde <jh...@apache.org>
Subject Re: Drill query planning error
Date Wed, 26 Jul 2017 18:25:26 GMT
Aman,

Thanks for moving dev@calcite to Bcc. This is properly a Drill question.

A blanket restriction on cartesian joins is a blunt instrument. Sometimes cartesian joins
are valid, safe, and the best plan for a query. This is a case in point. Users shouldn’t
have to change config parameters to get it to work.

(Actually I don’t know the query, but

  select count(distinct deptno), count(distinct gender) from emp 

is equivalent.)
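
To make the plan shape concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Drill. The emp rows are made up for illustration, and the second query is a hand-written approximation of the self-join plan described in the quoted message below (group by the distinct column, then count, on each side), not the plan Calcite or Drill actually emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (empno integer, deptno integer, gender text)")
# Hypothetical sample data: 3 distinct deptno values, 2 distinct genders.
conn.executemany(
    "insert into emp values (?, ?, ?)",
    [(1, 10, "F"), (2, 10, "M"), (3, 20, "F"), (4, 30, "M"), (5, 30, "M")],
)

# The query as a user would write it.
direct = conn.execute(
    "select count(distinct deptno), count(distinct gender) from emp"
).fetchone()

# Hand-written equivalent of the self-join plan. Each side groups by one
# column and then counts the groups, so each side returns exactly one row
# and the cross join produces exactly one row.
joined = conn.execute(
    """
    select d.c, g.c
    from (select count(deptno) as c
          from (select deptno from emp group by deptno)) as d
    cross join
         (select count(gender) as c
          from (select gender from emp group by gender)) as g
    """
).fetchone()

print(direct, joined)
```

Both sides of the cross join are scalar, which is why the cartesian product here is harmless.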

Drill should detect that a relational expression can return at most one row, and allow a cartesian
join if one side is such. Calcite has a RelMdMaxRowCount statistic for this. This was added
as part of http://issues.apache.org/jira/browse/CALCITE-604.
This rule is 100% safe. No config parameters required.
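
The idea can be sketched with a toy model. This is not Calcite's API: the Scan, Aggregate and Join classes and both functions below are hypothetical names invented for illustration, and the real RelMdMaxRowCount handler covers many more operators. The sketch only shows how an upper bound on row count can be derived bottom-up and then used to decide whether a cartesian join is safe.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Aggregate:
    input: object
    group_keys: Tuple[str, ...]  # empty tuple means a global aggregate


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    condition: Optional[str] = None  # None models a cartesian join


def max_row_count(rel) -> float:
    """Upper bound on the number of rows rel can return (toy rules only)."""
    if isinstance(rel, Scan):
        return math.inf  # nothing is known about a base table
    if isinstance(rel, Aggregate):
        if not rel.group_keys:
            return 1  # a global aggregate always returns one row
        return max_row_count(rel.input)  # at most one row per input row
    if isinstance(rel, Join):
        return max_row_count(rel.left) * max_row_count(rel.right)
    raise TypeError(f"unsupported node: {rel!r}")


def cartesian_join_allowed(join: Join) -> bool:
    """Allow a join with no condition only if one side is provably scalar."""
    if join.condition is not None:
        return True
    return min(max_row_count(join.left), max_row_count(join.right)) <= 1


# count(distinct x) modelled as: group by x, then a global count.
count_dept = Aggregate(Aggregate(Scan("emp"), ("deptno",)), ())
count_gender = Aggregate(Aggregate(Scan("emp"), ("gender",)), ())

scalar_join = Join(count_dept, count_gender)
wide_join = Join(Scan("emp"), Scan("emp"))

print(cartesian_join_allowed(scalar_join))
print(cartesian_join_allowed(wide_join))
```

Because the bound is derived from the shape of the plan rather than from statistics or estimates, a check like this never admits a join that could blow up.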

Also, Calcite has an alternative way of handling multiple distinct aggregates that rewrites
to use grouping sets. It doesn’t generate self-joins, cartesian or otherwise. See
http://issues.apache.org/jira/browse/CALCITE-732.
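
A rough sketch of the join-free shape, again using Python's sqlite3 module with made-up emp rows. SQLite has no GROUPING SETS, so the inner query emulates grouping sets ((deptno), (gender)) with a UNION ALL of two group-bys tagged by a hypothetical column g. This is an approximation of the idea, not the exact rewrite Calcite performs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (empno integer, deptno integer, gender text)")
# Hypothetical sample data: 3 distinct deptno values, 2 distinct genders.
conn.executemany(
    "insert into emp values (?, ?, ?)",
    [(1, 10, "F"), (2, 10, "M"), (3, 20, "F"), (4, 30, "M"), (5, 30, "M")],
)

# Inner query: one row per distinct deptno and one row per distinct gender,
# each tagged with the grouping set it came from (emulated grouping sets).
# Outer query: a single aggregate that counts the rows of each set.
# No join of any kind appears in the query.
rows = conn.execute(
    """
    select sum(case when g = 'deptno' then 1 else 0 end),
           sum(case when g = 'gender' then 1 else 0 end)
    from (
        select 'deptno' as g from emp where deptno is not null group by deptno
        union all
        select 'gender' as g from emp where gender is not null group by gender
    ) as t
    """
).fetchone()

print(rows)
```

The "is not null" filters mirror the fact that count(distinct x) ignores nulls.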

Julian

> On Jul 26, 2017, at 9:20 AM, Aman Sinha <amansinha@apache.org> wrote:
> 
> [Since this is Drill specific, I put dev@calcite on BCC].
> 
> If you have two aggregates, Count(distinct a) and Count(distinct b), the
> Calcite logical plan consists of a cartesian join of 2 subqueries, each of
> which first does a group-by on the distinct column followed by a count
> aggregate. By default, Drill only processes a cartesian join if one input
> of the join is known to be scalar (single row). It sounds like after you
> did the transformation to use the cache, that scalar property somehow did
> not get propagated.
> You can override this behavior with a session option (this will use
> a nested loop join even if inputs are not provably scalar, so it should be
> used for specific queries only):
>   alter session set `planner.enable_nljoin_for_scalar_only` = false;
> For a more general solution, I believe
> you may have to create an enhancement JIRA with appropriate details.
> 
> On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <tongweijie178@gmail.com>
> wrote:
> 
>> Hi all,
>> 
>> I materialize the count distinct query result to a cache. Then, when a user
>> queries the count distinct, a specific rule translates the query to use the
>> cache. This works when the query has only one count(distinct)
>> operator, but when it has two count(distinct) operators it causes an error. The error
>> info is here:
>> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269
>> 
>> 
>> Best Regards.
>> 

