drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From weijie tong <tongweijie...@gmail.com>
Subject Re: Drill query planning error
Date Wed, 26 Jul 2017 22:13:04 GMT
Thanks for pointing out the possible reasons @Aman @Julian . I am not sure
that's the real problem .

As I initially mentioned,sorry for not clearly described ,queries like
"select count(*),sum(a),count(distinct b) from t where dt=xxx " will be
right.but "select count(*),sum(a),count (distinct b),count(distinct c) from
t where dt=xxx"will be wrong.  The only one count(distinct) scenario does
do the Cartesian join check too.It will and a broadcast operator on my
transferred relnode. So I think I have not loose the scalar property. But
when I debug the two count(distinct ) , I find the NestedLoopJoinPrule
 check for satisfying the Cartesian conditions happened only once.It sounds
should happen twice as there are two nested loop join operators at the two
count(distinct) scenarios .If it turns out a bug ,I will log a Jira .

Best Regards


On Thu, 27 Jul 2017 at 3:10 AM Aman Sinha <amansinha@apache.org> wrote:

> Yes, the RelMdMaxRowCount statistic would be useful for this.  Thanks for
> pointing that out.   I'll see if we can leverage that.  The grouping sets
> is not yet supported in Drill, but that would be a better solution since it
> avoids extra scans.
>
> -Aman
>
> On Wed, Jul 26, 2017 at 11:25 AM, Julian Hyde <jhyde@apache.org> wrote:
>
> > Aman,
> >
> > Thanks for moving dev@calcite to Bcc. This is properly a Drill question.
> >
> > A blanket restriction on cartesian joins is a blunt instrument. Sometimes
> > cartesian joins are valid, safe, and the best plan for a query. This is a
> > case in point. Users shouldn’t have to change config parameters to get it
> > to work.
> >
> > (Actually I don’t know the query, but
> >
> >   select count(distinct deptno), count(distinct gender) from emp
> >
> > is equivalent.)
> >
> > Drill should detect that a relational expression can return at most one
> > row, and allow a cartesian join if one side is such. Calcite has a
> > RelMdMaxRowCount statistic for this. This was added as part of
> > http://issues.apache.org/jira/browse/CALCITE-604 <
> > http://issues.apache.org/jira/browse/CALCITE-604>. This rule is 100%
> > safe. No config parameters required.
> >
> > Also, Calcite has an alternative way of handling multiple distinct
> > aggregates that rewrites to use grouping sets. It doesn’t generate
> > self-joins, cartesian or otherwise.  http://issues.apache.org/jira/
> > browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732>.
> >
> > Julian
> >
> >
> >
> >
> >
> >
> > > On Jul 26, 2017, at 9:20 AM, Aman Sinha <amansinha@apache.org> wrote:
> > >
> > > [Since this is Drill specific, I put dev@calcite on BCC].
> > >
> > > If you have two aggregates: Count(distinct a), Count(distinct b), the
> > > Calcite logical plan consists of a cartesian join of 2 subqueries each
> of
> > > which first does a group-by on the distinct column followed by a count
> > > aggregate.   By default,  Drill only processes cartesian join if one
> > input
> > > of the join is known to be scalar (single row).  It sounds like after
> you
> > > did the transformation to use the cache, that scalar property somehow
> did
> > > not get propagated.
> > > You can override this behavior by a session configuration:  (this will
> > use
> > > a nested loop join even if inputs are not provably scalar, but it
> should
> > be
> > > used for specific query only).    For a more general solution, I
> believe
> > > you may have to create an enhancement JIRA with appropriate details.
> > >   'alter session set planner.enable_nljoin_for_scalar_only = false';
> > >
> > > On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <tongweijie178@gmail.com>
> > > wrote:
> > >
> > >> HI all:
> > >>
> > >>  I materialize the count distinct query result to a cache, then when
> > user
> > >> query the count distinct , a specific rule will translate the query to
> > the
> > >> cache. It turns out right when the query has only one count (distinct
> )
> > >> operator ,but when it has two count (distinct ) ,it causes error .The
> > error
> > >> info is here:
> > >> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269
> > >>
> > >>
> > >> Best Regards.
> > >>
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message