drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hsuan Yi Chu <hyi...@maprtech.com>
Subject Re: In list filter evaluation : room for improvement in run-time code generation.
Date Thu, 10 Sep 2015 20:12:04 GMT
Can we cartesian-join all the values in the in list and rewrite it as a
single in list:

For example,
Say, the original where-clause is

"a in (1, 2) or b in (3, 4)"

Can we implement a rule to let calcite treat it as

"(a, b) in ((1,3),(1,4),(2,3),(2,4))"

--------------------------------------------------------------
I understand the exponential growth in cartesian-join. But I am guessing
maybe in some circumstances, this does help, isn't it???

On Thu, Sep 10, 2015 at 11:04 AM, Jinfeng Ni <jinfengni99@gmail.com> wrote:

> I think Semi-join is not valid in this case, since the original query
> has 5 in-lists ORed together. If Semi-join is used, then the rows
> that does not qualify for the first 1 in-list filter would be pruned out,
> which is not valid, since they may qualify for the second in-list filter.
>
> That's why left outer join is used, if the planner decide to use join
> approach. The problem is that left outer join does not reduce the
> # of rows; essentially, Drill execution has to scan the input rows
> multiple times, making the join operator a bottleneck for the query.
>
>
>
> On Thu, Sep 10, 2015 at 10:40 AM, Hsuan Yi Chu <hyichu@maprtech.com>
> wrote:
>
> > I believe the usage of Semi-Join had been proposed before.
> >
> > Would that new operator help in this scenario you think?
> >
> > On Wed, Sep 9, 2015 at 8:16 PM, Jinfeng Ni <jinfengni99@gmail.com>
> wrote:
> >
> > > The reason that the in-list join approach is not fast enough :
> > > the query has 5 in-lists ORed together. Each in-list is converted
> > > to a left outer join.  After the 5 left outer join, there is a filter.
> > >
> > > Since left outer join does not prune any row from left side,
> > > which is the base table in this case, essentially each join has
> > > to scan the same # of rows as the base table, and copy
> > > to the outgoing batch. That is, although the in-list evaluation
> > > is using hash-based probe, which is faster than the original
> > > filter evaluation, still 5 left out join incurs big overhead
> > > in scanning/copying the data.
> > >
> > > The UDF idea in #2 is essentially doing the same kind of hash-based
> > > probe in filter evaluation. The hash-table will be initialized as
> > > a workspace variable in the doSetup(). Then, the doEval() will
> > > simply probe the hash-table.  I feel it would achieve the same
> > > benefit of join approach, while avoid the overhead of re-scanning
> > > the data multiple times.
> > >
> > > However, the current infrastructure seems miss the support
> > > of VarArg in Drill's build-in or UDF, which is required to implement
> > > this idea.
> > >
> > >
> > >
> > > On Wed, Sep 9, 2015 at 5:40 PM, Aman Sinha <amansinha@apache.org>
> wrote:
> > >
> > > > Yes, this would be a good enhancement.  Any improvement to the
> > > > efficiency/compactness of the generated code is complimentary to
> other
> > > > optimizations such as parquet filter pushdown.  I recall that there
> > was a
> > > > JIRA a while ago with hundreds or thousands of filter conditions
> > > creating a
> > > > really bloated generated code  - we should revisit that at some point
> > to
> > > > identify scope for improvement.
> > > > I am not so sure about the UDF suggestion in #2.   It seems like
> > > > identifying why the large IN-list join approach was slow and fixing
> > that
> > > > would be a general solution.
> > > >
> > > > Aman
> > > >
> > > > On Wed, Sep 9, 2015 at 1:31 PM, Jinfeng Ni <jinfengni99@gmail.com>
> > > wrote:
> > > >
> > > > > Weeks ago there was a message on drill user list, reporting
> > performance
> > > > > issues caused by in list filter [1].  The query has filter:
> > > > >
> > > > > WHERE
> > > > >    c0 IN (v_00, v_01, v_02, v_03, ... )
> > > > > OR
> > > > >    c1 IN (v_11, v_11, v_12, v_13, ....)
> > > > > OR
> > > > >    c2 IN ...
> > > > > OR
> > > > >    c3 IN ...
> > > > > OR
> > > > >    ....
> > > > >
> > > > > The profile shows that most of query time is spent on filter
> > > evaluation.
> > > > > One workaround that we recommend was to re-write the query so that
> > the
> > > > > planner would convert in list into join operation. Turns out that
> > > > > converting
> > > > > into join did help improve performance, but not as much as we
> wanted.
> > > > >
> > > > > The original query has parquet as the data source. Therefore, the
> > ideal
> > > > > solution is parquet filter pushdown, which DRILL-1950 would
> address.
> > > > >
> > > > > On the other hand, I noticed that there seems to be room for
> > > improvement
> > > > > in the run-time generated code. In particular, for " c0 in (v_00,
> > v_01,
> > > > > ...)",
> > > > > Drill will evaluate it as :
> > > > >     c0 = v_00  OR c0 = v_01 OR ...
> > > > >
> > > > > Each reference of "c0" will lead to initialization of vector and
> > holder
> > > > > assignment in the generated code. There is redundant evaluation for
> > > > > the common reference.
> > > > >
> > > > > I put together a patch,which will avoid the redundant evaluation
> for
> > > the
> > > > > common reference.  Using TPCH scale factor 10's lineitem table, I
> saw
> > > > > quite surprising improvement. (run on Mac with embedded drillbit)
> > > > >
> > > > > 1) In List uses integer type [2]
> > > > >   master branch :  12.53 seconds
> > > > >   patch on top of master branch : 7.073 seconds
> > > > > That's almost 45% improvement.
> > > > >
> > > > > 2) In List uses binary type [3]
> > > > >   master branch :  198.668 seconds
> > > > > patch on top of master branch: 20.37 seconds
> > > > >
> > > > > Two thoughts:
> > > > > 1. Will code size impact Janino compiler optimization or jvm
> hotspot
> > > > > optimization? Otherwise, it seems hard to explain the performance
> > > > > difference of removing the redundant evaluation. That might imply
> > > > > that the efficiency of run-time generated code may degrade with
> > > > > more expressions in the query (?)
> > > > >
> > > > > 2. For In-List filter, it might make sense to create a Drill UDF.
> The
> > > > > UDF will build a heap-based hashtable in setup, in a similar way
> > > > > as what the join approach will do.
> > > > >
> > > > >  I'm going to open a JIRA to submit the patch for review, as I feel
> > > > > it will benefit not only the in list filter, but also expressions
> > with
> > > > > common column references.
> > > > >
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> https://mail-archives.apache.org/mod_mbox/drill-user/201508.mbox/%3CCAC-7oTym0Yzr2RmXhDPag6k41se-uTkWu0QC%3DMABb7s94DJ0BA%40mail.gmail.com%3E
> > > > >
> > > > > [2] https://gist.github.com/jinfengni/7f6df9ed7d2c761fed33
> > > > >
> > > > > [3]  https://gist.github.com/jinfengni/7460f6d250f0d00009ed
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message