impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marcel Kornacker <>
Subject Re: Min/Max runtime filtering on Impala-Kudu
Date Mon, 27 Mar 2017 18:49:56 GMT
On Mon, Mar 27, 2017 at 11:42 AM, Sailesh Mukil <> wrote:
> I will be working on a patch to add min/max filter support in Impala, and
> as a first step, specifically target the KuduScanNode, since the Kudu
> client is already able to accept a Min and a Max that it would internally
> use to filter during its scans. Below is a brief design proposal.
> *Goal:*
> To leverage runtime min/max filter support in Kudu for the potential speed
> up of queries over Kudu tables. Kudu does this by taking a min and a max
> that Impala will provide and only return values in the range Impala is
> interested in.
> *[min <= range we're interested in >= max]*
> *Proposal:*
>    - As a first step, plumb the runtime filter code from
> *exec/
>    <>* to *exec/
>    <>*, so that it can be applied to *KuduScanNode*
>    cleanly as well, since *KuduScanNode* and *HdfsScanNodeBase* both
>    inherit from *ScanNode.*

Quick comment: please make sure your solution also applies to KuduScanNodeMt.

>    - Reuse the *ColumnStats* class (exec/parquet-column-stats.h) or
>    implement a lighter weight version of it to process and store the Min and
>    the Max on the build side of the join.
>    - Once the Min and Max values are added to the existing runtime filter
>    structures, as a first step, we will ignore the Min and Max values for
>    non-Kudu tables. Using them for non-Kudu tables can come in as a following
>    patch(es).
>    - Similarly, the bloom filter will be ignored for Kudu tables, and only
>    the Min and Max values will be used, since Kudu does not accept bloom
>    filters yet. (
>    - Applying the bloom filter on the Impala side of the Kudu scan (i.e. in
>    KuduScanNode) is not in the scope of this patch.
> *Complications:*
>    - We have to make sure that finding the Min and Max values on the build
>    side doesn't regress certain workloads, since the difference between
>    generating a bloom filter and generating a Min and a Max, is that a bloom
>    filter can be type agnostic (we just take a raw hash over the data) whereas
>    a Min and a Max have to be type specific.

View raw message