impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sailesh Mukil <sail...@cloudera.com>
Subject Min/Max runtime filtering on Impala-Kudu
Date Mon, 27 Mar 2017 18:42:19 GMT
I will be working on a patch to add min/max filter support in Impala, and
as a first step, specifically target the KuduScanNode, since the Kudu
client is already able to accept a Min and a Max that it would internally
use to filter during its scans. Below is a brief design proposal.

*Goal:*

To leverage runtime min/max filter support in Kudu for the potential speed
up of queries over Kudu tables. Kudu does this by taking a min and a max
that Impala will provide and only return values in the range Impala is
interested in.

*[min <= range we're interested in >= max]*

*Proposal:*


   - As a first step, plumb the runtime filter code from
*exec/hdfs-scan-node-base.cc/h
   <http://hdfs-scan-node-base.cc/h>* to *exec/scan-node.cc/h
   <http://scan-node.cc/h>*, so that it can be applied to *KuduScanNode*
   cleanly as well, since *KuduScanNode* and *HdfsScanNodeBase* both
   inherit from *ScanNode.*
   - Reuse the *ColumnStats* class (exec/parquet-column-stats.h) or
   implement a lighter weight version of it to process and store the Min and
   the Max on the build side of the join.
   - Once the Min and Max values are added to the existing runtime filter
   structures, as a first step, we will ignore the Min and Max values for
   non-Kudu tables. Using them for non-Kudu tables can come in as a following
   patch(es).
   - Similarly, the bloom filter will be ignored for Kudu tables, and only
   the Min and Max values will be used, since Kudu does not accept bloom
   filters yet. (https://issues.apache.org/jira/browse/IMPALA-3741)
   - Applying the bloom filter on the Impala side of the Kudu scan (i.e. in
   KuduScanNode) is not in the scope of this patch.


*Complications:*

   - We have to make sure that finding the Min and Max values on the build
   side doesn't regress certain workloads, since the difference between
   generating a bloom filter and generating a Min and a Max, is that a bloom
   filter can be type agnostic (we just take a raw hash over the data) whereas
   a Min and a Max have to be type specific.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message