impala-reviews mailing list archives

From "Joe McDonnell (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4624: Implement Parquet dictionary filtering
Date Mon, 27 Feb 2017 20:54:11 GMT
Hello Marcel Kornacker,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5904

to look at the new patch set (#15).

Change subject: IMPALA-4624: Implement Parquet dictionary filtering
......................................................................

IMPALA-4624: Implement Parquet dictionary filtering

Here is a basic summary of the changes:
The frontend looks for conjuncts that operate on a single slot and passes a
map from slot id to conjunct index through thrift to the backend.
The conjunct indices are indices into the normal PlanNode conjuncts list.
The conjuncts need to satisfy certain conditions:
1. They are bound on a single slot
2. They are deterministic (no random functions)
3. They evaluate to FALSE on a NULL input. This is because the dictionary
does not include NULLs, so any condition that evaluates to TRUE on NULL
cannot be evaluated by looking only at the dictionary.
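
The three conditions above can be sketched as a simple eligibility check.
This is a minimal illustration, not the code in the patch: the Conjunct
struct, its fields, and CollectDictFilterConjuncts are hypothetical names, and
the sketch assumes a slot may have several eligible conjuncts, so it maps each
slot id to a list of indices.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical, simplified stand-in for a planner conjunct.
struct Conjunct {
  std::vector<int> bound_slot_ids;  // slots the conjunct references
  bool is_deterministic;            // false for e.g. random functions
  bool false_on_null_input;         // true if the conjunct is FALSE on NULL
};

// Returns slot id -> indices into `conjuncts` for the conjuncts that
// satisfy all three conditions listed above.
std::map<int, std::vector<int>> CollectDictFilterConjuncts(
    const std::vector<Conjunct>& conjuncts) {
  std::map<int, std::vector<int>> result;
  for (std::size_t i = 0; i < conjuncts.size(); ++i) {
    const Conjunct& c = conjuncts[i];
    if (c.bound_slot_ids.size() != 1) continue;  // condition 1
    if (!c.is_deterministic) continue;           // condition 2
    if (!c.false_on_null_input) continue;        // condition 3
    result[c.bound_slot_ids[0]].push_back(static_cast<int>(i));
  }
  return result;
}
```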

The backend converts the indices into ExprContexts. These are cloned in
the scanner threads.

The dictionary read codepath has been moved out of ReadDataPage into its
own function, InitDictionary, which is now a separate step in row group
initialization. ReadDataPage will not see any dictionary pages unless the
parquet file is invalid.

For dictionary filtering, we initialize dictionaries only as needed to evaluate
the conjuncts. The Parquet scanner evaluates the dictionary filter conjuncts on the
dictionary to see if any dictionary entry passes. If no entry passes, the row
group is eliminated. If the row group passes the dictionary filtering, then we
initialize all remaining dictionaries.
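
The elimination test described above can be sketched as follows. This is an
illustration under assumptions, not the scanner code: RowGroupEliminated is a
hypothetical name, conjuncts are modeled as plain predicates, and an entry is
assumed to "pass" only if it satisfies every conjunct on that column. An empty
dictionary is treated as having no passing entry.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Returns true if the row group can be skipped, i.e. no dictionary entry
// satisfies all of the conjuncts bound to this column.
template <typename T>
bool RowGroupEliminated(
    const std::vector<T>& dictionary,
    const std::vector<std::function<bool(const T&)>>& conjuncts) {
  for (const T& entry : dictionary) {
    bool passes = std::all_of(
        conjuncts.begin(), conjuncts.end(),
        [&entry](const std::function<bool(const T&)>& pred) {
          return pred(entry);
        });
    if (passes) return false;  // at least one value may match: keep the group
  }
  return true;  // no entry passed: the row group is eliminated
}
```

For a dictionary {1, 5, 9}, a conjunct such as "value > 10" would eliminate
the row group, while "value > 3 AND value < 6" would not, because 5 passes.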

Since column chunks can have a mixture of encodings, dictionary filtering
uses three tests to determine whether a column chunk is purely dictionary
encoded:
1. If encoding_stats is present in the parquet file, then use it to determine
if there are only dictionary encoded pages (i.e. there are no data pages with
an encoding other than PLAIN_DICTIONARY).
-OR-
2. If the encoding stats are not present, then look at the encodings. The column
is purely dictionary encoded if:
a) PLAIN_DICTIONARY is present
AND
b) Only PLAIN_DICTIONARY, RLE, or BIT_PACKED encodings are listed
-OR-
3. If this file was written by an older version of Impala, then we know that
dictionary failover happens when the dictionary reaches 40,000 values.
Dictionary filtering can proceed as long as the dictionary is smaller than
that.
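
A rough sketch of how the three tests could combine is shown below. The
types, field names, and IsPurelyDictionaryEncoded are hypothetical and only
mirror the description above; they are not the parquet.thrift definitions.
The sketch also assumes that encoding stats, when present, are authoritative,
and that test 3 is consulted only when test 2 does not succeed.

```cpp
#include <cstdint>
#include <vector>

enum class Encoding { PLAIN, PLAIN_DICTIONARY, RLE, BIT_PACKED, OTHER };

// Hypothetical per-page-type encoding statistic.
struct PageEncodingStat {
  bool is_data_page;  // false for dictionary pages
  Encoding encoding;
};

// Hypothetical, simplified column chunk metadata.
struct ColumnChunkMeta {
  bool has_encoding_stats;
  std::vector<PageEncodingStat> encoding_stats;
  std::vector<Encoding> encodings;
  bool written_by_older_impala;
  int64_t dictionary_size;  // number of dictionary values
};

// Older Impala writers fall back from dictionary encoding at this size.
constexpr int64_t kOldImpalaDictLimit = 40000;

bool IsPurelyDictionaryEncoded(const ColumnChunkMeta& m) {
  // Test 1: no data page may use an encoding other than PLAIN_DICTIONARY.
  if (m.has_encoding_stats) {
    for (const PageEncodingStat& s : m.encoding_stats) {
      if (s.is_data_page && s.encoding != Encoding::PLAIN_DICTIONARY) {
        return false;
      }
    }
    return true;
  }
  // Test 2: PLAIN_DICTIONARY is listed and nothing besides
  // PLAIN_DICTIONARY, RLE, or BIT_PACKED appears.
  bool has_dict = false;
  bool has_other = false;
  for (Encoding e : m.encodings) {
    if (e == Encoding::PLAIN_DICTIONARY) {
      has_dict = true;
    } else if (e != Encoding::RLE && e != Encoding::BIT_PACKED) {
      has_other = true;
    }
  }
  if (has_dict && !has_other) return true;
  // Test 3: an older Impala writer with a dictionary below the limit.
  return m.written_by_older_impala && m.dictionary_size < kOldImpalaDictLimit;
}
```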

parquet-mr writes the encoding list correctly in the current version in our
environment (1.5.0). This means that check #2 works on some existing files
(potentially most existing parquet-mr files).
parquet-mr writes the encoding stats starting in 1.9.0. This is the version
where check #1 will start working.

Impala's parquet writer now implements both, so either check above will work.

Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/parquet.thrift
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
A testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
A testdata/workloads/functional-query/queries/QueryTest/mt-dop-parquet-filtering.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
M tests/query_test/test_scanners.py
27 files changed, 1,420 insertions(+), 187 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/5904/15
-- 
To view, visit http://gerrit.cloudera.org:8080/5904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 15
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <joemcdonnell@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonnell@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Matthew Mulder <mmulder@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
