impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Marcel Kornacker (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-4624: Implement Parquet dictionary filtering
Date Sat, 18 Feb 2017 00:15:22 GMT
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4624: Implement Parquet dictionary filtering

Patch Set 8:

File be/src/exec/

Line 532:     RETURN_IF_ERROR(InitDictionaries(column_readers_));
the control flow is hard to follow because it the top-level functions don't reflect it.

why not divide the column readers into two groups (w/ dict filter and w/o) and have two explicit
loops: a) loop over first group: initialize dictionary, check whether it's a filtering candidate,
then evaluate filter; b) loop over second group: initialize dictionary. that way, you can
also get rid of dict_initialized_.

Line 607:     // values are not validated, so skip these datatypes for now.
also leave todo, this can be done during dictionary construction.

Line 615:   Status HdfsParquetScanner::EvalDictionaryFilters(int row_group_idx,

pass in RowGroup& directly, you don't use the index for anything other than indexing into

Line 628:   for (ParquetColumnReader* col_reader : dict_filter_column_readers_) {
there is a lot of code here that doesn't evaluate a dictionary. restructure it to make the
division of labor and control flow explicit.

Line 637:     vector<ExprContext*>& slot_conjunct_ctxs = slot_conjunct_it->second;
move to where variables are used.

Line 701:     // here or in ReadDataPage but never both.
is that still true? i thought you expect the first page to be dictionary-encoded? also, where
is the dict page offset from the metadata being utilized?

Line 710:     // TODO: handle the 40000 in a more elegant manner (make it a named constant)
resolve todo (don't leave todos that take very little time to resolve).

Line 723:         column_passes_filter = true;
'passes filter' is ambiguous (does this mean the predicate is true or false?). has_match is
File be/src/exec/hdfs-parquet-scanner.h:

Line 613:   void InitDictionaryFilters();
not intuitively named, i thought this had something to do with dictionary construction (since
the dictionary is in effect the filter). rename.
File be/src/exec/

Line 244:   unordered_set<Encoding::type> column_encodings_;
missing comments

Line 525:     dict_encoding_stats_[dict_header.encoding]++;

Line 611:   column_encodings_.insert(header.data_page_header.definition_level_encoding);
is this useful/required? the def/rep level encodings don't vary.

Line 1042:     for (Encoding::type encoding : columns_[i]->column_encodings_) {
you use columns_[i] a lot, create a variable
File be/src/exec/

PS8, Line 462:  NeedsConversionInline
can this be another template parameter?

Line 764:   // end of the stream.
superfluous comment, the function comment in the header file already makes that clear

Line 804:       new_buffer_size, &buffer, &new_buffer_size, &status, /* peek */

Line 811:     if (buffer_size == new_buffer_size) {
what does this mean?

Line 991:         return Status("Column chunk should not contain two dictionary pages.");
make this an actionable error message (ie, include the file name).
File be/src/exec/parquet-column-readers.h:

Line 431:   /// an indication of whether a dictionary is present.
so this is true if no dictionary exists?
File be/src/util/

Line 55:   for (T i: values) {
do a similar loop for GetValue()
File common/thrift/parquet.thrift:

Line 1: /**
is this the unmodified parquet.thrift? (and it comes without a format version indication?)
File fe/src/main/java/org/apache/impala/planner/

Line 190: 
avoid double-spaced code

Line 332:    * that are bound by only that slot. The single-slot restriction allows
explicitly mention any member variables that get populated (= spell out side effects).

Line 349:       boolean isDeterministic = true;
instead, add a new predicate to (IS_DETERMINISTIC_FN_PREDICATE), this looks like
it would be generally useful.

Line 359:       // This is important for dictionary filtering. Dictionaries do not
so these predicates aren't just 'single-slot conjuncts', because they're specifically targeted
to the requirements of dictionary filtering.


Line 692:     if (detailLevel.ordinal() >= TExplainLevel.VERBOSE.ordinal()) {
precede w/ blank line

Line 698:         Collections.sort(totalIdxList);
why sort these?

Line 700:         for (Integer idx : totalIdxList) {
single line
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

Line 23:      single slot predicates: int_col > 1
might as well call this something like 'dictionary predicates', to make explicit that it is
evaluated against a dictionary (plus, 'slot' is internal jargon), and also because we don't
have plans to use it for anything else (since min/max predicates have different restrictions).

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Joe McDonnell <>
Gerrit-Reviewer: Lars Volker <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-Reviewer: Mostafa Mokhtar <>
Gerrit-HasComments: Yes

View raw message