impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joe McDonnell (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-4624: Implement Parquet dictionary filtering
Date Thu, 23 Feb 2017 02:25:32 GMT
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4624: Implement Parquet dictionary filtering

Patch Set 8:


Addressed the review comments. I also added a query option "parquet_dictionary_filtering"
which defaults to true.
File be/src/exec/

Line 515:     RETURN_IF_ERROR(InitColumns(row_group_idx_, column_readers_));
> good point. InitColumns() needs to be split up so that the construction of 
I changed this so that we call InitColumns for only the dictionary filtering columns before
doing dictionary filtering. This means that the reads for the other columns will only occur
if the row group passes. 

To do better than this (i.e. to only read the dictionary), we will need to do what Marcel
suggests and rework the code more substantially.

Line 532:     RETURN_IF_ERROR(InitDictionaries(column_readers_));
> the control flow is hard to follow because it the top-level functions don't

Line 607:     // values are not validated, so skip these datatypes for now.
> also leave todo, this can be done during dictionary construction.

Line 615:   Status HdfsParquetScanner::EvalDictionaryFilters(int row_group_idx,
> indentation

Line 628:   for (ParquetColumnReader* col_reader : dict_filter_column_readers_) {
> there is a lot of code here that doesn't evaluate a dictionary. restructure

Line 637:     vector<ExprContext*>& slot_conjunct_ctxs = slot_conjunct_it->second;
> move to where variables are used.

Line 701:     // here or in ReadDataPage but never both.
> is that still true? i thought you expect the first page to be dictionary-en
Good point, corrected the comment.

The dictionary page offset is used to set the start of the stream. The dictionary page offset
is required to be less than the data page offset, so the stream needs to start at dictionary
page offset.

Line 710:     // TODO: handle the 40000 in a more elegant manner (make it a named constant)
> resolve todo (don't leave todos that take very little time to resolve).

Line 723:         column_passes_filter = true;
> 'passes filter' is ambiguous (does this mean the predicate is true or false
File be/src/exec/hdfs-parquet-scanner.h:

Line 613:   void InitDictionaryFilters();
> not intuitively named, i thought this had something to do with dictionary c
Changed the name
File be/src/exec/

Line 244:   unordered_set<Encoding::type> column_encodings_;
> missing comments

Line 525:     dict_encoding_stats_[dict_header.encoding]++;
> ++...

Line 611:   column_encodings_.insert(header.data_page_header.definition_level_encoding);
> is this useful/required? the def/rep level encodings don't vary.
Changed this to do the def/rep level encodings once at the start.

Line 1042:     for (Encoding::type encoding : columns_[i]->column_encodings_) {
> you use columns_[i] a lot, create a variable
File be/src/exec/

PS8, Line 462:  NeedsConversionInline
> can this be another template parameter?
The callers of ReadSlot are on the fast path as well, so those locations would need to know
what value to pass into ReadSlot and would need the inlined version.

Line 764:   // end of the stream.
> superfluous comment, the function comment in the header file already makes 

Line 804:       new_buffer_size, &buffer, &new_buffer_size, &status, /* peek */
> indentation

Line 811:     if (buffer_size == new_buffer_size) {
> what does this mean?
Added a comment

Line 991:         return Status("Column chunk should not contain two dictionary pages.");
> make this an actionable error message (ie, include the file name).
File be/src/exec/parquet-column-readers.h:

Line 431:   /// an indication of whether a dictionary is present.
> so this is true if no dictionary exists?
Removed this
File be/src/util/

Line 55:   for (T i: values) {
> do a similar loop for GetValue()
File common/thrift/parquet.thrift:

Line 1: /**
> is this the unmodified parquet.thrift? (and it comes without a format versi
Yes, this is the file I found on github. I fixed some whitespace, but there is no version
File fe/src/main/java/org/apache/impala/planner/

Line 190: 
> avoid double-spaced code

Line 332:    * that are bound by only that slot. The single-slot restriction allows
> explicitly mention any member variables that get populated (= spell out sid

Line 349:       boolean isDeterministic = true;
> instead, add a new predicate to (IS_DETERMINISTIC_FN_PREDICATE), 
Added IS_NONDETERMINISTIC_FN_PREDICATE and reworked this code.

Line 359:       // This is important for dictionary filtering. Dictionaries do not
> so these predicates aren't just 'single-slot conjuncts', because they're sp
Renamed throughout the code.

Line 692:     if (detailLevel.ordinal() >= TExplainLevel.VERBOSE.ordinal()) {
> precede w/ blank line

Line 698:         Collections.sort(totalIdxList);
> why sort these?
Added a comment. I'm sorting to display the predicates in the same order as the other predicate

Line 700:         for (Integer idx : totalIdxList) {
> single line
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

Line 23:      single slot predicates: int_col > 1
> might as well call this something like 'dictionary predicates', to make exp
File testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test:

Line 1: ====
> Please add coverage for nested data filtering as well.
Added a query that does dictionary filtering at the top level on tpch_nested_parquet.customer.

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Joe McDonnell <>
Gerrit-Reviewer: Lars Volker <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-Reviewer: Matthew Mulder <>
Gerrit-Reviewer: Mostafa Mokhtar <>
Gerrit-HasComments: Yes

View raw message