impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joe McDonnell (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4624: Implement Parquet dictionary filtering
Date Thu, 23 Feb 2017 02:25:32 GMT
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4624: Implement Parquet dictionary filtering
......................................................................


Patch Set 8:

(31 comments)

Addressed the review comments. I also added a query option "parquet_dictionary_filtering"
which defaults to true.

http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 515:     RETURN_IF_ERROR(InitColumns(row_group_idx_, column_readers_));
> good point. InitColumns() needs to be split up so that the construction of 
I changed this so that we call InitColumns for only the dictionary filtering columns before
doing dictionary filtering. This means that the reads for the other columns will only occur
if the row group passes. 

To do better than this (i.e. to only read the dictionary), we will need to do what Marcel
suggests and rework the code more substantially.


Line 532:     RETURN_IF_ERROR(InitDictionaries(column_readers_));
> the control flow is hard to follow because it the top-level functions don't
Done


Line 607:     // values are not validated, so skip these datatypes for now.
> also leave todo, this can be done during dictionary construction.
Done


Line 615:   Status HdfsParquetScanner::EvalDictionaryFilters(int row_group_idx,
> indentation
Done


Line 628:   for (ParquetColumnReader* col_reader : dict_filter_column_readers_) {
> there is a lot of code here that doesn't evaluate a dictionary. restructure
Done


Line 637:     vector<ExprContext*>& slot_conjunct_ctxs = slot_conjunct_it->second;
> move to where variables are used.
Done


Line 701:     // here or in ReadDataPage but never both.
> is that still true? i thought you expect the first page to be dictionary-en
Good point, corrected the comment.

The dictionary page offset is used to set the start of the stream. The dictionary page offset
is required to be less than the data page offset, so the stream needs to start at dictionary
page offset.


Line 710:     // TODO: handle the 40000 in a more elegant manner (make it a named constant)
> resolve todo (don't leave todos that take very little time to resolve).
Done


Line 723:         column_passes_filter = true;
> 'passes filter' is ambiguous (does this mean the predicate is true or false
Done


http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 613:   void InitDictionaryFilters();
> not intuitively named, i thought this had something to do with dictionary c
Changed the name


http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 244:   unordered_set<Encoding::type> column_encodings_;
> missing comments
Done


Line 525:     dict_encoding_stats_[dict_header.encoding]++;
> ++...
Done


Line 611:   column_encodings_.insert(header.data_page_header.definition_level_encoding);
> is this useful/required? the def/rep level encodings don't vary.
Changed this to do the def/rep level encodings once at the start.


Line 1042:     for (Encoding::type encoding : columns_[i]->column_encodings_) {
> you use columns_[i] a lot, create a variable
Done


http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS8, Line 462:  NeedsConversionInline
> can this be another template parameter?
The callers of ReadSlot are on the fast path as well, so those locations would need to know
what value to pass into ReadSlot and would need the inlined version.


Line 764:   // end of the stream.
> superfluous comment, the function comment in the header file already makes 
Done


Line 804:       new_buffer_size, &buffer, &new_buffer_size, &status, /* peek */
true);
> indentation
Done


Line 811:     if (buffer_size == new_buffer_size) {
> what does this mean?
Added a comment


Line 991:         return Status("Column chunk should not contain two dictionary pages.");
> make this an actionable error message (ie, include the file name).
Done


http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

Line 431:   /// an indication of whether a dictionary is present.
> so this is true if no dictionary exists?
Removed this


http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/util/dict-test.cc
File be/src/util/dict-test.cc:

Line 55:   for (T i: values) {
> do a similar loop for GetValue()
Done


http://gerrit.cloudera.org:8080/#/c/5904/8/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

Line 1: /**
> is this the unmodified parquet.thrift? (and it comes without a format versi
Yes, this is the file I found on github. I fixed some whitespace, but there is no version
information.


http://gerrit.cloudera.org:8080/#/c/5904/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 190: 
> avoid double-spaced code
Done


Line 332:    * that are bound by only that slot. The single-slot restriction allows
> explicitly mention any member variables that get populated (= spell out sid
Done


Line 349:       boolean isDeterministic = true;
> instead, add a new predicate to Expr.java (IS_DETERMINISTIC_FN_PREDICATE), 
Added IS_NONDETERMINISTIC_FN_PREDICATE and reworked this code.


Line 359:       // This is important for dictionary filtering. Dictionaries do not
> so these predicates aren't just 'single-slot conjuncts', because they're sp
Renamed throughout the code.


Line 692:     if (detailLevel.ordinal() >= TExplainLevel.VERBOSE.ordinal()) {
> precede w/ blank line
Done


Line 698:         Collections.sort(totalIdxList);
> why sort these?
Added a comment. I'm sorting to display the predicates in the same order as the other predicate
lists.


Line 700:         for (Integer idx : totalIdxList) {
> single line
Done


http://gerrit.cloudera.org:8080/#/c/5904/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

Line 23:      single slot predicates: int_col > 1
> might as well call this something like 'dictionary predicates', to make exp
Done


http://gerrit.cloudera.org:8080/#/c/5904/8/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test:

Line 1: ====
> Please add coverage for nested data filtering as well.
Added a query that does dictionary filtering at the top level on tpch_nested_parquet.customer.


-- 
To view, visit http://gerrit.cloudera.org:8080/5904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <joemcdonnell@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonnell@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Matthew Mulder <mmulder@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message