impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Marcel Kornacker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4624: Implement Parquet dictionary filtering
Date Sat, 18 Feb 2017 00:15:22 GMT
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4624: Implement Parquet dictionary filtering
......................................................................


Patch Set 8:

(29 comments)

http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 532:     RETURN_IF_ERROR(InitDictionaries(column_readers_));
the control flow is hard to follow because it the top-level functions don't reflect it.

why not divide the column readers into two groups (w/ dict filter and w/o) and have two explicit
loops: a) loop over first group: initialize dictionary, check whether it's a filtering candidate,
then evaluate filter; b) loop over second group: initialize dictionary. that way, you can
also get rid of dict_initialized_.


Line 607:     // values are not validated, so skip these datatypes for now.
also leave todo, this can be done during dictionary construction.


Line 615:   Status HdfsParquetScanner::EvalDictionaryFilters(int row_group_idx,
indentation

pass in RowGroup& directly, you don't use the index for anything other than indexing into
row_groups


Line 628:   for (ParquetColumnReader* col_reader : dict_filter_column_readers_) {
there is a lot of code here that doesn't evaluate a dictionary. restructure it to make the
division of labor and control flow explicit.


Line 637:     vector<ExprContext*>& slot_conjunct_ctxs = slot_conjunct_it->second;
move to where variables are used.


Line 701:     // here or in ReadDataPage but never both.
is that still true? i thought you expect the first page to be dictionary-encoded? also, where
is the dict page offset from the metadata being utilized?


Line 710:     // TODO: handle the 40000 in a more elegant manner (make it a named constant)
resolve todo (don't leave todos that take very little time to resolve).


Line 723:         column_passes_filter = true;
'passes filter' is ambiguous (does this mean the predicate is true or false?). has_match is
not.


http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 613:   void InitDictionaryFilters();
not intuitively named, i thought this had something to do with dictionary construction (since
the dictionary is in effect the filter). rename.


http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 244:   unordered_set<Encoding::type> column_encodings_;
missing comments


Line 525:     dict_encoding_stats_[dict_header.encoding]++;
++...


Line 611:   column_encodings_.insert(header.data_page_header.definition_level_encoding);
is this useful/required? the def/rep level encodings don't vary.


Line 1042:     for (Encoding::type encoding : columns_[i]->column_encodings_) {
you use columns_[i] a lot, create a variable


http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS8, Line 462:  NeedsConversionInline
can this be another template parameter?


Line 764:   // end of the stream.
superfluous comment, the function comment in the header file already makes that clear


Line 804:       new_buffer_size, &buffer, &new_buffer_size, &status, /* peek */
true);
indentation


Line 811:     if (buffer_size == new_buffer_size) {
what does this mean?


Line 991:         return Status("Column chunk should not contain two dictionary pages.");
make this an actionable error message (ie, include the file name).


http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

Line 431:   /// an indication of whether a dictionary is present.
so this is true if no dictionary exists?


http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/util/dict-test.cc
File be/src/util/dict-test.cc:

Line 55:   for (T i: values) {
do a similar loop for GetValue()


http://gerrit.cloudera.org:8080/#/c/5904/8/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

Line 1: /**
is this the unmodified parquet.thrift? (and it comes without a format version indication?)


http://gerrit.cloudera.org:8080/#/c/5904/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 190: 
avoid double-spaced code


Line 332:    * that are bound by only that slot. The single-slot restriction allows
explicitly mention any member variables that get populated (= spell out side effects).


Line 349:       boolean isDeterministic = true;
instead, add a new predicate to Expr.java (IS_DETERMINISTIC_FN_PREDICATE), this looks like
it would be generally useful.


Line 359:       // This is important for dictionary filtering. Dictionaries do not
so these predicates aren't just 'single-slot conjuncts', because they're specifically targeted
to the requirements of dictionary filtering.

rename.


Line 692:     if (detailLevel.ordinal() >= TExplainLevel.VERBOSE.ordinal()) {
precede w/ blank line


Line 698:         Collections.sort(totalIdxList);
why sort these?


Line 700:         for (Integer idx : totalIdxList) {
single line


http://gerrit.cloudera.org:8080/#/c/5904/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

Line 23:      single slot predicates: int_col > 1
might as well call this something like 'dictionary predicates', to make explicit that it is
evaluated against a dictionary (plus, 'slot' is internal jargon), and also because we don't
have plans to use it for anything else (since min/max predicates have different restrictions).


-- 
To view, visit http://gerrit.cloudera.org:8080/5904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <joemcdonnell@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonnell@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message