From: "Joe McDonnell (Code Review)"
Reply-To: joemcdonnell@cloudera.com
To: impala-cr@cloudera.com, reviews@impala.incubator.apache.org
CC: Marcel Kornacker, Alex Behm, Lars Volker, Mostafa Mokhtar, Matthew Mulder
Date: Fri, 24 Feb 2017 01:09:11 +0000
Subject: [Impala-ASF-CR] IMPALA-4624: Implement Parquet dictionary filtering
X-Gerrit-MessageType: comment
X-Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
X-Gerrit-Commit: 7289898ac854472daf78dc0f38a751116ea1f1d8

Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4624: Implement Parquet dictionary filtering
......................................................................


Patch Set 9:

(28 comments)

http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 71: // is guaranteed to be true for Impala versions of 2.9 or below.
> "versions 2.9"
Done


Line 581: /* Nested types are not supported yet */
> nit: use // style comment
Fixed the comment. There is no JIRA for this yet; I will file one.


Line 599: // TODO: The values should be converted at dictionary read and store
> "during dictionary construction and stored"
Done


Line 600: // in converted form in the dictionary.
> let's tackle this particular todo immediately after the current patch goes
Normal string columns work. The NeedsConversion codepath is for char datatypes, which have a length limit: shorter values need to be padded with spaces, and longer values need to be truncated to that length.
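
(For illustration only, not part of this patch: a minimal sketch of the pad/truncate conversion described above. The function name and the use of std::string are hypothetical; Impala's actual conversion path operates on tuple slots rather than std::string values.)

    #include <algorithm>
    #include <cstring>
    #include <string>

    // Hypothetical helper: give a raw string value CHAR(char_len) semantics by
    // space-padding values that are too short and truncating values that are
    // too long.
    std::string PadOrTruncateToCharLength(const std::string& value, size_t char_len) {
      std::string result(char_len, ' ');           // pre-fill with spaces
      size_t copy_len = std::min(value.size(), char_len);
      memcpy(&result[0], value.data(), copy_len);  // keep at most char_len bytes
      return result;
    }

    // PadOrTruncateToCharLength("abc", 5)    -> "abc  "
    // PadOrTruncateToCharLength("abcdef", 5) -> "abcde"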
Line 661: if (enc_stat.encoding != parquet::Encoding::PLAIN_DICTIONARY) {
> Can the count ever be zero? The parquet.thrift file seems to allow it.
Good point; added a check that the count is nonzero.


Line 699: if (file_version_.application == "impala" && file_version_.VersionLt(2,10,0)) {
> this will need to change to 2.9 later, because 2.9 will support encoding_st
Yes, I will change this in a follow-up change.


Line 706: BaseScalarColumnReader* scalar_reader =
> why not make this list contain BaseScalarColumnReader*?
Pushed this to a follow-up change. I'm still uncertain about how this will end up interacting with nested types.


Line 715: // This reader is not 100% dictionary encoded, so dictionary filters
> i think it's more accurate to say "we cannot say with certainty whether the
Done


Line 735: if (dict_filter_it == scanner_dict_filter_map_.end()) DCHECK(false);
> You could use DCHECK(a != b)
Good point, done.


http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

PS9, Line 435: into
> s/into/to ?
Done


Line 443: std::vector dict_filter_tuple_backing_;
> you can use ScopedBuffer here. take a look at
Done


Line 446: Tuple* dict_filter_tuple_;
> I'd initialize this when you need it, since it is just the result of a cast
Done


PS9, Line 618: (
> Why the (?
Fixed


Line 620: /// structures needed for evaluating dictionary filtering.
> point out which member vars get populated (= side effects)
Done


Line 623: /// Returns true if the column chunk is 100% dictionary encoded
> Can you make this more concrete? For example "if all pages of the column ch
Done


PS9, Line 630: row_group_eliminated
> Isn't the caller responsible for eliminating the row group? "_eliminated" s
Good point. Changed to match your code change.


http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 254: map dict_encoding_stats_;
> why not unordered_map? Also we I think we prefer int over int32_t
Done


http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS9, Line 499: NeedsCoversion
> NeedsConversion
Done


Line 761: Status BaseScalarColumnReader::ReadPageHeader(bool peek,
> indentation
Done


Line 840: // is not a dictionary, then there is no dictionary.
> how does this interact with the dictionary_page_offset? do we ignore that?
If dictionary_page_offset is set, we require it to be less than data_page_offset, and we start the stream at dictionary_page_offset. However, there are cases where dictionary_page_offset is not set even though there is a dictionary; I see this on tpch_nested_parquet, where the dictionary is at the start of the data stream, followed by the data pages. Given this variation, it is not reliable to depend on dictionary_page_offset here.

I think there should be ways to skip InitDictionary for columns that don't have any dictionary. If we look at the column chunk encodings and PLAIN_DICTIONARY is not present, then I think we can expect that there is no dictionary.
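
(Again for illustration only, a rough sketch of the encodings check suggested above. It assumes the Thrift-generated Parquet metadata types are available and, as in this discussion, only looks for PLAIN_DICTIONARY; this is not the actual patch.)

    #include <algorithm>
    #include <vector>

    #include "gen-cpp/parquet_types.h"  // Thrift-generated Parquet metadata (assumed available)

    // Hypothetical helper: if PLAIN_DICTIONARY never appears among the column
    // chunk's declared encodings, the chunk should not contain a dictionary
    // page, so InitDictionary() could be skipped for that column. Newer Parquet
    // writers may announce RLE_DICTIONARY instead; this sketch ignores that.
    bool MayHaveDictionaryPage(const parquet::ColumnMetaData& col_metadata) {
      const std::vector<parquet::Encoding::type>& encodings = col_metadata.encodings;
      return std::find(encodings.begin(), encodings.end(),
                       parquet::Encoding::PLAIN_DICTIONARY) != encodings.end();
    }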
Line 849: if (UNLIKELY(data_size < 0)) {
> these checks aren't really related to dictionaries, can't we do that elsewh
Moved to ReadPageHeader. This was an artifact of the previous structure.


http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

Line 366: // Initializes the dictionary, if it exists
> point out specific state that changes
Done


PS9, Line 427: hext_header_size
> *next_header_size
Done


http://gerrit.cloudera.org:8080/#/c/5904/9/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 229: * TODO: It should be possible to use this in isConstantImpl.
> You could pass the function name into isDeterministic and then re-use this
Created the function isDeterministicFn to share this logic.


http://gerrit.cloudera.org:8080/#/c/5904/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 332: * dictionaryFilterConjuncts_ map.
> you can leave it as-is, i don't mean to belabor the point, but putting this
That makes sense. Fixed this.


Line 682: if (detailLevel.ordinal() >= TExplainLevel.VERBOSE.ordinal()) {
> i think the min/max predicates are output at extended level, might as well
Done


Line 693: for (Integer idx : totalIdxList) { exprList.add(conjuncts_.get(idx)); }
> remove {}
Done


http://gerrit.cloudera.org:8080/#/c/5904/9/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

Line 23: dictionary filter predicates: int_col > 1
> lars called his "parquet statistics predicates', so let's call these 'parqu
Done


--
To view, visit http://gerrit.cloudera.org:8080/5904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell
Gerrit-Reviewer: Alex Behm
Gerrit-Reviewer: Joe McDonnell
Gerrit-Reviewer: Lars Volker
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Matthew Mulder
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-HasComments: Yes