impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Marcel Kornacker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4624: Implement Parquet dictionary filtering
Date Thu, 23 Feb 2017 16:03:50 GMT
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4624: Implement Parquet dictionary filtering
......................................................................


Patch Set 9:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 71: // is guaranteed to be true for Impala versions of 2.9 or below.
"versions 2.9"


Line 599:   // TODO: The values should be converted at dictionary read and store
"during dictionary construction and stored"


Line 600:   // in converted form in the dictionary.
let's tackle this particular todo immediately after the current patch goes in, having dictionary
filtering for string columns is going to be very useful.


Line 699:   if (file_version_.application == "impala" && file_version_.VersionLt(2,10,0))
{
this will need to change to 2.9 later, because 2.9 will support encoding_stats


Line 706:     BaseScalarColumnReader* scalar_reader =
why not make this list contain BaseScalarColumnReader*?


Line 715:       // This reader is not 100% dictionary encoded, so dictionary filters
i think it's more accurate to say "we cannot say with certainty whether the columns if 100%
..."


http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 443:   std::vector<int8_t> dict_filter_tuple_backing_;
you can use ScopedBuffer here. take a look at 

https://gerrit.cloudera.org/#/c/6032/11/be/src/exec/hdfs-parquet-scanner.h


Line 620:   /// structures needed for evaluating dictionary filtering.
point out which member vars get populated (= side effects)

in other words, you could have also phrased this as "populates dict_filterable_readers_ and
non_dict_filterable_readers_".


http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 761:   Status BaseScalarColumnReader::ReadPageHeader(bool peek,
indentation


Line 840:   // is not a dictionary, then there is no dictionary.
how does this interact with the dictionary_page_offset? do we ignore that?


Line 849:   if (UNLIKELY(data_size < 0)) {
these checks aren't really related to dictionaries, can't we do that elsewhere?

fine to leave todo.


http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

Line 366:   // Initializes the dictionary, if it exists
point out specific state that changes


http://gerrit.cloudera.org:8080/#/c/5904/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 698:     return output.toString();
> Added a comment. I'm sorting to display the predicates in the same order as
but isn't the unsorted order the one in which they're being applied?


http://gerrit.cloudera.org:8080/#/c/5904/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 332:    * dictionaryFilterConjuncts_ map.
you can leave it as-is, i don't mean to belabor the point, but putting this more concisely
would also be fine:

"Walks through conjuncts and populates dictionaryFilterConjuncts_."

(you already described what that variable means elsewhere)


Line 682:     if (detailLevel.ordinal() >= TExplainLevel.VERBOSE.ordinal()) {
i think the min/max predicates are output at extended level, might as well follow that.


Line 693:         for (Integer idx : totalIdxList) { exprList.add(conjuncts_.get(idx)); }
remove {}


http://gerrit.cloudera.org:8080/#/c/5904/9/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

Line 23:      dictionary filter predicates: int_col > 1
lars called his "parquet statistics predicates', so let's call these 'parquet dictionary predicates'


-- 
To view, visit http://gerrit.cloudera.org:8080/5904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <joemcdonnell@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonnell@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Matthew Mulder <mmulder@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message