impala-reviews mailing list archives

From "Joe McDonnell (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-4624: Implement Parquet dictionary filtering
Date Fri, 24 Feb 2017 01:09:11 GMT
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4624: Implement Parquet dictionary filtering

Patch Set 9:

File be/src/exec/

Line 71: // is guaranteed to be true for Impala versions of 2.9 or below.
> "versions 2.9"

Line 581:   /* Nested types are not supported yet */
> nit: use // style comment
Fixed the comment.

There is not a JIRA yet; I will file one.

Line 599:   // TODO: The values should be converted at dictionary read and store
> "during dictionary construction and stored"

Line 600:   // in converted form in the dictionary.
> let's tackle this particular todo immediately after the current patch goes 
Normal string columns work. The NeedsConversion codepath is for CHAR datatypes, which have a fixed length limit: shorter values need to be padded with spaces, and longer values need to be truncated to the declared length.
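
A minimal sketch of the CHAR conversion described above, assuming a declared length `char_len`; the function name is illustrative, not Impala's actual API:

```cpp
#include <string>

// Hypothetical sketch: CHAR values shorter than the declared length are
// space-padded, and longer values are truncated to that length.
std::string ConvertToChar(const std::string& value, size_t char_len) {
  if (value.size() >= char_len) return value.substr(0, char_len);  // truncate
  std::string padded = value;
  padded.append(char_len - value.size(), ' ');  // pad with spaces
  return padded;
}
```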

Line 661:         if (enc_stat.encoding != parquet::Encoding::PLAIN_DICTIONARY) {
> Can the count ever be zero? The parquet.thrift file seems to allow it.
Good point, added a check of whether the count is nonzero.
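
A sketch of that check, assuming the page encoding stats from parquet.thrift are available as (encoding, count) pairs; the struct and function names here are illustrative stand-ins for the real types:

```cpp
#include <cstdint>
#include <vector>

// Illustrative stand-in for parquet::PageEncodingStats (see parquet.thrift).
struct PageEncodingStats {
  int encoding;   // e.g. PLAIN_DICTIONARY
  int32_t count;  // number of pages with this encoding; may be zero
};

constexpr int PLAIN_DICTIONARY = 2;  // parquet::Encoding::PLAIN_DICTIONARY

// A chunk counts as fully dictionary encoded only if every stats entry
// with a nonzero page count uses PLAIN_DICTIONARY; zero-count entries
// carry no information and are skipped.
bool IsDictionaryEncoded(const std::vector<PageEncodingStats>& stats) {
  for (const auto& s : stats) {
    if (s.count == 0) continue;
    if (s.encoding != PLAIN_DICTIONARY) return false;
  }
  return true;
}
```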

Line 699:   if (file_version_.application == "impala" && file_version_.VersionLt(2,10,0))
> this will need to change to 2.9 later, because 2.9 will support encoding_st
Yes, I will change this in a followup change.

Line 706:     BaseScalarColumnReader* scalar_reader =
> why not make this list contain BaseScalarColumnReader*?
Pushed this to a followup change. I'm still uncertain about how this will end up interacting
with nested types.

Line 715:       // This reader is not 100% dictionary encoded, so dictionary filters
> i think it's more accurate to say "we cannot say with certainty whether the

Line 735:     if (dict_filter_it == scanner_dict_filter_map_.end()) DCHECK(false);
> You could use DCHECK(a != b)
Good point, done.
File be/src/exec/hdfs-parquet-scanner.h:

PS9, Line 435: into
> s/into/to ?

Line 443:   std::vector<int8_t> dict_filter_tuple_backing_;
> you can use ScopedBuffer here. take a look at 

Line 446:   Tuple* dict_filter_tuple_;
> I'd initialize this when you need it, since it is just the result of a cast

PS9, Line 618: (
> Why the (?

Line 620:   /// structures needed for evaluating dictionary filtering.
> point out which member vars get populated (= side effects)

Line 623:   /// Returns true if the column chunk is 100% dictionary encoded
> Can you make this more concrete? For example "if all pages of the column ch

PS9, Line 630: row_group_eliminated
> Isn't the caller responsible for eliminating the row group? "_eliminated" s
Good point. Changed to match your code change.
File be/src/exec/

Line 254:   map<Encoding::type, int32_t> dict_encoding_stats_;
> why not unordered_map? Also we I think we prefer int over int32_t
File be/src/exec/

PS9, Line 499: NeedsCoversion
> NeedsConversion

Line 761:   Status BaseScalarColumnReader::ReadPageHeader(bool peek,
> indentation

Line 840:   // is not a dictionary, then there is no dictionary.
> how does this interact with the dictionary_page_offset? do we ignore that?
If set, we require dictionary_page_offset to be less than data_page_offset. Then, we start
the stream at dictionary_page_offset. 

However, there are some cases where dictionary_page_offset is not set even though there is
a dictionary. I see this on tpch_nested_parquet. The dictionary is at the start of the data
stream followed by the data pages.

Since there is this variation, it is not reliable to make use of dictionary_page_offset.

I think there should be ways to skip InitDictionary for columns that don't have any dictionary.
If we look at the column chunk encodings and PLAIN_DICTIONARY is not present, then I think
we can expect that there is no dictionary.
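
The heuristic above can be sketched like this, assuming the column chunk's encodings list is available as plain integers; the function name is hypothetical:

```cpp
#include <algorithm>
#include <vector>

constexpr int PLAIN_DICTIONARY = 2;  // parquet::Encoding::PLAIN_DICTIONARY

// Rather than trusting dictionary_page_offset (sometimes unset even when a
// dictionary exists), look at the chunk's encodings list: if
// PLAIN_DICTIONARY never appears, there should be no dictionary page, and
// InitDictionary can be skipped.
bool MayHaveDictionary(const std::vector<int>& chunk_encodings) {
  return std::find(chunk_encodings.begin(), chunk_encodings.end(),
                   PLAIN_DICTIONARY) != chunk_encodings.end();
}
```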

Line 849:   if (UNLIKELY(data_size < 0)) {
> these checks aren't really related to dictionaries, can't we do that elsewh
Moved to ReadPageHeader. This was an artifact from previous structure.
File be/src/exec/parquet-column-readers.h:

Line 366:   // Initializes the dictionary, if it exists
> point out specific state that changes

PS9, Line 427: hext_header_size
> *next_header_size
File fe/src/main/java/org/apache/impala/analysis/

Line 229:    * TODO: It should be possible to use this in isConstantImpl.
> You could pass the function name into isDeterministic and then re-use this 
Created function isDeterministicFn to share this logic.
File fe/src/main/java/org/apache/impala/planner/

Line 332:    * dictionaryFilterConjuncts_ map.
> you can leave it as-is, i don't mean to belabor the point, but putting this
That makes sense. Fixed this.

Line 682:     if (detailLevel.ordinal() >= TExplainLevel.VERBOSE.ordinal()) {
> i think the min/max predicates are output at extended level, might as well 

Line 693:         for (Integer idx : totalIdxList) { exprList.add(conjuncts_.get(idx)); }
> remove {}
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

Line 23:      dictionary filter predicates: int_col > 1
> lars called his "parquet statistics predicates', so let's call these 'parqu


Gerrit-MessageType: comment
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Joe McDonnell <>
Gerrit-Reviewer: Lars Volker <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-Reviewer: Matthew Mulder <>
Gerrit-Reviewer: Mostafa Mokhtar <>
Gerrit-HasComments: Yes
