impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joe McDonnell (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4624: Implement Parquet dictionary filtering
Date Wed, 01 Mar 2017 23:06:17 GMT
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4624: Implement Parquet dictionary filtering
......................................................................


Patch Set 16:

(6 comments)

Fixed an issue that was causing test failure in TestParquet::test_corrupt_files and addressed
Tim's comments.

I'm doing a pre review test job to check the changes.

http://gerrit.cloudera.org:8080/#/c/5904/14//COMMIT_MSG
Commit Message:

Line 7: IMPALA-4624: Implement Parquet dictionary filtering
> Can you mention the query option in the commit message?
Done


PS14, Line 12: incides
> indices
Done


http://gerrit.cloudera.org:8080/#/c/5904/14/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 704:       state_->query_options().parquet_dictionary_filtering;
> Can you use MemTracker::MemLimitExceeded() here to construct the status? It
Done


http://gerrit.cloudera.org:8080/#/c/5904/14/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS14, Line 256: GetDictionary
> GetDictionaryDecoder() for consistency with the other APIs?
Done


Line 865:     if (!stream_->ReadBytes(data_size, &data_, &status)) return status;
> Not your change, but can we just SkipBytes here? That would make the intent
Done


http://gerrit.cloudera.org:8080/#/c/5904/14/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 243:   public boolean isDeterministic() {
> I think we should be careful about the naming and comments here, since this
Switched this over to isBuiltinRandom/isBuiltinRandomFn. I agree that it is useful to be specific
about what this function can classify.

The issue for dictionary filtering is that anything that is non-deterministic should not be
applied as a dictionary filter. If we skip a whole row group based on some randomness, that
can produce incorrect results. 

We should think about what we want to ship. If UDFs can be nondeterministic, we should think
about disabling dictionary filtering for them.

Things like now() are fine, as it is constant over an execution of a query. It would not impact
correctness.


-- 
To view, visit http://gerrit.cloudera.org:8080/5904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 16
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <joemcdonnell@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <joemcdonnell@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Matthew Mulder <mmulder@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message