impala-reviews mailing list archives

From "Lars Volker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Date Wed, 22 Feb 2017 00:34:02 GMT
Lars Volker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
......................................................................


Patch Set 4:

(26 comments)

Thanks for the review. Please see PS6.

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 472:   ScopedBuffer tuple_buffer(scan_node_->mem_tracker());
> no need to allocate this and go through the setup for every single row grou
Done


Line 486:   conjuncts_with_stats.reserve(min_max_conjuncts_ctxs_.size());
> there's too much memory allocation going on for what are essentially static
Done


Line 496:     unique_ptr<ColumnStatsBase> stats(
> there's no need for memory allocation here
Done


Line 501:     if (e->function_name() == "lt" || e->function_name() == "le") {
> introduce named constants (in the appropriate class)
Other code that compares function_name follows the same pattern. As discussed in person, let's
keep this for now and see if we find a nicer abstraction to determine when to use min/max
values, e.g. monotonically bound predicates.
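The idea behind checking the function name is that each comparison operator determines which row-group statistic can prove the predicate unsatisfiable. The following is a minimal standalone sketch of that logic; `Op` and `MinMaxFilter` are hypothetical names, not the Impala classes, and the statistics are simplified to plain longs:

```java
// Hypothetical sketch: decide whether a row group can possibly contain
// rows satisfying "col <op> constant", given the column's min/max stats.
// "lt"/"le" only need the row group's minimum; "gt"/"ge" only its maximum.
enum Op { LT, LE, GT, GE }

final class MinMaxFilter {
  // Returns false when the predicate cannot match any row in the
  // row group, i.e. the whole row group may be skipped.
  static boolean rowGroupMayMatch(Op op, long min, long max, long constant) {
    switch (op) {
      // "col < c": if even the smallest value is not < c, nothing is.
      case LT: return min < constant;
      case LE: return min <= constant;
      // "col > c": if even the largest value is not > c, nothing is.
      case GT: return max > constant;
      case GE: return max >= constant;
    }
    throw new AssertionError("unreachable");
  }
}
```

For example, with row-group stats [min=15, max=20], the predicate `col < 10` evaluates to false against the minimum, so the row group can be skipped.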


http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 378:   /// Min/max statistics contexts.
> who owns them?
Done


http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

Line 182:   if (min_max_tuple_id_ > -1) {
> ">" or < don't make sense for tuple ids
Done


http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 40: /// writing parquet files. It provides an interface to populate a parquet::Statistics
> this description isn't true anymore, it's now also an intermediate staging 
I removed the expanded abstraction and added a sentence to the comment explaining that it
can also be used to decode stats.


Line 70:   /// 'nullptr' for unsupported types. The caller is responsible for freeing the memory.
> this should be created in an objectpool
Code has been removed.


Line 87:   virtual void WriteMinToBuffer(void* buffer) const = 0;
> remove "tobuffer", that's clear from the parameters. what's unclear is the 
Done


Line 126:     max_value_(max_value) {
> formatting
Done


http://gerrit.cloudera.org:8080/#/c/6032/4/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 208:   // Conjuncts that can be evaluated against the statistics of parquet files. Each
> refer to thrift structs explicitly, where appropriate.
Done


Line 209:   // conjunct references the slot in 'min_max_tuple_id' with the same index.
> i didn't understand the last sentence.
Changed it.


http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 143:   // List of conjuncts for min/max values, that are used to skip data when scanning
> "min/max values" is vague, refer to thrift structs explicitly.
Done


Line 145:   private List<Expr> minMaxConjunts_ = Lists.newArrayList();
> conjuncts
Done


Line 147:   // Tuple that is used to materialize statistics when scanning Parquet files.
> describe what's in it.
Done


Line 290:    * Builds a predicate to evaluate statistical values by combining 'inputSlot',
> make 'statistical values' concrete
Done


Line 291:    * 'inputPred' and 'op' into a new predicate, and adding it to 'minMaxConjunts_'.
> missing other side effects
Done


Line 297:     Preconditions.checkState(constExpr.isConstant());
> why not a literal?
My reasoning was that any constant expression could be evaluated against the statistics. I
added a test to make sure this does indeed work as expected.


Line 303:     // Make a new slot from the slot descriptor.
> what do you mean by 'slot'? (slot/slot descriptor are usually synonyms)
I meant slotRef, but the comment looked superfluous and I removed it altogether.


Line 315:    * TODO: This method currently doesn't de-dup slots that are referenced multiple times.
> resolve
As discussed, let's keep this for now and address it in a subsequent change by simplifying
redundant exprs. I created IMPALA-4958 to track this.


Line 328:       // We only support slot refs on the left hand side of the predicate, a rewriting
> this overlaps with what kuduscannode does, and there's already a similar co
I had talked to Alex about this last week and he suggested adding a general rewrite rule in
the FE to normalize binary exprs. I removed the branches from BinaryPredicate.java that are
no longer needed after the normalization. I also cleaned up code in KuduScanNode.java that
is no longer needed.
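The normalization mentioned here can be illustrated with a small standalone sketch: when the slot reference sits on the right-hand side, swapping the operands must also reverse the comparison operator (a < b is equivalent to b > a). The class and the ":"-prefix convention for marking slot refs are hypothetical stand-ins, not Impala's actual `BinaryPredicate`/`SlotRef` API:

```java
import java.util.Map;

// Hypothetical sketch of normalizing a binary predicate so the slot
// reference ends up on the left-hand side. Operands are modeled as
// strings; a leading ':' marks a slot reference.
final class NormalizeSketch {
  // Operator to use after swapping the two operands.
  static final Map<String, String> REVERSED = Map.of(
      "lt", "gt", "le", "ge", "gt", "lt", "ge", "le", "eq", "eq", "ne", "ne");

  // Returns {lhs, op, rhs}, with the slot ref moved to the left and the
  // operator reversed if a swap was needed; otherwise unchanged.
  static String[] normalize(String lhs, String op, String rhs) {
    boolean lhsIsSlot = lhs.startsWith(":");
    boolean rhsIsSlot = rhs.startsWith(":");
    if (!lhsIsSlot && rhsIsSlot) {
      return new String[] { rhs, REVERSED.get(op), lhs };
    }
    return new String[] { lhs, op, rhs };
  }
}
```

With such a rule in place, downstream code (like the min/max conjunct builder) only has to handle the slot-on-the-left case.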


Line 343:       {
> { on preceding line
Done


Line 658:       msg.hdfs_scan_node.setMin_max_tuple_id(minMaxTuple_.getId().asInt());
> check that it's not invalid
How can I check this? The tupleId needs to come from the TupleId generator and there is no
ctor for a TupleDescriptor that leaves it uninitialized. Do you mean minMaxTuple_.getId()
!= null?


http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java
File fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java:

Line 35: public class NormalizeBinaryPredicatesRule implements ExprRewriteRule {
> add tests for rule
Done


Line 40:     if (!expr.isAnalyzed()) return expr;
> why do you need this?
I had copied this over, removed it.


http://gerrit.cloudera.org:8080/#/c/6032/4/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test:

Line 1047:    statistics predicates: c_nationkey <= 16, c_nationkey >= 16
> i think it's better to retain the original predicate (you don't need to sho
I changed the code to track the original predicates and display them instead. I also changed
the explain level to "extended", based on its description in the docs (below). While the
statistics here are different from column stats, I think it makes sense to keep "all statistics"
in the same level. However I don't feel strongly about this - let me know if you want me to
change it to verbose. On a related note, do you think we should change "statistics predicates"
to something like "parquet statistics predicates" to make it clearer that this is not column
statistics?

"Includes additional detail about how the query planner uses statistics in its decision-making
process, to understand how a query could be tuned by gathering statistics, using query hints,
adding or removing predicates, and so on."


-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj@cloudera.com>
Gerrit-Reviewer: Matthew Mulder <mmulder@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-HasComments: Yes
