impala-reviews mailing list archives

From "Marcel Kornacker (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Date Sun, 19 Feb 2017 19:19:07 GMT
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics

Patch Set 4:

File be/src/exec/

Line 472:   ScopedBuffer tuple_buffer(scan_node_->mem_tracker());
no need to allocate this and go through the setup for every single row group.

Line 486:   conjuncts_with_stats.reserve(min_max_conjuncts_ctxs_.size());
there's too much memory allocation going on for what are essentially static structures.
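The pattern the two comments above are asking for can be sketched as follows; the names here are illustrative, not the patch's:

```cpp
#include <cstddef>
#include <vector>

// Sketch of the suggestion: allocate the tuple buffer (and similar
// per-scan structures) once at scanner setup and reuse them for every
// row group, instead of reallocating inside the per-row-group loop.
struct MinMaxEvalState {
  std::vector<char> tuple_buffer;  // reused across row groups
  size_t evals = 0;

  void Prepare(size_t tuple_size) { tuple_buffer.resize(tuple_size); }

  // Per-row-group work only reuses the buffer; no allocation on this path.
  void EvalRowGroup() {
    // ... evaluate min/max conjuncts against tuple_buffer.data() ...
    ++evals;
  }
};
```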

Line 496:     unique_ptr<ColumnStatsBase> stats(
there's no need for memory allocation here

Line 501:     if (e->function_name() == "lt" || e->function_name() == "le") {
introduce named constants (in the appropriate class)
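A minimal sketch of what that could look like; the constant names and the class they live in are hypothetical, the author picks the appropriate home:

```cpp
#include <string>

// Hypothetical named constants for the built-in comparison function names;
// where they actually belong (scanner class, expr class, ...) is the
// author's call.
struct ParquetFnNames {
  static constexpr const char* kLt = "lt";
  static constexpr const char* kLe = "le";
  static constexpr const char* kGt = "gt";
  static constexpr const char* kGe = "ge";
};

// The call site then reads without bare string literals:
bool IsUpperBoundFn(const std::string& fn_name) {
  return fn_name == ParquetFnNames::kLt || fn_name == ParquetFnNames::kLe;
}
```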
File be/src/exec/hdfs-parquet-scanner.h:

Line 378:   /// Min/max statistics contexts.
who owns them?
File be/src/exec/

Line 182:   if (min_max_tuple_id_ > -1) {
">" or "<" don't make sense for tuple ids
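The point can be sketched like this; a tuple id is either set or a sentinel, so test the sentinel explicitly rather than ordering it ('kInvalidTupleId' is a hypothetical name for illustration):

```cpp
// Hypothetical sentinel; the patch uses -1 for "no min/max tuple".
constexpr int kInvalidTupleId = -1;

// An explicit validity test instead of 'min_max_tuple_id_ > -1'.
bool HasMinMaxTuple(int min_max_tuple_id) {
  return min_max_tuple_id != kInvalidTupleId;
}
```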
File be/src/exec/parquet-column-stats.h:

Line 40: /// writing parquet files. It provides an interface to populate a parquet::Statistics
this description isn't true anymore; it's now also an intermediate staging area for the min/max

i'm not sure this expanded abstraction makes sense (why do you need an intermediate staging area?)

Line 70:   /// 'nullptr' for unsupported types. The caller is responsible for freeing the
this should be created in an objectpool
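The generic object-pool pattern behind this comment looks roughly like the sketch below. Impala has its own ObjectPool class; this toy version only shows the shape of the idea, that the pool owns everything added to it, so factory methods can hand back raw pointers without transferring ownership to the caller:

```cpp
#include <memory>
#include <vector>

// Toy object pool: Add() takes ownership of 'obj'; the returned pointer
// stays valid until the pool is destroyed.
class ObjectPool {
 public:
  template <typename T>
  T* Add(T* obj) {
    objects_.emplace_back(obj);
    return obj;
  }

 private:
  // shared_ptr<void> remembers the concrete deleter of each added type.
  std::vector<std::shared_ptr<void>> objects_;
};
```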

Line 87:   virtual void WriteMinToBuffer(void* buffer) const = 0;
remove "tobuffer", that's clear from the parameters. what's unclear is the format of the written
data. is the void* simply a slot? 'buffer' is too generic.

Line 126:     max_value_(max_value) {

it looks like you're populating an object of this class and then writing out the min/max to
the corresponding slots. why not go directly from parquet::Statistics to the slots?
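A sketch of the "go directly to the slots" idea: decode the plain-encoded min/max bytes straight into the target slot, with no intermediate stats object. Parquet's PLAIN encoding for INT32 is the raw little-endian 4-byte value; this toy decoder assumes a little-endian host, and the function name is illustrative:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Decodes a PLAIN-encoded INT32 min/max value directly into 'slot'.
// Returns false if the encoded value has the wrong length.
bool DecodePlainInt32(const std::string& encoded, void* slot) {
  if (encoded.size() != sizeof(int32_t)) return false;
  std::memcpy(slot, encoded.data(), sizeof(int32_t));
  return true;
}
```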
File common/thrift/PlanNodes.thrift:

Line 208:   // Conjuncts that can be evaluated against the statistics of parquet files. Each
refer to thrift structs explicitly, where appropriate.

Line 209:   // conjunct references the slot in 'min_max_tuple_id' with the same index.
i didn't understand the last sentence.
File fe/src/main/java/org/apache/impala/planner/

Line 143:   // List of conjuncts for min/max values, that are used to skip data when scanning
"min/max values" is vague, refer to thrift structs explicitly.

Line 145:   private List<Expr> minMaxConjunts_ = Lists.newArrayList();

Line 147:   // Tuple that is used to materialize statistics when scanning Parquet files.
describe what's in it.

i should have an understanding of the overall approach just based on reading the comments
for the related data structures. this might best fit into the class header.

Line 290:    * Builds a predicate to evaluate statistical values by combining 'inputSlot',
make 'statistical values' concrete

Line 291:    * 'inputPred' and 'op' into a new predicate, and adding it to 'minMaxConjunts_'.
missing other side effects

Line 297:     Preconditions.checkState(constExpr.isConstant());
why not a literal?

Line 303:     // Make a new slot from the slot descriptor.
what do you mean by 'slot'? (slot/slot descriptor are usually synonyms)

Line 315:    * TODO: This method currently doesn't de-dup slots that are referenced multiple

Line 328:       // We only support slot refs on the left hand side of the predicate, a rewriting
this overlaps with what kuduscannode does, and there's already a similar concept in the form
of binarypredicate.getboundslot. figure out the similarity (might require a new concept) and
move the functionality into binarypredicate.

Line 343:       {
{ on preceding line

Line 658:       msg.hdfs_scan_node.setMin_max_tuple_id(minMaxTuple_.getId().asInt());
check that it's not invalid
File fe/src/main/java/org/apache/impala/rewrite/

Line 35: public class NormalizeBinaryPredicatesRule implements ExprRewriteRule {
add tests for rule

Line 40:     if (!expr.isAnalyzed()) return expr;
why do you need this?
File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test:

Line 1047:    statistics predicates: c_nationkey <= 16, c_nationkey >= 16
i think it's better to retain the original predicate (you don't need to show me here that
you know how to transform it correctly against the min/max tuple).

also, this should only be printed for verbose explain output.


Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Lars Volker <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-Reviewer: Matthew Jacobs <>
Gerrit-HasComments: Yes
