impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Taras Bobrovytsky (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5036: Parquet count star optimization
Date Wed, 07 Jun 2017 21:18:16 GMT
Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-5036: Parquet count star optimization
......................................................................


Patch Set 1:

(38 comments)

http://gerrit.cloudera.org:8080/#/c/6812/1//COMMIT_MSG
Commit Message:

PS1, Line 10: statistic
> Works for me.
Done


Line 13: 
> Mention the new rewrite rule
Done


http://gerrit.cloudera.org:8080/#/c/6812/1/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 426:     // Populate the slot with Parquet num rows statistic.
> Populate the single slot with the Parquet num rows statistic.
Done


Line 431:     memset(tuple_buf, 0, tuple_buf_size);
> Don't think we usually do this, and I don't think it's necessary.
Done


Line 432:     while (!row_batch->AtCapacity()) {
> Do we have a suitable file with many row groups for testing this logic? We 
The file has to have more than 1 row group. I'm pretty sure tpch_parquet.lineitem should work.


Line 440:       *dst_slot = file_metadata_.row_groups[row_group_idx_].num_rows;
> We could, but not sure it's worth it. One scanner does not necessarily proc
I looked for a bit how to make sure it's one thread per file, and it looks complicated to
me. I also think it's not worth doing this because for all other optimizations (for example
where we look at the min and max statistic) we would have to loop over the row groups (and
not the file). I don't think there is a per file min and max statistic.


Line 455:     DCHECK_LE(row_group_rows_read_, file_metadata_.num_rows);
> What if file_metadata_.num_rows or file_metadata_.row_groups[row_group_idx_
Not sure, should we be checking that stats is non-negative? What do you suggest we do?


Line 488:     DCHECK(row_group_idx_ <= file_metadata_.row_groups.size());
> Why this change?
To keep it consistent with line 434. I don't like checking for greater or equal to. If it's
greater, doesn't that mean that something has gone wrong?


Line 1455:     // Column readers are not needed because we are not reading from any columns
if this
> The transformation is only valid if l_comment is non-nullable. We have no c
There can be several slots if we are grouping by a partition column. Left unmodified.


http://gerrit.cloudera.org:8080/#/c/6812/1/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

Line 158:   const bool optimize_parquet_count_star() { return optimize_parquet_count_star_;
}
> const function
Done


Line 373:   bool optimize_parquet_count_star_;
> const
Done


http://gerrit.cloudera.org:8080/#/c/6812/1/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 585:     if (copiedInputExprs.size() != inputExprs.size()) return;
> Have you tried also substituting the FunctionCallExpr.mergeAggInputFn_ in A
Done


http://gerrit.cloudera.org:8080/#/c/6812/1/fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
File fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java:

Line 331:    * Return true if the slots being materialized are all partition columns.
> The new behavior is tricky to reason about, can we simplify it?
Done


Line 335:     if (materializedSlots.isEmpty()) return true;
> I think this might break the behavior of OPTIMIZE_PARTITION_KEY_SCANS=true 
Reverted to the old behavior


http://gerrit.cloudera.org:8080/#/c/6812/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 131:   // Set to true when this scan node can optimize a count(*) query by populating
the
> Sentence is a little misleading because this flag is initialized to true if
Done


Line 136:   protected ExprSubstitutionMap aggSmap_;
> // Should be applied to the AggregateInfo from the same query block.
Done


Line 243:       if (optimizeParquetCountStar_ && fileFormats.size() == 1 &&
conjuncts_.isEmpty() &&
> Create a helper function for this optimization, and describe in its functio
Done


Line 245:         // Create two functions that we will put into an smap. We want to replace
the
> Preconditions.checkState(desc_.getPath().destTable() == null);
Done. The first condition should be != null


Line 246:         // count function with a sum function.
> count() and sum()
Done


http://gerrit.cloudera.org:8080/#/c/6812/1/fe/src/main/java/org/apache/impala/planner/PlanNode.java
File fe/src/main/java/org/apache/impala/planner/PlanNode.java:

Line 213:    * @param limit
> can just remove this
Done


http://gerrit.cloudera.org:8080/#/c/6812/1/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

Line 559:   private boolean checkOptimizeCountStar(SelectStmt selectStmt) {
> Let's move this logic into AggregateInfo.
Done


http://gerrit.cloudera.org:8080/#/c/6812/1/fe/src/main/java/org/apache/impala/rewrite/CountConstantRule.java
File fe/src/main/java/org/apache/impala/rewrite/CountConstantRule.java:

Line 28:  * Replaces count(CONSTANT) with an aquivalent count{*}.
> Replaces count(<literal>) with an equivalent count(*).
Done


Line 32:  * count(2017) --> count(*)
> add count(null) -> count(null)
Done


Line 34: public class CountConstantRule implements ExprRewriteRule {
> How about NormalizeCountStarRule
Done


Line 45:     Expr e = origExpr.getChild(0);
> e -> child
Done


http://gerrit.cloudera.org:8080/#/c/6812/1/testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test:

Line 1: # IMPALA-5036
> The JIRA is not very useful. Instead, describe what this test is covering.
Got rid of the JIRA everywhere.
Added a rewrite test.


Line 3: UNION ALL
> I prefer consistent lower case or upper case, e.g., union all, or make the 
Done


Line 22: |  |  output: sum_zero_if_empty(functional_parquet.alltypes.parquet-stats: num_rows)
> agg expr seems long, I'm thinking we can omit the "functional_parquet.allty
Not quite sure how to do that. We set the SlotDescriptor label and then it just gets printed
here. Are you thinking of adding some kind of a special case?


Line 28: |  |  output: sum_zero_if_empty(functional_parquet.alltypes.parquet-stats: num_rows)
> What do you think of "sum_init_zero()"? It's a little shorter and seems a l
Done


Line 34: |  output: sum_zero_if_empty(functional_parquet.alltypes.parquet-stats: num_rows)
> sum_zero_if_empty sounds pretty obscure, what is that supposed to mean?
Renamed to sum_init_zero()


Line 112: # IMPALA-5036: group by partitioned columns.
> partition columns
Done


Line 141: # IMPALA-5036: The optimization is disabled because tinyint_col is not a partitioned
col.
> partition col
Done


Line 143: ---- PLAN
> add a negative case with a different agg function, something like:
Done


Line 377: # IMPALA-5036
> Also add a test like this:
Done


http://gerrit.cloudera.org:8080/#/c/6812/1/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
File testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test:

Line 314: WARNING: The following tables are missing relevant table and/or column statistics.
> These seemed to have stats before? Something weird in your setup?
I don't think this has anything to do with my setup. It started to think that the tables are
missing statistics after this patch for some reason. I didn't get to the bottom of it yet.


http://gerrit.cloudera.org:8080/#/c/6812/1/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test:

Line 282: ---- QUERY
> Similar to my comment on the commit message, this file tests correct handli
Done


Line 285: from functional_parquet.alltypes
> I think we need at least one focused test on a file with many row groups an
Added a test on tpch_parquet.lineitem.


Line 292: # IMPALA-5036
> Same here, JIRA number not very useful, prefer textual description of test
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6812
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I536b85c014821296aed68a0c68faadae96005e62
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message