impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5036: Parquet count star optimization
Date Mon, 12 Jun 2017 20:30:49 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-5036: Parquet count star optimization
......................................................................


Patch Set 3:

(26 comments)

http://gerrit.cloudera.org:8080/#/c/6812/3/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 452:   }
Why not else if as in the previous patch set? Else-if seems more accurate.


http://gerrit.cloudera.org:8080/#/c/6812/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 226:   11: optional i64 parquet_count_star_slot_offset
i32 right?


http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 571:   public FunctionCallExpr getMergeAggInputFn() { return mergeAggInputFn_; }
> Yes, they are used on line 617, for example. I don't think it would be a go
I think you can access members directly in L617


http://gerrit.cloudera.org:8080/#/c/6812/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 248:    * Adds a new slot descriptor to the tuple descriptor of this scan. Also adds
an entry
* explain what is going to be stored in this new slot descriptor
* explain what is returned


Line 249:    * to 'optimizedAggSmap_' that replaces a count() with a special sum() function
that
that substitutes count(*) with sum_init_zero(<new-slotref>)


Line 915:     msg.hdfs_scan_node.setOptimize_parquet_count_star(optimizedAggSmap_ != null);
Do we need to pass this to the BE? The presence/absence of the parquet_count_star_slot_offset
seems enough to decide.


http://gerrit.cloudera.org:8080/#/c/6812/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

Line 1213:    * table scans.
instead of scanning the table (fix other places below also)


http://gerrit.cloudera.org:8080/#/c/6812/3/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

Line 290:   public void testParquetStats() { runPlannerTestFile("parquet-stats-agg"); }
testParquetStatsAgg()


http://gerrit.cloudera.org:8080/#/c/6812/3/testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test:

Line 1: # Verify that that the parquet count(*) optimization is applied in all the cases.
spell out "in all the cases" a little more and also mention that in one case we expect the
optimization to not be applied


Line 22: |  |  output: sum_init_zero(functional_parquet.alltypes.parquet-stats: num_rows)
Can we reduce this to just parquet-stats.num_rows? How do we create such a long label?


Line 99: ---- DISTRIBUTEDPLAN
Remove here and all tests below. I think showing the distributed plan for the first test case
is enough.


Line 114: select month, count(*) from functional_parquet.alltypes group by month, year
Add a negative test for this one:

select count(*), count(year), count(month),


Line 172: select max(year), count(*) from functional_parquet.alltypes
use avg() instead of max() because max() is going to be optimized in the same way in the future
(but avg() definitely not)


Line 195: # IMPALA-5036
JIRA number is not very descriptive. Describe what this test case is checking.


Line 278: # The count(*) optimization is applied to the inline view even if there is a join.
Add a negative test case that shows the query block must have one table ref, e.g.

select count(*) from functional_parquet.alltypes a, functional_parquet.alltypes b


Line 352: # tinyint_col is not partitioned so the optimization is disabled.
tinyint_col is not a partition column


Line 402: # Optimization is not applied in the case of count(null).
is not applied to count(null)


Line 451: # Optimization is not applied because the count(*) is not applied directly to the
Optimization is not applied across query blocks, even though it would be correct here.


Line 453: select count(*) from ( select int_col from functional_parquet.alltypes) t
Add a new test that shows we only consider materialized agg exprs, something like:

select year, cnt from (
  select year, count(bigint_col), count(*) cnt, avg(int_col)
  from functional_parquet.alltypes
  where month=1
  group by year
)


Line 476: # Optimization is not applied because we are not scanning a Parquet table.
Remove. This case is already covered above.


http://gerrit.cloudera.org:8080/#/c/6812/3/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
File testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test:

Line 324: WARNING: The following tables are missing relevant table and/or column statistics.
Something wrong with your setup? This table should have stats in our dev setup.


http://gerrit.cloudera.org:8080/#/c/6812/3/testdata/workloads/functional-query/queries/QueryTest/aggregation.test
File testdata/workloads/functional-query/queries/QueryTest/aggregation.test:

Line 1259: # IMPALA-5036: Tests the correctness of the Parquet count(*) optimization.
Let's move these into a new .test file. Also no need to prefix IMPALA-5036 everywhere (that's
mostly for regression tests, but this is a new feature)


Line 1279: from functional_parquet.alltypes where month > 10 group by year, month
want to remove the predicate here (that case is explicitly called out below)


Line 1303: # IMPALA-5036: Parquet count(*) optimization with the result of the going into
a join.
some extra words


Line 1316: select 1 from functional_parquet.alltypes having count(*) > 1
Add a count(*) test against an empty table and a table where we filtered all partition columns
to make sure that 0 is returned.


http://gerrit.cloudera.org:8080/#/c/6812/3/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

Line 275:     exec_option = vector.get_value('exec_option')
Explain what this test is covering. Also would we get coverage of this automatically if we
moved this test into the existing .test file (where your other tests live)? In exhaustive
I believe we do run with different batch sizes including 1.

Did you confirm that these files have multiple row groups?

Might also be good to run a count(*) on these tables:
functional_parquet.lineitem_multiblock_one_row_group
functional_parquet.lineitem_sixblocks
functional_parquet.lineitem_multiblock

(You can take a why other tests use them)


-- 
To view, visit http://gerrit.cloudera.org:8080/6812
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I536b85c014821296aed68a0c68faadae96005e62
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message