From "Taras Bobrovytsky (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-5036: Parquet count star optimization
Date Thu, 15 Jun 2017 01:01:45 GMT
Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-5036: Parquet count star optimization

Patch Set 3:

File be/src/exec/

Line 445:       *dst_slot = file_metadata_.row_groups[row_group_idx_].num_rows;
> Bounds check against file_metadata_.num_rows (i.e. keep a running counter a
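
For illustration, a minimal sketch of the idea discussed at this line, written in Python rather than the backend's C++. One value is emitted per row group from the footer metadata, and a running counter is checked against the file-level row count. `RowGroup`, `FileMetaData`, and `row_group_counts` are hypothetical stand-ins, not Impala classes, and raising an error on a mismatch is an assumption about how a bad footer would be handled.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class RowGroup:
    # Hypothetical stand-in for one entry of file_metadata_.row_groups.
    num_rows: int


@dataclass
class FileMetaData:
    # Hypothetical stand-in for the Parquet footer metadata.
    num_rows: int
    row_groups: List[RowGroup]


def row_group_counts(meta: FileMetaData) -> Iterator[int]:
    """Yield num_rows for each row group, bounds-checked against the file total."""
    running = 0
    for idx, row_group in enumerate(meta.row_groups):
        running += row_group.num_rows
        # Assumption: a running total above the file-level count means the
        # footer is inconsistent, so stop instead of returning a wrong count.
        if running > meta.num_rows:
            raise ValueError(
                f"row group {idx}: running total {running} exceeds "
                f"file num_rows {meta.num_rows}")
        yield row_group.num_rows
```

Summing the yielded values gives the count(*) result for the file without reading any column data.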

Line 452:   }
> Why not else if as in the previous patch set? Else-if seems more accurate.
Reverted to else if. (I don't think it matters whether we use else if or not; the behavior is
identical in both cases.)

Line 454:   if (scan_node_->IsZeroSlotTableScan()) {
> Why is this optimization not redundant now?  Maybe update the comment to in
File common/thrift/PlanNodes.thrift:

Line 226:   11: optional i64 parquet_count_star_slot_offset
> Would it be simpler to have this be one parameter and indicate truth by pas
Yes, I did something similar. (It's now true if this parameter is set.)
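
A minimal sketch of the "presence means enabled" convention, in Python for brevity. `ScanNodeParams` and `count_star_enabled` are hypothetical names that model an optional Thrift field as `None` when unset; this is not generated Thrift code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanNodeParams:
    # Hypothetical model of the optional field: None means "not set".
    parquet_count_star_slot_offset: Optional[int] = None


def count_star_enabled(params: ScanNodeParams) -> bool:
    """The optimization is on exactly when the offset field is set."""
    # Compare against None explicitly: an offset of 0 is a valid value
    # and must not be mistaken for "unset".
    return params.parquet_count_star_slot_offset is not None
```

With this convention a separate boolean flag carries no extra information, since it would always equal the result of the presence check.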

Line 226:   11: optional i64 parquet_count_star_slot_offset
> i32 right?
Ah yes, because it's an int rather than a long in Java. Done.
File fe/src/main/java/org/apache/impala/planner/

Line 248:    * Adds a new slot descriptor to the tuple descriptor of this scan. Also adds an entry
> * explain what is going to be stored in this new slot descriptor

Line 249:    * to 'optimizedAggSmap_' that replaces a count() with a special sum() function
> that substitutes count(*) with sum_init_zero(<new-slotref>)
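
A rough sketch of why the substitution uses a sum that starts at zero, assuming (as the name suggests) that sum_init_zero behaves like sum() except that it returns 0 rather than NULL on empty input. The Python functions below are illustrative models, not Impala code, and `sql_sum` is a hypothetical helper name.

```python
from typing import Iterable, Optional


def sql_sum(values: Iterable[int]) -> Optional[int]:
    """Model of SQL sum(): returns None (NULL) when there are no input rows."""
    total: Optional[int] = None
    for value in values:
        total = value if total is None else total + value
    return total


def sum_init_zero(values: Iterable[int]) -> int:
    """Model of a sum that starts from 0, so empty input yields 0 like count(*)."""
    total = 0
    for value in values:
        total += value
    return total


# Each input value stands for the num_rows of one row group.
per_row_group_num_rows = [4, 6]
```

Here `sum_init_zero(per_row_group_num_rows)` matches what count(*) would return over the same rows, and the empty case stays 0 instead of NULL.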

Line 915:     msg.hdfs_scan_node.setOptimize_parquet_count_star(optimizedAggSmap_ != null);
> Do we need to pass this to the BE? The presence/absence of the parquet_coun
File fe/src/main/java/org/apache/impala/planner/

Line 1213:    * table scans.
> instead of scanning the table (fix other places below also)
File fe/src/test/java/org/apache/impala/planner/

Line 290:   public void testParquetStats() { runPlannerTestFile("parquet-stats-agg"); }
> testParquetStatsAgg()
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test:

Line 1: # Verify that that the parquet count(*) optimization is applied in all the cases.
> spell out "in all the cases" a little more and also mention that in one cas

Line 22: |  |  output: sum_init_zero(functional_parquet.alltypes.parquet-stats: num_rows)
> Can we reduce this to just parquet-stats.num_rows? How do we create such a 
The slot descriptor label gets printed here that is set on line 263 in
The full path is printed by default. Are you suggesting adding some extra plumbing to change
how labels get printed?

> Remove here and all tests below. I think showing the distributed plan for t

Line 114: select month, count(*) from functional_parquet.alltypes group by month, year
> Add a negative test for this one:
Added a select count(year) from alltypes.

Line 172: select max(year), count(*) from functional_parquet.alltypes
> use avg() instead of max() because max() is going to be optimized in the sa

Line 195: # IMPALA-5036
> JIRA number is not very descriptive. Describe what this test case is checki
Rewrote. Still feels like the description is not quite right.

Line 278: # The count(*) optimization is applied to the inline view even if there is a join.
> Add a negative test case that shows the query block must have one table ref

Line 352: # tinyint_col is not partitioned so the optimization is disabled.
> tinyint_col is not a partition column

Line 402: # Optimization is not applied in the case of count(null).
> is not applied to count(null)

Line 451: # Optimization is not applied because the count(*) is not applied directly to the
> Optimization is not applied across query blocks, even though it would be co

Line 453: select count(*) from ( select int_col from functional_parquet.alltypes) t
> Add a new test that shows we only consider materialized agg exprs, somethin

Line 476: # Optimization is not applied because we are not scanning a Parquet table.
> Remove. This case is already covered above.
File testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test:

Line 324: WARNING: The following tables are missing relevant table and/or column statistics.
> Something wrong with your setup? This table should have stats in our dev se
This has nothing to do with my setup. It passes a private build on Jenkins. I think the warning
appears because the table is now scanned differently. I'll investigate further to see what's
going on.
File testdata/workloads/functional-query/queries/QueryTest/aggregation.test:

Line 1259: # IMPALA-5036: Tests the correctness of the Parquet count(*) optimization.
> Let's move these into a new .test file. Also no need to prefix IMPALA-5036 

Line 1279: from functional_parquet.alltypes where month > 10 group by year, month
> want to remove the predicate here (that case is explicitly called out below

Line 1303: # IMPALA-5036: Parquet count(*) optimization with the result of the going into a join.
> some extra words

Line 1316: select 1 from functional_parquet.alltypes having count(*) > 1
> Add a count(*) test against an empty table and a table where we filtered al
File tests/query_test/

Line 275:     exec_option = vector.get_value('exec_option')
> Explain what this test is covering. Also would we get coverage of this auto
Moved the test.

Yes, I confirmed that these files have multiple row groups. I also added runs of the test on the
tables you suggested. (Those also have many row groups.)

