impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Taras Bobrovytsky (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5036: Parquet count star optimization
Date Fri, 30 Jun 2017 00:05:21 GMT
Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-5036: Parquet count star optimization
......................................................................


Patch Set 5:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/6812/5/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java:

Line 886:         prefix + "8InitZeroIN10impala_udf9BigIntValEEEvPNS2_15FunctionContextEPT_",
> rather than defining a new builtin, why don't we wrap the sum() with a coal
I gave this a try, but ran into a problem. I tried substituting a count(*) with a zeroifnull(sum(parquet_num_rows)).
The aggregation node can only evaluate agg functions, and zeroifnull is not.

A potential way out is to modify the merge function so that the input to it is wrapped in
a zeroifnull(), but I think that's complicated. We would have make modifications in 2 places
if we go that route: replace count() with sum() and wrap the input to the merge function.
Simply adding a builtin seems cleaner to me.


http://gerrit.cloudera.org:8080/#/c/6812/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

PS5, Line 105:  The caller passes in a flag to the constructor that indicates if the count(*)
             :  * optimization can be applied to the query block of this scan. This scan node
decides
             :  * whether to apply the optimization or not
> This doesn't make a lot of sense to me. On one hand, it sayst he caller dec
Reworded slightly. (The main idea is that they are both involved in the decision).


Line 109:  *
> do we actually explain in some comment how this optimization works at a hig
This is the only one (I think it's pretty high level).


PS5, Line 140:  This
             :   // scan does additional analysis in init() to determine whether it is correct
to apply
             :   // the optimization.
> this seems to contradict the comment above that says the caller passes a fl
The optimization is applied if 2 things are true: the query block meets certain requirements
and the scan node meets certain requirements. This flag indicates if the query block meets
the requirements.


http://gerrit.cloudera.org:8080/#/c/6812/5/testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test:

Line 86: # Verify that 0 is returned when we are selecting from an empty table and the optimization
> I think it's sufficient to say:
Done


http://gerrit.cloudera.org:8080/#/c/6812/4/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

Line 280:     vector.get_value('exec_option')['batch_size'] = 1
> The core tests we run and the dimensions we use seem somewhat broken/scary 
Created IMPALA-5603.


-- 
To view, visit http://gerrit.cloudera.org:8080/6812
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I536b85c014821296aed68a0c68faadae96005e62
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhecht@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Gerrit-Reviewer: Zach Amsden <zamsden@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message