impala-reviews mailing list archives

From "Alex Behm (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5036: Parquet count star optimization
Date Tue, 09 May 2017 01:58:10 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-5036: Parquet count star optimization
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6812/1//COMMIT_MSG
Commit Message:

PS1, Line 10: statistic
> How about "we use the Parquet field RowGroup.num_rows"?
Works for me.


http://gerrit.cloudera.org:8080/#/c/6812/1/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 440:       *dst_slot = file_metadata_.row_groups[row_group_idx_].num_rows;
> There's also FileMetaData::num_rows. Can't we use that instead of looping o
We could, but I'm not sure it's worth it. One scanner does not necessarily process an entire
Parquet file, so we'd need to make sure that exactly one scanner thread deals with the entire
file just for this special case. Taras, maybe you can take a look and see how invasive that
would be?
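
To illustrate why the per-row-group count is safe however a file is split across scanners,
here is a minimal sketch. The structs are hypothetical stand-ins that model only the num_rows
fields, and CountRowsInRange and MakeSampleFile are made-up helpers, not the actual scanner code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the Parquet metadata structs; only the
// fields discussed in this review are modeled.
struct RowGroup {
  int64_t num_rows;
};

struct FileMetaData {
  int64_t num_rows;  // total number of rows in the file
  std::vector<RowGroup> row_groups;
};

// Sums RowGroup.num_rows over the half-open range [begin, end) of row
// groups assigned to one scanner. Hypothetical helper for illustration.
int64_t CountRowsInRange(const FileMetaData& md, size_t begin, size_t end) {
  assert(begin <= end);
  int64_t total = 0;
  for (size_t i = begin; i < end && i < md.row_groups.size(); ++i) {
    total += md.row_groups[i].num_rows;
  }
  return total;
}

// Builds a sample file with three row groups (100 + 250 + 50 rows).
FileMetaData MakeSampleFile() {
  FileMetaData md;
  md.row_groups.push_back(RowGroup{100});
  md.row_groups.push_back(RowGroup{250});
  md.row_groups.push_back(RowGroup{50});
  md.num_rows = 400;
  return md;
}
```

If one scanner takes row groups [0, 2) and another takes [2, 3), the partial counts add up to
FileMetaData::num_rows without either scanner needing to own the whole file. Reading
FileMetaData::num_rows directly would only be correct if exactly one scanner reported it per file.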


Line 1455:     // Column readers are not needed because we are not reading from any columns if this
> Can we then optimize something like 
The transformation is only valid if l_comment is non-nullable. We have no concept of nullability
for HDFS tables.
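
A small sketch of why nullability matters here, assuming the standard SQL semantics where
count(col) skips NULLs and count(*) does not. NullableString and both functions are hypothetical
names used only for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of one value of a nullable string column.
struct NullableString {
  bool is_null;
  std::string value;
};

// COUNT(*): every row counts, whatever the column holds.
int64_t CountStar(const std::vector<NullableString>& col) {
  return static_cast<int64_t>(col.size());
}

// COUNT(col): rows where the column is NULL are skipped.
int64_t CountColumn(const std::vector<NullableString>& col) {
  int64_t n = 0;
  for (const NullableString& v : col) {
    if (!v.is_null) ++n;
  }
  return n;
}

// Three rows, one of them NULL: the two counts disagree.
void Example() {
  std::vector<NullableString> col;
  col.push_back(NullableString{false, "a"});
  col.push_back(NullableString{true, ""});
  col.push_back(NullableString{false, "b"});
  assert(CountStar(col) == 3);
  assert(CountColumn(col) == 2);
}
```

Since RowGroup.num_rows corresponds to the count(*) case, it can only stand in for count(col)
when the column is known to contain no NULLs.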


http://gerrit.cloudera.org:8080/#/c/6812/1/testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test:

Line 34: |  output: sum_zero_if_empty(functional_parquet.alltypes.parquet-stats: num_rows)
> i don't know what this means.
Personally, I prefer to show what is actually being executed in the explain plan. Otherwise,
if something goes wrong it could be hard to debug because we do not know which code path it
is taking.

Do you have an alternative proposal for showing that the optimized path is being taken? How
would we debug/support/test this feature? How will users understand their query plan?

Let's start a thread/doc about these usability/supportability issues.


-- 
To view, visit http://gerrit.cloudera.org:8080/6812
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I536b85c014821296aed68a0c68faadae96005e62
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-HasComments: Yes
