impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-2373: Extrapolate row counts for HDFS tables.
Date Tue, 23 May 2017 06:36:45 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-2373: Extrapolate row counts for HDFS tables.
......................................................................


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/6840/3/fe/src/main/java/org/apache/impala/catalog/View.java
File fe/src/main/java/org/apache/impala/catalog/View.java:

PS3, Line 118: tableStats_ = new TTableStats(-1);
             :       tableStats_.setTotal_file_bytes(-1);
> I believe you don't need it now that you put the initialization in Table's 
Correct. Done.


http://gerrit.cloudera.org:8080/#/c/6840/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

PS3, Line 617: computeStats(Analyzer analyzer) {
> Significantly better. Thanks
Thanks for the suggestion!


PS3, Line 655: statsNumRows_, totalBytes_
> There are class members, why are you passing them as parameter? Is this fun
I find params easier to reason about than member vars. Removed the params.


PS3, Line 673: are ignored
> "because...". You may want to briefly mention why this is the case.
Done


PS3, Line 683: extrapolatedNumRows_ = tbl_.getExtrapolatedNumRows(totalBytes);
> This doesn't necessarily belong in this function. I'd move it to L654 since
Done. I think this means I need to move the comment around extrapolation to computeStats().
Did that.


http://gerrit.cloudera.org:8080/#/c/6840/3/tests/metadata/test_explain.py
File tests/metadata/test_explain.py:

PS3, Line 127: 50
> Hm, I would expect this to be '100' since the query doesn't filter any part
This specific test case may seem clear cut, but I don't think the general issue is that simple.

I'm not sure it is a good idea to special case the "all partitions selected" case because
I think it makes the behavior harder to reason about for users and for us. Think about the
following scenarios:

1. Select all partitions; all partitions have stats
2. Select all partitions; some partitions have stats
3. Select all partitions; no partitions have stats
4. Select some partitions; all selected partitions have stats
5. Select some partitions; some selected partitions have stats
6. Select some partitions; no selected partitions have stats

In the current code there are two cases: Always use the partition stats and only fall back
to table stats when no partition stats are available. That behavior is easy to understand
and explain.

The proposed change might break existing workflows around manually setting stats. For example,
if someone has a workflow that involves setting only the per-partition row counts (but not
the table row counts), then using the table-level stats may break them. You would basically
always need to update both the partition and table-level stats or else risk inconsistent planning
behavior depending on how many partitions you are selecting.

We can continue thinking about this change, but if we decide to change anything I think it
should be done as part of a separate JIRA and documented very carefully.


PS3, Line 131:   
> nit: extra space
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I972c8a03ed70211734631a7dc9085cb33622ebc4
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message