impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-2373: Extrapolate row counts for HDFS tables.
Date Tue, 16 May 2017 05:37:49 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-2373: Extrapolate row counts for HDFS tables.

Patch Set 1:

File be/src/common/

PS1, Line 137: tables
> What does this mean? I thought we only extrapolate row count of partitions.
Removed the "table" part. I had it there because of unpartitioned tables for which we do extrapolate.
File be/src/util/

PS1, Line 36: DECLARE_bool
> nit: they seem to be grouped by type
File common/thrift/CatalogObjects.thrift:

PS1, Line 140: HDFS
> How about S3? Does it make sense to rename the total_hdfs_bytes into total_
1. We have this problem everywhere (e.g. HdfsScanNode, HdfsTable,, etc.).
These are really "FilesystemTables" or something like that.
2. It may be inaccurate, but consistent, i.e., I really mean this is only set for TTableType
3. Adding inconsistency seems worse to me. It may mislead readers to think it can be set for
HBase, Kudu and whatnot.

Renamed to total_file_bytes and modified comment.
File common/thrift/JniCatalog.thrift:

PS1, Line 493: !is_incremental
> nit: blending code in comments :) "is_incremental is false"

Line 494:   9: optional i64 total_hdfs_bytes
> why is this a parameter/an input of compute stats?
We need to pass this value from the impalad FE to the impala BE and then to the catalogd.
We currently set this value in ComputeStatsStmt.toThrift() which seems to me like the simplest

How else would we pass this value?
- invent a special Expr which we can put in the SQL of the table-stats child query
- compute it based on the table-stats child query query profile after the query completed
(profile counter currently does not exist, and is not easy to deduce, I checked)
- other ideas?
File fe/src/main/java/org/apache/impala/analysis/

PS1, Line 587: hdfsTable
> nit: inline
File fe/src/main/java/org/apache/impala/catalog/

PS1, Line 185: private long totalHdfsBytes_;
> Do we still need this?
Yes. This is the actual total number of bytes based on the current set of file descriptors.
We cannot replace it by the value stored in the stats just like we cannot replace a count(*)
query with the row count from the stats because the set of files may have changed since the
last stats computation.

PS1, Line 1328: tableStats_.num_rows
> getNumRows()?

PS1, Line 1912: getExtrapolatedNumRows(p.getSize())
> Why do we need to compute this if we already have p.getNumRows() and is >= 
We follow the same policy here as with computing the scan cardinality. We always extrapolate
because files could have been added/dropped from a partition since that partition's last stats
computation. It's a case we explicitly wanted to handle.

Added comment.

PS1, Line 1958: tableStats_.num_rows
> getNumRows()?
File fe/src/main/java/org/apache/impala/catalog/

PS1, Line 76: // estimated number of rows in table; -1: unknown.
> update comment

Line 492:     Preconditions.checkState(this instanceof HdfsTable);
> why have this function live here and not in hdfstable?
Good point. Done.
File fe/src/main/java/org/apache/impala/planner/

PS1, Line 616: computeStats(Analyzer analyzer) {
> The logic in this function is now very hard to follow. It's not clear when 
Added comments and spacing to separate logically different blocks. Hopefully, this helps clarify.
File testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test:

Line 114:    stats-rows=7300 extrapolated-rows=7300
> to reduce verbosity, print the extrapolated count only when it differs from
We're using up a line already, and I think it's easier for users to see that the two numbers
are the same as opposed to knowing that the code omits one of the numbers if it is he same
as the other one (explicit is easier to reason about vs. implied absence).
File testdata/workloads/functional-query/queries/QueryTest/alter-table.test:

> extrap is a bit weird, and we don't use abbreviations elsewhere here. spell
The reason why I did not spell it out is because it wastes a lot of space in the SHOW TABLE
STATS result table, but either way is fine with me.

So "EXTRAPOLATED #ROWS"? Want be be sure because it's very time consuming to change all these

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I972c8a03ed70211734631a7dc9085cb33622ebc4
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Dimitris Tsirogiannis <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-HasComments: Yes

View raw message