impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Marcel Kornacker (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs.
Date Tue, 16 May 2017 10:56:52 GMT
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs.

Patch Set 3:

File fe/src/main/java/org/apache/impala/analysis/

Line 68:     if (randomSeed_ != null) {
single line?
File fe/src/main/java/org/apache/impala/catalog/

Line 812:     result.numRows_ = numRows_;
this is a bit weird outside of the context of sampling (why should #rows stay the same for
a different set of files?).

instead of doing the sampling in the metadata, would it make more sense to do it during scan
range computation?
File fe/src/main/java/org/apache/impala/catalog/

Line 1961:     List<Pair<Long, FileDescriptor>> allFiles =
this is an expensive list, and i don't think you need it. you can achieve the same effect
with a list that mirrors orderedparts and contains the cumulative # of files up to the preceding
partition (e.g., {0, 517, 1120, ...}). you then generate a random sequence of numbers between
0 and #totalfiles-1. this requires space proportional to the sample size, but not the total
number of files.

you could even return this as a map<partitionid, set<int>> (the value being the
indices into the per-partition list of file descriptors), and have hdfsscannode use that to
access the sampled files in the original partitions.
File fe/src/main/java/org/apache/impala/planner/

Line 132:   // Total number of files from the effective partitions (partitions_ or sample_).
update comment

Line 135:   // Total number of bytes from the effective partitions (partitions_ or sample_).

Line 678:           cardinality_ = addCardinalities(cardinality_, p.getNumRows());
shouldn't the cardinality reflect the reduction from sampling? if you're sampling, you reduce
scan output, which should affect join order, no?
File tests/query_test/

Line 47:     # 2. The results of queries without a repeatable clause could be change due to
remove "be"

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-HasComments: Yes

View raw message