impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Marcel Kornacker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs.
Date Tue, 16 May 2017 10:56:52 GMT
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs.
......................................................................


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/analysis/TableSampleClause.java
File fe/src/main/java/org/apache/impala/analysis/TableSampleClause.java:

Line 68:     if (randomSeed_ != null) {
single line?


http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

Line 812:     result.numRows_ = numRows_;
this is a bit weird outside of the context of sampling (why should #rows stay the same for
a different set of files?).

instead of doing the sampling in the metadata, would it make more sense to do it during scan
range computation?


http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 1961:     List<Pair<Long, FileDescriptor>> allFiles =
this is an expensive list, and i don't think you need it. you can achieve the same effect
with a list that mirrors orderedparts and contains the cumulative # of files up to the preceding
partition (e.g., {0, 517, 1120, ...}). you then generate a random sequence of numbers between
0 and #totalfiles-1. this requires space proportional to the sample size, but not the total
number of files.

you could even return this as a map<partitionid, set<int>> (the value being the
indices into the per-partition list of file descriptors), and have hdfsscannode use that to
access the sampled files in the original partitions.


http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 132:   // Total number of files from the effective partitions (partitions_ or sample_).
update comment


Line 135:   // Total number of bytes from the effective partitions (partitions_ or sample_).
same


Line 678:           cardinality_ = addCardinalities(cardinality_, p.getNumRows());
shouldn't the cardinality reflect the reduction from sampling? if you're sampling, you reduce
scan output, which should affect join order, no?


http://gerrit.cloudera.org:8080/#/c/6868/3/tests/query_test/test_tablesample.py
File tests/query_test/test_tablesample.py:

Line 47:     # 2. The results of queries without a repeatable clause could be change due to
remove "be"


-- 
To view, visit http://gerrit.cloudera.org:8080/6868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message