impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Marcel Kornacker (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs.
Date Sun, 21 May 2017 15:47:10 GMT
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs.

Patch Set 4:

File fe/src/main/java/org/apache/impala/catalog/

Line 809:   public HdfsPartition cloneNewFds(List<FileDescriptor> newFds) {
i don't think this is used anymore.
File fe/src/main/java/org/apache/impala/catalog/

Line 1961: 
> I tried integrating the sampling into computeScanRangeLocations() but ran i
i'd say it's cleaner than partition cloning, because the way the cloning works is incomprehensible
outside of the context of file sampling.

but you're right, returning as your map value a list<integer> and then constructing
a list<filedescriptor> from it later is inferior to simply returning list<filedescriptor>
values here directly.
File fe/src/main/java/org/apache/impala/catalog/

Line 1940:    * The 'percentBytes' parameter must be between 0 and 100.
i'd also point out that it allocates something like 12+ bytes per existing (pre-sampled) file,
and that the goal was to minimize that amount. just so the next person doesn't come in and
"simplify" this by allocating larger structures instead of the two scalar arrays.

i'm just worried that this will end up generating millions of intermittent objects for very
large tables (where it'll clearly be used, and where the sampling percentage itself might
be fairly small, ie, the query itself would run quickly).

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-HasComments: Yes

View raw message