impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4840: Fix REFRESH performance regression.
Date Wed, 15 Feb 2017 02:34:13 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-4840: Fix REFRESH performance regression.
......................................................................


Patch Set 1:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/6009/1//COMMIT_MSG
Commit Message:

Line 20: listStatus() on the partition directory and checking the
Mention that listStatus() is significantly faster than listFiles(), that's the fundamental
reason why this optimization works.


Line 22: metadata load still fetches the block locations in bulk.
Mention the behavioral change of REFRESH with respect to HDFS block movements.


http://gerrit.cloudera.org:8080/#/c/6009/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 341:   private int computeFdBlockMetadata(FileDescriptor fd, BlockLocation[] locations)
setBlockMetadata() or function/arg names that more clearly indicate the inputs/outputs

We are basically setting the block metadata in 'fd' based on its block locations.


Line 364:   }
newline


Line 783:    * modified files on HDFS.
Mention which case this function is optimized for, e.g., it will be fast if there are few
changes, but it will be slower than listFiles() if there are many changes. While doing that
you can mention the implementation of listStatus() followed by getFileBlockLocations() and
the 40x perf difference between listStatus() and listFiles() that you measured.


Line 785:   private void loadMetadataAndDiskIds(HdfsPartition partition) throws CatalogException
{
refreshFileMetadata() or something that indicates we are reusing existing file metadata?


Line 804:       if (!FileSystemUtil.supportsStorageIds(fs)) {
can move this up (avoid indexing all files in the partition)


Line 830:       if (unknownDiskIdCount > 0 && LOG.isWarnEnabled()) {
Do we really need this warning? It has the potential of log spew because it's printed per-partition.
Can we maybe add this warning after a complete REFRESH?


Line 835:       numHdfsFiles_ += newFileDescs.size();
where do we subtract the old values?


-- 
To view, visit http://gerrit.cloudera.org:8080/6009
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I859b9fe93563ba886d0b5db6db42a14c88caada8
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bharathv@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message