impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
Date Mon, 28 Nov 2016 23:27:59 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
......................................................................


Patch Set 4:

(29 comments)

http://gerrit.cloudera.org:8080/#/c/5148/4/fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java
File fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java:

Line 37:     private static DiskIdMapper diskIdMapper = null;
why not:

public static DiskIdMapper INSTANCE = new DiskIdMapper()?


Line 50:     private static HashMap<String, Integer> storageIdToInt = Maps.newHashMap();
make members and methods non-static since we are using a singleton


Line 63:     public static int getDiskId(String host, String storageId) {
storageUuid


Line 72:       if (storageIdToInt.containsKey(storageId)) {
I did some more reading and I'm afraid this "double checked locking" scheme does not work
correctly in Java. The readers could return a ConcurrentModificationException or even corrupted
data.

See for example:
http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html

I think we should make storageIdGenerator a ConcurrentHashMap to address the issue.


You can avoid doing two lookups here by doing get() and then checking for null.


http://gerrit.cloudera.org:8080/#/c/5148/4/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 120: import java.io.IOException;
I think these might need cleanup.


Line 351:       Map<String, Map<String, FileDescriptor>> newFileDescMap = Maps.newHashMap();
What's the benefit of populating this temporary map and then merging it into the global one?
Why does updating the global one directly not work?


Line 382:           // Loop over all blocks in the file.
comment doesn't add anything, remove


Line 384:             Preconditions.checkNotNull(loc);
Not your change, but this looks weird, why would it be null? Remove?


Line 386:             String[] blockHostPorts = loc.getNames();
Let's just use loc.getNames() and loc.getHosts() inside the loop in L403. I don't see any
benefit to having these as separate variables and their surrounding comments, do you?


Line 395:                     Sets.newHashSet(Arrays.asList(loc.getCachedHosts()));
I think we can get rid of the asList() here


Line 399:             // to hostMap_/hostList_. The host ID (index in to the hostList_) for
each
update comment, "hostMap_" doesn't seem to exist


Line 413:           // Remember the THdfsFileBlocks and corresponding BlockLocations.  Once
all the
is this comment still accurate?


Line 423:       for (List<HdfsPartition> partLists: filteredParts.values()) {
Seems cheaper to loop on newFileDescMap on look up filteredParts


Line 431:           numHdfsFiles_ += fileDescs.size();
Why do the number of files and bytes always keep increasing?


Line 458:                 fileStatus.getModificationTime());
indent 4 (and same issue elsewhere)


Line 468:         // All the partitions corresponding to the same partition path, have the
same
I don't think this is necessarily true, but you can state that we assume it is so.


Line 483:         for (HdfsPartition partition: partitions) {
This function looks much more expensive than before. We loop over all partition paths in L451
and here we loop over all partitions, so that's O(N^2) in the number of partitions.


Line 780:     Map<FsKey, FileBlocksInfo> blocksToLoad = Maps.newHashMap();
unused?


Line 784:     HashMap<Path, List<HdfsPartition>> filteredParts = Maps.newHashMap();
The var name filteredParts is a little confusing to me. How about partsByPath or something
like that. The "filtered" part is a little unclear.


Line 830:         if (filteredParts.containsKey(partDir)) filteredParts.get(partDir).add(partition);
use {}

use get() and check for null to avoid multiple lookups


Line 833:         // Check if the partition directory's is a child dir of hdfsBaseDir_
typo: directory's -> directory


Line 835:             !FileSystemUtil.isChildPath(partDir, getHdfsBaseDirPath())) {
use tblLocation, that's clearer (and in comment above)


Line 836:           // Add the partition to the list of paths to load block MD.
Comment only describes the code below, maybe say something like:

// This partition has a custom filesystem location. Load its file/block metadata separately
by adding it to the list of dirs to load.


Line 868:       loadBlockMetadata(fs, location, filteredParts, fileBlocksToLoad);
Having separate loadBlockMetadata() and loadDiskIds() doesn't seem to make much sense. We're
spending a lot of effort to populate the temporary fileBlocksToLoad map just so we can maintain
the separation (which does not seem to make sense).

Is there a reason why we could not want to load the block md and disk ids at the same time
(without this tmp map)?


Line 934:     boolean isMarkedCached = isMarkedCached_;
looks unused, are you sure nothing got broken here?


Line 953:       try {
not your change, but you can do this instead:

for (Expr v: keyValues) v.analyzeNoThrow(null);


Line 1583:     boolean isMarkedCached = isMarkedCached_;
looks unused, sure nothing is broken here?


Line 1628:         totalHdfsBytes_ += hdfsPart.getSize();
Why can't this stay inside addPartition()? In the loading case the size and number of file
descriptors should be 0, so adding that should be harmless.


http://gerrit.cloudera.org:8080/#/c/5148/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 423:     if (p.isRoot()) return false;
isn't the equals() below sufficient?


-- 
To view, visit http://gerrit.cloudera.org:8080/5148
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bharathv@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bharathv@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message