impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
Date Mon, 28 Nov 2016 23:27:59 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading

Patch Set 4:

File fe/src/main/java/org/apache/impala/catalog/

Line 37:     private static DiskIdMapper diskIdMapper = null;
why not:

public static DiskIdMapper INSTANCE = new DiskIdMapper()?

Line 50:     private static HashMap<String, Integer> storageIdToInt = Maps.newHashMap();
make members and methods non-static since we are using a singleton

Line 63:     public static int getDiskId(String host, String storageId) {

Line 72:       if (storageIdToInt.containsKey(storageId)) {
I did some more reading and I'm afraid this "double checked locking" scheme does not work
correctly in Java. The readers could return a ConcurrentModificationException or even corrupted

See for example:

I think we should make storageIdGenerator a ConcurrentHashMap to address the issue.

You can avoid doing two lookups here by doing get() and then checking for null.
File fe/src/main/java/org/apache/impala/catalog/

Line 120: import;
I think these might need cleanup.

Line 351:       Map<String, Map<String, FileDescriptor>> newFileDescMap = Maps.newHashMap();
What's the benefit of populating this temporary map and then merging it into the global one?
Why does updating the global one directly not work?

Line 382:           // Loop over all blocks in the file.
comment doesn't add anything, remove

Line 384:             Preconditions.checkNotNull(loc);
Not your change, but this looks weird, why would it be null? Remove?

Line 386:             String[] blockHostPorts = loc.getNames();
Let's just use loc.getNames() and loc.getHosts() inside the loop in L403. I don't see any
benefit to having these as separate variables and their surrounding comments, do you?

Line 395:                     Sets.newHashSet(Arrays.asList(loc.getCachedHosts()));
I think we can get rid of the asList() here

Line 399:             // to hostMap_/hostList_. The host ID (index in to the hostList_) for
update comment, "hostMap_" doesn't seem to exist

Line 413:           // Remember the THdfsFileBlocks and corresponding BlockLocations.  Once
all the
is this comment still accurate?

Line 423:       for (List<HdfsPartition> partLists: filteredParts.values()) {
Seems cheaper to loop on newFileDescMap on look up filteredParts

Line 431:           numHdfsFiles_ += fileDescs.size();
Why do the number of files and bytes always keep increasing?

Line 458:                 fileStatus.getModificationTime());
indent 4 (and same issue elsewhere)

Line 468:         // All the partitions corresponding to the same partition path, have the
I don't think this is necessarily true, but you can state that we assume it is so.

Line 483:         for (HdfsPartition partition: partitions) {
This function looks much more expensive than before. We loop over all partition paths in L451
and here we loop over all partitions, so that's O(N^2) in the number of partitions.

Line 780:     Map<FsKey, FileBlocksInfo> blocksToLoad = Maps.newHashMap();

Line 784:     HashMap<Path, List<HdfsPartition>> filteredParts = Maps.newHashMap();
The var name filteredParts is a little confusing to me. How about partsByPath or something
like that. The "filtered" part is a little unclear.

Line 830:         if (filteredParts.containsKey(partDir)) filteredParts.get(partDir).add(partition);
use {}

use get() and check for null to avoid multiple lookups

Line 833:         // Check if the partition directory's is a child dir of hdfsBaseDir_
typo: directory's -> directory

Line 835:             !FileSystemUtil.isChildPath(partDir, getHdfsBaseDirPath())) {
use tblLocation, that's clearer (and in comment above)

Line 836:           // Add the partition to the list of paths to load block MD.
Comment only describes the code below, maybe say something like:

// This partition has a custom filesystem location. Load its file/block metadata separately
by adding it to the list of dirs to load.

Line 868:       loadBlockMetadata(fs, location, filteredParts, fileBlocksToLoad);
Having separate loadBlockMetadata() and loadDiskIds() doesn't seem to make much sense. We're
spending a lot of effort to populate the temporary fileBlocksToLoad map just so we can maintain
the separation (which does not seem to make sense).

Is there a reason why we could not want to load the block md and disk ids at the same time
(without this tmp map)?

Line 934:     boolean isMarkedCached = isMarkedCached_;
looks unused, are you sure nothing got broken here?

Line 953:       try {
not your change, but you can do this instead:

for (Expr v: keyValues) v.analyzeNoThrow(null);

Line 1583:     boolean isMarkedCached = isMarkedCached_;
looks unused, sure nothing is broken here?

Line 1628:         totalHdfsBytes_ += hdfsPart.getSize();
Why can't this stay inside addPartition()? In the loading case the size and number of file
descriptors should be 0, so adding that should be harmless.
File fe/src/main/java/org/apache/impala/common/

Line 423:     if (p.isRoot()) return false;
isn't the equals() below sufficient?

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Bharath Vissapragada <>
Gerrit-Reviewer: Mostafa Mokhtar <>
Gerrit-HasComments: Yes

View raw message