impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bharath Vissapragada (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
Date Tue, 29 Nov 2016 04:34:55 GMT
Bharath Vissapragada has posted comments on this change.

Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
......................................................................


Patch Set 6:

(29 comments)

http://gerrit.cloudera.org:8080/#/c/5148/4/fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java
File fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java:

Line 37: 
> why not:
Done


Line 50:     // the storage ID of a particular disk is unique across all the nodes in the
cluster.
> make members and methods non-static since we are using a singleton
Done


Line 63:      * the initial metadata load. Figure out ways to fix it using finer locking scheme.
> storageUuid
Done


Line 72:       // amount of data loading is done and storageIdtoInt is populated with storage
IDs
> I did some more reading and I'm afraid this "double checked locking" scheme
Done


http://gerrit.cloudera.org:8080/#/c/5148/4/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 120: import java.io.IOException;
> I think these might need cleanup.
My editor is messing up these imports. Let me try fixing them by hand.


Line 351:       Map<String, Map<String, FileDescriptor>> newFileDescMap = Maps.newHashMap();
> What's the benefit of populating this temporary map and then merging it int
I did that initially but the problem happens when one of the files is deleted. For example,
consider the following sequence of actions.

- create partition p1 with file f1
- perPartitionFileDescMap_ contains f1
- insert overwrite p1 select * from random_table, say this overwrites f1 and creates f2.
- In such a case, we loop p1, and add f2 to perPartitionFileDescMap_ but we still have f1
at the end of it.
- So we need to scan all entries in perPartitionFileDescMap_ and figure out the difference
with listStatus() and remove them. This is a costly operation.
- Easier way is to build a newFdMap with the current snapshot of listStatus() and replace
it in perPartitionFileDescMap_

Also this way I felt the code is more readable. This is arguable, but I felt so.


Line 382:           // Loop over all blocks in the file.
> comment doesn't add anything, remove
Done


Line 384:             Preconditions.checkNotNull(loc);
> Not your change, but this looks weird, why would it be null? Remove?
Done


Line 386:             String[] blockHostPorts = loc.getNames();
> Let's just use loc.getNames() and loc.getHosts() inside the loop in L403. I
We are using them in preconditions check in L396 and loop condition in L403 and L405,L408
for index based access. Let me know if you still think we should remove it.


Line 395:                     Sets.newHashSet(Arrays.asList(loc.getCachedHosts()));
> I think we can get rid of the asList() here
nice catch. done


Line 399:             // to hostMap_/hostList_. The host ID (index in to the hostList_) for
each
> update comment, "hostMap_" doesn't seem to exist
Remnants of old clean-up. This logic is now merged into hostIndex_. Updated comment accordingly.


Line 413:           // Remember the THdfsFileBlocks and corresponding BlockLocations.  Once
all the
> is this comment still accurate?
Yes, we basically add all these blocks to perFsFileBlocks and run getDiskId() on it in bulk.


Line 431:           numHdfsFiles_ += fileDescs.size();
> Why do the number of files and bytes always keep increasing?
I think its due to the current design. The reload is implemented as dropPartition() + creatPartition().
Also we resetPartitions() at few places to make it 0. Basically we never delete entries from
it, we just overwrite them with new values.


Line 458:                 fileStatus.getModificationTime());
> indent 4 (and same issue elsewhere)
Fixed my editor settings now. Done.


Line 468:         // All the partitions corresponding to the same partition path, have the
same
> I don't think this is necessarily true, but you can state that we assume it
Ok, modified the comment.


Line 483:         for (HdfsPartition partition: partitions) {
> This function looks much more expensive than before. We loop over all parti
Although it looks like O(N^2), it isn't. 'partitions' correspond to each partition that maps
to a single location. So if we flatten List<List<Partition>> its actually all
the partitions in the table. Basically we don't loop like a conventional nested loop "for
each partition, for each fileStatus". We loop through the pruned partition list i.e, partitions
corresponding to that path.


Line 780:     Map<FsKey, FileBlocksInfo> blocksToLoad = Maps.newHashMap();
> unused?
Done


Line 784:     HashMap<Path, List<HdfsPartition>> filteredParts = Maps.newHashMap();
> The var name filteredParts is a little confusing to me. How about partsByPa
Done


Line 830:         if (filteredParts.containsKey(partDir)) filteredParts.get(partDir).add(partition);
> use {}
Done


Line 833:         // Check if the partition directory's is a child dir of hdfsBaseDir_
> typo: directory's -> directory
Done


Line 835:             !FileSystemUtil.isChildPath(partDir, getHdfsBaseDirPath())) {
> use tblLocation, that's clearer (and in comment above)
Done


Line 836:           // Add the partition to the list of paths to load block MD.
> Comment only describes the code below, maybe say something like:
Done


Line 868:       loadBlockMetadata(fs, location, filteredParts, fileBlocksToLoad);
> Having separate loadBlockMetadata() and loadDiskIds() doesn't seem to make 
Yea it doesn't make sense. I wanted to refactor that but I forgot. Thanks for pointing it
out.


Line 934:     boolean isMarkedCached = isMarkedCached_;
> looks unused, are you sure nothing got broken here?
Yep, wanted to discuss about this. I'll ping you offline.


Line 953:       try {
> not your change, but you can do this instead:
Done


Line 1583:     boolean isMarkedCached = isMarkedCached_;
> looks unused, sure nothing is broken here?
Same as above. I wanted to discuss this offline.


Line 1628:         totalHdfsBytes_ += hdfsPart.getSize();
> Why can't this stay inside addPartition()? In the loading case the size and
Yea, now that I think about it, it can be in addPartition(). Moved them.


http://gerrit.cloudera.org:8080/#/c/5148/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 347:       LOG.debug("Loading block md for " + name_ + " directory " + dirPath.toString());
> Please add If(LOG.isDebugEnabled()) to avoid calling the dirPath.toString()
Done. Added it in a couple of places that can potentially spam.


http://gerrit.cloudera.org:8080/#/c/5148/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 423:     if (p.isRoot()) return false;
> isn't the equals() below sufficient?
That returns true for isChildPath(root, root)?


-- 
To view, visit http://gerrit.cloudera.org:8080/5148
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bharathv@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bharathv@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message