impala-reviews mailing list archives

From "Alex Behm (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
Date Tue, 22 Nov 2016 05:30:47 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
......................................................................


Patch Set 2:

(18 comments)

High-level comments. Let's get through them first and then I can dig in deeper.

http://gerrit.cloudera.org:8080/#/c/5148/2/fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java
File fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java:

Line 27:  * This class encapsulates the logic required for mapping HDFS
Much fluff; simplify to: Maps HDFS storage-UUIDs to per-host 0-based, sequential disk ids
required by Impala backends.


Line 29:  * by the backend.
required by Impala backends


Line 31: public class DiskIdMapper {
Make this a singleton to clarify the intended use.
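
Something along these lines (just a sketch to illustrate the shape; everything except the
DiskIdMapper name is made up):

  public class DiskIdMapper {
    // Single shared instance so all tables see the same host/storage-UUID mapping.
    public static final DiskIdMapper INSTANCE = new DiskIdMapper();

    // Private constructor prevents other instances from being created.
    private DiskIdMapper() {}
  }

Callers would then go through DiskIdMapper.INSTANCE instead of static fields.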


Line 36:     // This is intentionally implemented as a static object to share the storageIDs across
Move this part to the class comment.


Line 38:     // consistent across all the tables, without which, we can potentially confuse the
Also mention that the static solution requires the least amount of memory.


Line 48:     private static ConcurrentHashMap<String, AtomicInteger>
I don't think we need a ConcurrentHashMap or AtomicInteger here: whenever we update this
entry we also need to protect storageIdToInt, so a coarser-grained lock is required anyway.
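
Roughly what I have in mind, as a sketch (field and method names are illustrative, not the
patch's):

  import java.util.HashMap;
  import java.util.Map;

  // Sketch only: plain HashMaps guarded by one lock, so the per-host counter and the
  // storage-UUID mapping can never get out of sync.
  public class DiskIdMapperSketch {
    private final Map<String, Integer> storageIdGenerator = new HashMap<>();
    private final Map<String, Integer> storageUuidToDiskId = new HashMap<>();

    public synchronized int getDiskId(String host, String storageUuid) {
      String key = host + "/" + storageUuid;
      Integer diskId = storageUuidToDiskId.get(key);
      if (diskId == null) {
        // Assign the next 0-based id for this host under the same lock.
        diskId = storageIdGenerator.merge(host, 1, Integer::sum) - 1;
        storageUuidToDiskId.put(key, diskId);
      }
      return diskId;
    }
  }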


Line 53:      * TODO (Bharath): This method is still prone to some race conditions. Fix it before
Correct. I don't think the ConcurrentHashMaps will help us because we need to atomically update
both maps sometimes.


Line 71:         storageIdGenerator.put(host, new AtomicInteger(diskId));
A couple of races here, as you already pointed out.


http://gerrit.cloudera.org:8080/#/c/5148/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 294:   // Keeps track of all the parition paths that are filtered to add block metadata.
Use this comment style for classes and functions:

/**
 *
 */


Line 368:         if (!pathMd.filteredParitionPaths.contains(fileStatus.getPath())) continue;
Looks like this contains() and the indexOf() below can become very expensive.

Instead of the 3 arrays in pathMd we should consider using a HashMap from path to a class
that contains the relevant metadata (FD and format).
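
Roughly like this (just a sketch; the value class and its fields are placeholders, not the
actual catalog types):

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.fs.Path;

  // Sketch of the suggested structure: one map lookup per file instead of
  // contains()/indexOf() over parallel lists.
  public class PathMetadataSketch {
    static class FileMeta {
      final Object fileDescriptor;  // stand-in for the catalog's FD type
      final Object fileFormat;      // stand-in for the partition's file format
      FileMeta(Object fd, Object format) { fileDescriptor = fd; fileFormat = format; }
    }

    final Map<Path, FileMeta> pathMd = new HashMap<>();
  }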


Line 423:     for (int i=0; i < fileMd.pathFds.size(); ++i) {
i = 0 (add spaces)


Line 458:       // Only build the storage IDs fs instances that support the BlockLocation API.
... for fs instances ...


Line 720:   private void loadAllPartitions(
After reading through this a few times I'm still confused about the high-level code flow.
It would be great to summarize the high-level steps in the function comment.

My understanding is that we do something like this:
1. createPartition() for every partition; that populates pathMd; internally createPartition()
lists all files in the corresponding partition directory
2. load the block location metadata based on the contents of pathMd

The process above seems a bit backwards to me. In the first step we list all files under the
partition's directory... and then we do the same thing again in step 2.

Intuitively, I was thinking it should look something like this:

1. collect all dirs to call DFS.listFiles() on
2. for each dir, call listFiles(), examine the contents and map them to an msPartition. Based
on that complete info do some filtering/checks and finally createPartition().

Another chat would probably help me understand the ideas here better.
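
A very rough sketch of that flow (collectPartitionDirs(), mapToMsPartition() and
createPartitionFor() are made-up stand-ins for the existing catalog logic):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.LocatedFileStatus;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.RemoteIterator;

  abstract class PartitionLoadSketch {
    abstract List<Path> collectPartitionDirs();                                   // step 1
    abstract Object mapToMsPartition(Path dir, List<LocatedFileStatus> files);
    abstract void createPartitionFor(Object msPartition, List<LocatedFileStatus> files);

    void loadAllPartitions(FileSystem fs) throws IOException {
      for (Path dir : collectPartitionDirs()) {
        // Step 2: a single listing pass per directory; each LocatedFileStatus already
        // carries its block locations, so nothing needs to be listed twice.
        List<LocatedFileStatus> files = new ArrayList<>();
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, /*recursive=*/ false);
        while (it.hasNext()) files.add(it.next());
        // Map the directory contents to an msPartition, do the filtering/checks, and
        // only then create the catalog partition.
        Object msPartition = mapToMsPartition(dir, files);
        if (msPartition == null) continue;  // filtered out
        createPartitionFor(msPartition, files);
      }
    }
  }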


Line 777:         if (dirsToLoad.contains(partDir) ||
The positive form is more readable to me:

if (!dirsToLoad.contains() && isChildPath()) {
  // Partition with a non-default data directory.
  dirsToLoad.add();
}


http://gerrit.cloudera.org:8080/#/c/5148/2/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 275:    * Returns true if the filesystem supports BlockLocation API.
What API is it specifically? LocatedFileStatus.getBlockLocations()?
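
For context, the Hadoop call I assume this refers to (the directory argument is just an
example):

  import java.io.IOException;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.LocatedFileStatus;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.RemoteIterator;

  class BlockLocationExample {
    // listFiles() returns LocatedFileStatus objects whose block locations are already
    // populated, so no extra getFileBlockLocations() round trip is needed.
    static void printBlockLocations(FileSystem fs, Path dir) throws IOException {
      RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, /*recursive=*/ true);
      while (it.hasNext()) {
        for (BlockLocation loc : it.next().getBlockLocations()) {
          System.out.println(loc);  // hosts and storage IDs feed the disk-id mapping
        }
      }
    }
  }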


Line 277:   public static boolean supportsBlockLocationAPI(FileSystem fs) {
supportsBlockLocationApi


Line 406:   public static boolean isPathChild(Path p1, Path p2) {
isChildPath(Path p, Path parent)


Line 408:     Path parentDir = p1.getParent();
I think this can be improved by using Path.depth(): call getParent() a fixed number of
times and then do one equals() comparison.

(Basically you walk up the path starting at p1 until the parentDir is at the same depth as p2.)
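
A sketch of what I mean, using the rename suggested above (assumes descendant semantics):

  import org.apache.hadoop.fs.Path;

  class PathUtilSketch {
    // Walk p up by the depth difference, then do a single equals() against parent.
    static boolean isChildPath(Path p, Path parent) {
      if (p == null || parent == null) return false;
      int levels = p.depth() - parent.depth();
      if (levels <= 0) return false;  // p must be strictly below parent
      Path ancestor = p;
      for (int i = 0; i < levels; ++i) ancestor = ancestor.getParent();
      return parent.equals(ancestor);
    }
  }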


-- 
To view, visit http://gerrit.cloudera.org:8080/5148
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bharathv@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bharathv@cloudera.com>
Gerrit-HasComments: Yes
