impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
Date Tue, 22 Nov 2016 05:30:47 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading

Patch Set 2:


High-level comments. Let's get through them first and then I can dig in deeper.
File fe/src/main/java/org/apache/impala/catalog/

Line 27:  * This class encapsulates the logic required for mapping HDFS
Much fluff, simplify to: Maps HDFS storage-UUIDs to per-host 0-based, sequential disk ids
required by Impala backends.

Line 29:  * by the backend.
required by Impala backends

Line 31: public class DiskIdMapper {
Make this a singleton to clarify the intended use.

Line 36:     // This is intentionally implemented as a static object to share the storageIDs
move this part to the class comment

Line 38:     // consistent across all the tables, without which, we can potentially confuse
also mention that the static solution requires the least amount of memory

Line 48:     private static ConcurrentHashMap<String, AtomicInteger>
Don't think we need a ConcurrentHashMap or AtomicInteger here, because whenever we update
this entry, we also need to protect storageIdToInt, so we need a coarser-grained lock.

Line 53:      * TODO (Bharath): This method is still prone to some race conditions. Fix it
Correct. I don't think the ConcurrentHashMaps will help us because we need to atomically update
both maps sometimes.

Line 71:         storageIdGenerator.put(host, new AtomicInteger(diskId));
Couple of races here like you already pointed out.
File fe/src/main/java/org/apache/impala/catalog/

Line 294:   // Keeps track of all the parition paths that are filtered to add block metadata.
this comment style for classes and functions


Line 368:         if (!pathMd.filteredParitionPaths.contains(fileStatus.getPath())) continue;
Looks like this contains() and the indexOf() below can become very expensive.

Instead of the 3 arrays in pathMd we should consider using a HashMap from path to a class
that contains the relevant metadata (FD and format).

Line 423:     for (int i=0; i < fileMd.pathFds.size(); ++i) {
i = 0 (add spaces)

Line 458:       // Only build the storage IDs fs instances that support the BlockLocation
... for fs instances ...

Line 720:   private void loadAllPartitions(
After reading through this a few times I'm still confused about the high-level code flow.
It would be great to summarize the high-level steps in the function comment.

My understanding is that we do something like this:
1. createPartition() for every partition; that populates pathMd; internally createPartition()
lists all files in the corresponding partition directory
2. load the block location metadata based on the contents of pathMd

The process above seems a bit backwards to me. In the first step we list all files under the
partition's directory... and then we do the same thing again in step 2.

Intuitively, I was thinking it should look something like this:

1. collect all dirs to call DFS.listFiles() on
2. for each dir, call listFiles(), examine the contents and map them to an msPartition. Based
on that complete info do some filtering/checks and finally createPartition().

Another chat would probably help me understand the ideas here better.

Line 777:         if (dirsToLoad.contains(partDir) ||
positive is more readable to me

if (!dirsToLoad.contains() && isChildPath()) {
  // Partition with a non-default data directory.
File fe/src/main/java/org/apache/impala/common/

Line 275:    * Returns true if the filesystem supports BlockLocation API.
What API is it specifically? LocatedFileStatus.getBlockLocations()?

Line 277:   public static boolean supportsBlockLocationAPI(FileSystem fs) {

Line 406:   public static boolean isPathChild(Path p1, Path p2) {
isChildPath(Path p, Path parent)

Line 408:     Path parentDir = p1.getParent();
I think this can be improved by using the Path.getDepth(). You can do getParent() a fixed
number of times and then do one equals() comparison.

(Basically you walk up the path starting at p1 until the parentDir is at the same depth as

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Bharath Vissapragada <>
Gerrit-HasComments: Yes

View raw message