impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
Date Wed, 30 Nov 2016 00:43:10 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading

Patch Set 11:

File fe/src/main/java/org/apache/impala/catalog/

Line 29:  * ids. This is internally implemented as a static object to share the storageIDs
Several sentences Starting with "This", sometimes not clear what "This" references.

Would be nice to list the reasons in bullet point, something like this.

The storage UUID to disk id is global (shared across tables) for the following reasons:
1. Mention consistent scan range to disk thread assignment
2. Mention memory benefit
3. any more?

Line 44:     // sequential 0-based integer disk id used by the hdfs scanners. This assumes
used by the BE scanners.

Line 46:     private ConcurrentHashMap<String, Integer> storageIdToInt =

Line 50:     // with the corresponding latest 0-based integer ID. Given a storage UUID, we
with -> to

Line 51:     // look in 'storageIdtoInt' map if it already exists, else we generate a new
ID for
no need for this second sentence, that's what we do in getDiskId()

Line 64:       if (Strings.isNullOrEmpty(storageUuid)) return diskId;
let's move this check to the caller and require the storageUuid to be non-null and non-empty
in here

Line 68:       // across the cluster, it is possible that we have a good hit rate.
it is possible -> we expect to have a good hit rate.

Line 72:         // At this point, it could be possible due to race, that storageId might've
// Mapping might have been added by another thread that entered the synchronized block first.

Line 76:         // No mapping exists, create a new disk ID for 'storageId'

Line 80:           // Host hasn't been populated yet. Start with a 0-based id.
// First disk id of this host.
File fe/src/main/java/org/apache/impala/catalog/

Line 289:    * given directory 'dirPath'. Only blocks corresponding to paths from 'partsByPath'
Mention that the block metadata is effectively dropped and re-created.

Please provide a brief summary of the steps involved in this function. something like:
1. Clear block metadata of partitions
2. Call HDFS to list all files
3. etc.

Line 307:       RemoteIterator<LocatedFileStatus> fileStatusIter = fs.listFiles(dirPath,
move right before the while loop

Line 308:       // Reset the current state of partitions under dirPath from perPartitionFileDescMap_
Reset -> Clear

Line 310:       for (Path partDir: partsByPath.keySet()) {
Iterate over the entrySet(), you are using an unnecessary get(partDir) below.

Line 321:         String fileName = fileStatus.getPath().getName();
move closer to where it's used, i.e. right above L336

Line 322:         if (!FileSystemUtil.isValidPartitionFile(fileStatus)) continue;

Line 323:         // At this point, file status points to a valid file under the directory.
Now we
// Find the partition that this file belongs (if any).

Line 325:         Path partPathDir = fileStatus.getPath().getParent();
move closer to where it's used, i.e. L357

Line 330:         // Skip if this file doesn't correspond to any of the existing partitions.
Skip if this file does not belong to any known partition.

Line 332:           LOG.debug("File " + fileStatus.getPath().toString() + " doesn't correspond
" +
add if to check log level

Line 333:               " to a valid partition. Skipping metadata load for this file.");
to a known partition

Line 361:         Preconditions.checkState(perPartitionFileDescMap_.containsKey(partPathDirName));
avoid duplicate lookup with get() + Preconditions.checkNotNull()

Line 363:         // Update the partitions metadata corresponding to this file descriptor
partitions' metadata that this file belongs to.

Line 371:         LOG.warn("Unknown disk id count for filesystem " + fs + ":" + unknownDiskIdCount);
add if to check log level

Line 374:       throw new RuntimeException("Error loading block md for directory "
block metadata

Line 380:    * Loads the disk IDs for BlockLocation 'location' and its corresponding file
Comment should mention the storage UUID to disk id mapping.

Line 386:     String[] hosts;
Do you know if these are the unique hosts? I'm wondering if storageIds and hosts are guaranteed
to have the same length.

Line 394:     for (int i =0; i < storageIds.length; ++i) {
int i = 0; (space after "=")

Line 410:     for (Path partPath: partsByPath.keySet()) {
iterate over Map.Entry, you are doing a get() in the next line

Line 412:       // For each valid file in the partPath, synthesize the block metadata.
For each data file in the partPath...

Line 427:         // We assume all the partitions corresponding to the same partition path,
For the purpose of synthesizing block metadata, we assume that all partitions with the same
location have the same file format.

Line 447:           perPartitionFileDescMap_.put(partition.getLocation(), fileDescMap);
move put() inside the if (fileDescMap == null) block

Line 673:    * Create HdfsPartition objects corresponding to 'msPartitions' and add them to
Briefly list the high-level steps, in particular that we first create the partition objects
based on the HMS partitions, and then in a second step we populate the partitions' file metadata

Line 742:         // Check if the partition directory is a child dir of tbl directory location.
remove, comment only states what the code does

Line 752:     // Load the block metadata
remove comment (doesn't add anything)

Line 764:       throw new CatalogException("Error loading block location md for partition
" +
md -> metadata

Line 827:   public HdfsPartition createPartition(StorageDescriptor storageDescriptor,
change this to createAndLoadPartition()? then we can get rid of the flag in createPartition()
and this function becomes createPartition() + loadMetadataAndDiskIds()
File fe/src/main/java/org/apache/impala/common/

Line 276:    * Returns true if the file corresponding to 'fileStatus' is a valid file as per
is a valid data file

Line 278:    * directory or a hidden file starting with . or _ or if it is a LZO index file.
hidden file is enough, no need to mention specific prefixes

Line 291:   public static boolean supportsBlockLocationApi(FileSystem fs) {

The API part is confusing because we are not really referencing a specific API. If you prefer
to reference the API, then let's reference a specific method.

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Bharath Vissapragada <>
Gerrit-Reviewer: Mostafa Mokhtar <>
Gerrit-HasComments: Yes

View raw message