impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dimitris Tsirogiannis (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-5429: Multi threaded block metadata loading
Date Wed, 11 Oct 2017 22:10:27 GMT
Dimitris Tsirogiannis has posted comments on this change. (

Change subject: IMPALA-5429: Multi threaded block metadata loading

Patch Set 5:

File fe/src/main/java/org/apache/impala/catalog/
PS5, Line 215:   // Block metadata loading stats for a single HDFS path.
nit: File/block (since we're also loading/refreshing files). Also, you may want to change
the name of the private class to reflect that, e.g. PathMetadataLoadingStats or something
like that.
PS5, Line 217: loadedFiles_
You may want to add a comment here. What is loaded vs refreshed? Is the one superset of the
other. Good to clarify to avoid confusion.
PS5, Line 218: _
I believe the convention is that we don't use '_' for public members.
PS5, Line 231: Runnable
I don't know how this is used later on, but alternatively you can make PathBlockMetadataLoadRequest
implement Callable<PathBLockMdLoadingStats>, hence returning the stats when calling
call(). Now you seem to access the stats only through the debugString() function.
PS5, Line 247: Blocks on the loadBlockMetadata() call. 
Not following this comment. run() either calls refreshBlockMetadata() or loadBlockMetadata(),
so I can't really interpret what this comment is saying.
PS5, Line 333: loadBlockMetadata
I know I am guilty for some of these names but maybe rename this to "resetAndLoadBlockMetadata"?
Then it is more clear what the differences are between this function and rerfreshBlockMetadata().
PS5, Line 363: numUnknownDiskIds
Are you overriding or incrementing the value of numUnknownDiskIds in the create() function?
PS5, Line 368: newFileDescs
Hm, that doesn't seem particularly safe (i.e. using the same list for every partition). Are
we certain that any other partition modification operation (e.g. alter partition set location)
will not try to override the file descriptors, thereby affecting all the other partitions?
PS5, Line 380: changed
             :    * mtime
It's not just the changed mtime that we're looking for in order to determiner modification,
so you may want to either remove this or mention all the criteria we use. I prefer the former.
PS5, Line 383: The initial table load still uses the listFiles()
             :    * on the data directory that fetches both the FileStatus as well as BlockLocations
             :    * a single call.
PS5, Line 398: get(0)
Comment why we take the file descriptors of one partition and that it doesn't really matter
which one we choose.
PS5, Line 399: public String apply(FileDescriptor desc) {
             :               return desc.getFileName();
             :             }
Add @Override and make it a single line (if it fits)
PS5, Line 448: (fd == null) || (fd.getFileLength() != status.getLen()) ||
             :       (fd.getModificationTime() != status.getModificationTime());
I think we were also checking if the partition was cached. Isn't this check needed anymore?
PS5, Line 455: refreshFileMetadata
Maybe call it refreshPartitionStorageMetadata? Overall, it may make sense to replace "File/Block"
with "Storage" whenever it makes sense. Thoughts?
PS5, Line 694: Exception
Why this change?
PS5, Line 805: ExecutorService partitionLoadingPool = Executors.newFixedThreadPool(threadPoolSize);
What is the benefit of having a one thread pool per loaded table instead of a global thread
pool that all tables can use? In the former approach, it will be hard to impose any constraints
on the maximum number of concurrent load operations. Also, there is some overhead in initializing
and tearing down the thread pool which you pay every time a table is refreshed.
PS5, Line 821: debugString
This will get information from loadingStats but how do you know that loadingStats is not null?
I think it may be safer to check and call debugString in the catch section of run() and don't
make any assumptions about the internal state of PathBlockMetadataLoadRequest here. If you
do that, then you don't even need a Map in L807, only a collection of Futures suffices.
PS5, Line 1237: HashMap
nit: Map
PS5, Line 1303: Returns the HdfsPartition objects associated with the specified list of partition
              :    * names.
Update comment.
File fe/src/main/java/org/apache/impala/service/
PS5, Line 626: private Table addHdfsPartitions(Table tbl, List<Partition> partitions)
             :       throws CatalogException {
             :     Preconditions.checkNotNull(tbl);
             :     Preconditions.checkNotNull(partitions);
             :     if (!(tbl instanceof HdfsTable)) {
             :       throw new CatalogException("Table " + tbl.getFullName() + " is not an
HDFS table");
             :     }
             :     HdfsTable hdfsTable = (HdfsTable) tbl;
             :     List<HdfsPartition> hdfsPartitions = hdfsTable.createAndLoadPartitions(partitions);
             :     for (HdfsPartition hdfsPartition: hdfsPartitions) {
             :       catalog_.addPartition(hdfsPartition);
             :     }
             :     return hdfsTable;
             :   }
Lot's of duplication with the previous one. How expensive would it be to simply have something
addHdfsPartition(Table tbl, Partition partition) {
 addHdfsPartitions(tbl, List.newArrayList(partition));
File fe/src/main/java/org/apache/impala/util/
PS5, Line 59: synchronized (list_) {
Why synchronized on list_? This confuses the locking scheme. If you want you can simply do
synchronized (this). Same in L76.
PS5, Line 78: list_ = Collections.synchronizedList(list);
Hm, I think this new implementation is way too complicated. I'd start with simply making the
getIndex and populate methods synchronized. If we see this becoming a bottleneck, then try
to optimize it.

To view, visit
To unsubscribe, visit

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I07eaa7151dfc4d56da8db8c2654bd65d8f808481
Gerrit-Change-Number: 8235
Gerrit-PatchSet: 5
Gerrit-Owner: Bharath Vissapragada <>
Gerrit-Reviewer: Bharath Vissapragada <>
Gerrit-Reviewer: Dimitris Tsirogiannis <>
Gerrit-Reviewer: Jim Apple <>
Gerrit-Reviewer: Mostafa Mokhtar <>
Gerrit-Reviewer: Vuk Ercegovac <>
Gerrit-Comment-Date: Wed, 11 Oct 2017 22:10:27 +0000
Gerrit-HasComments: Yes

  • Unnamed multipart/alternative (inline, 8-Bit, 0 bytes)
View raw message