impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dimitris Tsirogiannis (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5429: Multi threaded block metadata loading
Date Wed, 11 Oct 2017 22:10:27 GMT
Dimitris Tsirogiannis has posted comments on this change. ( http://gerrit.cloudera.org:8080/8235
)

Change subject: IMPALA-5429: Multi threaded block metadata loading
......................................................................


Patch Set 5:

(22 comments)

http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@215
PS5, Line 215:   // Block metadata loading stats for a single HDFS path.
nit: File/block (since we're also loading/refreshing files). Also, you may want to change
the name of the private class to reflect that, e.g. PathMetadataLoadingStats or something
like that.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@217
PS5, Line 217: loadedFiles_
You may want to add a comment here. What is loaded vs refreshed? Is the one superset of the
other. Good to clarify to avoid confusion.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@218
PS5, Line 218: _
I believe the convention is that we don't use '_' for public members.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@231
PS5, Line 231: Runnable
I don't know how this is used later on, but alternatively you can make PathBlockMetadataLoadRequest
implement Callable<PathBLockMdLoadingStats>, hence returning the stats when calling
call(). Now you seem to access the stats only through the debugString() function.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@247
PS5, Line 247: Blocks on the loadBlockMetadata() call. 
Not following this comment. run() either calls refreshBlockMetadata() or loadBlockMetadata(),
so I can't really interpret what this comment is saying.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@333
PS5, Line 333: loadBlockMetadata
I know I am guilty for some of these names but maybe rename this to "resetAndLoadBlockMetadata"?
Then it is more clear what the differences are between this function and rerfreshBlockMetadata().


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@363
PS5, Line 363: numUnknownDiskIds
Are you overriding or incrementing the value of numUnknownDiskIds in the create() function?


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@368
PS5, Line 368: newFileDescs
Hm, that doesn't seem particularly safe (i.e. using the same list for every partition). Are
we certain that any other partition modification operation (e.g. alter partition set location)
will not try to override the file descriptors, thereby affecting all the other partitions?


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@380
PS5, Line 380: changed
             :    * mtime
It's not just the changed mtime that we're looking for in order to determiner modification,
so you may want to either remove this or mention all the criteria we use. I prefer the former.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@383
PS5, Line 383: The initial table load still uses the listFiles()
             :    * on the data directory that fetches both the FileStatus as well as BlockLocations
in
             :    * a single call.
remove


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@398
PS5, Line 398: get(0)
Comment why we take the file descriptors of one partition and that it doesn't really matter
which one we choose.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@399
PS5, Line 399: public String apply(FileDescriptor desc) {
             :               return desc.getFileName();
             :             }
Add @Override and make it a single line (if it fits)


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@448
PS5, Line 448: (fd == null) || (fd.getFileLength() != status.getLen()) ||
             :       (fd.getModificationTime() != status.getModificationTime());
I think we were also checking if the partition was cached. Isn't this check needed anymore?


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@455
PS5, Line 455: refreshFileMetadata
Maybe call it refreshPartitionStorageMetadata? Overall, it may make sense to replace "File/Block"
with "Storage" whenever it makes sense. Thoughts?


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@694
PS5, Line 694: Exception
Why this change?


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@805
PS5, Line 805: ExecutorService partitionLoadingPool = Executors.newFixedThreadPool(threadPoolSize);
What is the benefit of having a one thread pool per loaded table instead of a global thread
pool that all tables can use? In the former approach, it will be hard to impose any constraints
on the maximum number of concurrent load operations. Also, there is some overhead in initializing
and tearing down the thread pool which you pay every time a table is refreshed.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@821
PS5, Line 821: debugString
This will get information from loadingStats but how do you know that loadingStats is not null?
I think it may be safer to check and call debugString in the catch section of run() and don't
make any assumptions about the internal state of PathBlockMetadataLoadRequest here. If you
do that, then you don't even need a Map in L807, only a collection of Futures suffices.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1237
PS5, Line 1237: HashMap
nit: Map


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1303
PS5, Line 1303: Returns the HdfsPartition objects associated with the specified list of partition
              :    * names.
Update comment.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java:

http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@626
PS5, Line 626: private Table addHdfsPartitions(Table tbl, List<Partition> partitions)
             :       throws CatalogException {
             :     Preconditions.checkNotNull(tbl);
             :     Preconditions.checkNotNull(partitions);
             :     if (!(tbl instanceof HdfsTable)) {
             :       throw new CatalogException("Table " + tbl.getFullName() + " is not an
HDFS table");
             :     }
             :     HdfsTable hdfsTable = (HdfsTable) tbl;
             :     List<HdfsPartition> hdfsPartitions = hdfsTable.createAndLoadPartitions(partitions);
             :     for (HdfsPartition hdfsPartition: hdfsPartitions) {
             :       catalog_.addPartition(hdfsPartition);
             :     }
             :     return hdfsTable;
             :   }
Lot's of duplication with the previous one. How expensive would it be to simply have something
like:
addHdfsPartition(Table tbl, Partition partition) {
 addHdfsPartitions(tbl, List.newArrayList(partition));
}


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/util/ListMap.java
File fe/src/main/java/org/apache/impala/util/ListMap.java:

http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/util/ListMap.java@59
PS5, Line 59: synchronized (list_) {
Why synchronized on list_? This confuses the locking scheme. If you want you can simply do
synchronized (this). Same in L76.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/util/ListMap.java@78
PS5, Line 78: list_ = Collections.synchronizedList(list);
Hm, I think this new implementation is way too complicated. I'd start with simply making the
getIndex and populate methods synchronized. If we see this becoming a bottleneck, then try
to optimize it.



-- 
To view, visit http://gerrit.cloudera.org:8080/8235
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I07eaa7151dfc4d56da8db8c2654bd65d8f808481
Gerrit-Change-Number: 8235
Gerrit-PatchSet: 5
Gerrit-Owner: Bharath Vissapragada <bharathv@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bharathv@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Gerrit-Reviewer: Jim Apple <jbapple-impala@apache.org>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <vercegovac@cloudera.com>
Gerrit-Comment-Date: Wed, 11 Oct 2017 22:10:27 +0000
Gerrit-HasComments: Yes

Mime
  • Unnamed multipart/alternative (inline, 8-Bit, 0 bytes)
View raw message