impala-reviews mailing list archives

From "Bharath Vissapragada (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-5429: Multi threaded block metadata loading
Date Sun, 22 Oct 2017 20:06:01 GMT
Hello Jim Apple, Dimitris Tsirogiannis, Mostafa Mokhtar, Alex Behm, Vuk Ercegovac, 

I'd like you to reexamine a change. Please visit

to look at the new patch set (#8).

Change subject: IMPALA-5429: Multi threaded block metadata loading

IMPALA-5429: Multi threaded block metadata loading

Implements multi-threaded block metadata loading on the Catalog
server, where we fetch block metadata for multiple partitions of a
single table in parallel. The number of threads used to load the
metadata is controlled by the following two parameters (set at Catalog
server startup and applied to each table load):


We use different thread pool sizes for HDFS and non-HDFS tables since
non-HDFS filesystems support a much higher throughput of RPC calls for
listStatus/listFiles. Based on our experiments, S3 showed a linear
speedup (up to ~113x) with an increasing number of loading threads,
whereas the HDFS speedup was limited to ~5x in unsecured clusters and
~3.7x in secure clusters. We narrowed this down to scalability
bottlenecks in the HDFS RPC implementation (HADOOP-14558) on both the
server and the client side.
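
For readers who want a picture of the approach described above, here is a
minimal sketch of per-partition loading on a fixed-size thread pool. It is
not the code from this patch: the class ParallelPartitionLoader, its
methods, and the loadOne callback are hypothetical names used only for
illustration, and the pool sizes are passed in as plain arguments rather
than read from the Catalog server parameters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch of parallel per-partition loading; not the patch's code. */
class ParallelPartitionLoader {

  /**
   * Picks a pool size from two configured values (assumed to mirror the
   * HDFS and non-HDFS settings) and never exceeds the partition count.
   */
  static int choosePoolSize(boolean isHdfs, int hdfsPoolSize,
      int nonHdfsPoolSize, int numPartitions) {
    int configured = isHdfs ? hdfsPoolSize : nonHdfsPoolSize;
    return Math.max(1, Math.min(configured, numPartitions));
  }

  /**
   * Runs loadOne for every partition path on a fixed-size pool and returns
   * the results in the same order as the input paths.
   */
  static <R> List<R> loadAll(List<String> partitionPaths, int poolSize,
      Function<String, R> loadOne) {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, poolSize));
    try {
      List<Future<R>> futures = new ArrayList<>();
      for (String path : partitionPaths) {
        Callable<R> task = () -> loadOne.apply(path);
        futures.add(pool.submit(task));
      }
      List<R> results = new ArrayList<>();
      for (Future<R> f : futures) {
        results.add(f.get());  // blocks until this partition is loaded
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while loading", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Partition load failed", e.getCause());
    } finally {
      pool.shutdown();  // the pool lives only for this one table load
    }
  }
}
```

In this sketch the pool is created and shut down per call, which matches
the description of the setting applying to each table load; whether the
patch reuses a pool across loads is not stated here.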

Note that the thread-pool-based metadata fetching is implemented only
for loading HDFS block metadata and not for loading HMS partition
information. Our experiments showed that while loading large
partitioned tables, ~90% of the time is spent connecting to the NN and
loading the HDFS block information; optimizing the remaining ~10%
would make the code unnecessarily complex without much gain.
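
As a rough illustration of that trade-off, Amdahl's law gives an upper
bound on what optimizing the HMS portion alone could buy. This is only a
back-of-the-envelope estimate: it assumes the ~10% figure is the fraction
p of an unparallelized table load spent outside block metadata loading,
and s is a hypothetical speedup applied to that fraction only.

```latex
S(s) = \frac{1}{(1 - p) + p/s},
\qquad
\lim_{s \to \infty} S(s) = \frac{1}{1 - p} = \frac{1}{0.9} \approx 1.11
\quad (p = 0.1)
```

Under these assumptions, even an infinitely fast HMS loading path would
shorten the overall load by only about 10%.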

Additional notes:

- The multithreading approach is implemented for the following code
paths:
  * INVALIDATE (loading from scratch)
  * REFRESH (reusing existing metadata)

- This patch makes the implementation of ListMap thread-safe since
we use that data structure as shared state between multiple partition
metadata loading threads.
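
To make the ListMap point concrete, the sketch below shows one way such a
structure can be made safe for concurrent loaders. It assumes a ListMap
assigns each distinct entry a stable integer index and supports lookup in
both directions; that behaviour, the class name ThreadSafeListMap, and its
methods are assumptions for illustration and are not taken from this
patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical thread-safe index/entry mapping; not the patch's code. */
class ThreadSafeListMap<T> {
  // Entries in insertion order; the position in the list is the entry's index.
  private final List<T> entries = new ArrayList<>();
  // Reverse lookup from entry to its index.
  private final ConcurrentHashMap<T, Integer> indexes = new ConcurrentHashMap<>();

  /** Returns the index of t, appending t first if it is not present yet. */
  int getIndex(T t) {
    Integer existing = indexes.get(t);  // fast path without taking the lock
    if (existing != null) return existing;
    synchronized (entries) {
      // Re-check under the lock: another thread may have added t meanwhile.
      existing = indexes.get(t);
      if (existing != null) return existing;
      entries.add(t);
      int newIndex = entries.size() - 1;
      indexes.put(t, newIndex);
      return newIndex;
    }
  }

  /** Returns the entry stored at the given index. */
  T getEntry(int index) {
    synchronized (entries) {
      return entries.get(index);
    }
  }

  /** Returns the number of distinct entries added so far. */
  int size() {
    synchronized (entries) {
      return entries.size();
    }
  }
}
```

The re-check inside the synchronized block is what keeps two threads that
race on the same new entry from assigning it two different indexes.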

Testing and Results:

- This patch doesn't add any new tests since there is enough test
coverage already. Passed core/exhaustive runs with HDFS/S3.

- We noticed up to ~113x speedup on S3 tables (thread_pool_size=160),
up to ~5x speedup in unsecured HDFS clusters, and ~3.7x in secure
HDFS clusters.

- Synthesized the following two large tables on HDFS and S3 and noticed
a significant reduction in the run time of my test DDL queries.

  (1) 100K partitions + 1 million files
  (2) 80 partitions + 250K files

 80-PARTITIONS-250K-FILES-11-REFRESH-PARTITION            I -23.57%
 80-PARTITIONS-250K-FILES-S3-08-ADD-PARTITION             I -23.87%
 80-PARTITIONS-250K-FILES-09-INVALIDATE                   I -24.88%
 80-PARTITIONS-250K-FILES-03-RECOVER                      I -35.90%
 80-PARTITIONS-250K-FILES-07-REFRESH                      I -43.03%
 100K-PARTITIONS-1M-FILES-CUSTOM-07-REFRESH               I -49.02%
 80-PARTITIONS-250K-FILES-05-QUERY-AFTER-INV              I -49.05%
 80-PARTITIONS-250K-FILES-S3-03-RECOVER                   I -67.17%
 80-PARTITIONS-250K-FILES-S3-05-QUERY-AFTER-INV           I -76.45%
 80-PARTITIONS-250K-FILES-S3-07-REFRESH                   I -87.04%

Change-Id: I07eaa7151dfc4d56da8db8c2654bd65d8f808481
M be/src/catalog/
M be/src/util/
M common/thrift/BackendGflags.thrift
M fe/src/main/java/org/apache/impala/catalog/
M fe/src/main/java/org/apache/impala/catalog/
M fe/src/main/java/org/apache/impala/service/
M fe/src/main/java/org/apache/impala/service/
M fe/src/main/java/org/apache/impala/service/
M fe/src/main/java/org/apache/impala/util/
9 files changed, 461 insertions(+), 245 deletions(-)

  git pull ssh:// refs/changes/35/8235/8

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I07eaa7151dfc4d56da8db8c2654bd65d8f808481
Gerrit-Change-Number: 8235
Gerrit-PatchSet: 8
Gerrit-Owner: Bharath Vissapragada <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Bharath Vissapragada <>
Gerrit-Reviewer: Dimitris Tsirogiannis <>
Gerrit-Reviewer: Jim Apple <>
Gerrit-Reviewer: Mostafa Mokhtar <>
Gerrit-Reviewer: Vuk Ercegovac <>
